[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792996#action_12792996 ]
Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
bq. But, that's where Lucy presumably takes a perf hit. Lucene can share these in RAM, not using the filesystem as the intermediary (eg we do that today with deletions; norms/field cache/eventual CSF can do the same.) Lucy must go through the filesystem to share.

For a flush(), I don't think there's a significant penalty. The only extra costs Lucy will pay are the bookkeeping costs to update the file system state and to create the objects that read the index data. Those are real, but since we're skipping the fsync(), they're small. As far as the actual data, I don't see that there's a difference.
{quote}

But everything must go through the filesystem with Lucy... Eg, with Lucene, deletions are not written to disk until you commit. Flush doesn't write the del file, merging doesn't, etc. The deletes are carried in RAM. We could (but haven't yet -- NRT turnaround time is already plenty fast) do the same with norms, field cache, terms dict index, etc.

{quote}
Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM. Right, for instance because we generally can't force the OS to pin term dictionaries in RAM, as discussed a while back. It's not an ideal situation, but Lucene's approach isn't bulletproof either, since Lucene's term dictionaries can get paged out too.
{quote}

As long as the page is hot... (in both cases!). But by using file-backed RAM (not malloc'd RAM), you're telling the OS it's OK if it chooses to swap it out. Sure, malloc'd RAM can be swapped out too... but that should be less frequent (and we can control this behavior somewhat, eg via swappiness). It's similar to using a weak vs. strong reference in Java. By using file-backed RAM you tell the OS it's fair game for swapping.

{quote}
If we have to fsync(), there'll be a cost, but in Lucene you have to pay that same cost, too. Lucene expects to get around it with IndexWriter.getReader().
In Lucy, we'll get around it by having you call flush() and then reopen a reader somewhere, often in another process. In both cases, the availability of fresh data is decoupled from the fsync. In both cases, the indexing process has to be careful about dropping data on the floor before a commit() succeeds. In both cases, it's possible to protect against unbounded corruption by rolling back to the last commit.
{quote}

The two approaches are basically the same, so, we get the same features ;) It's just that Lucy uses the filesystem for sharing, and Lucene shares through RAM.

bq. We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that.

I guess my confusion is: what are all the other benefits of using file-backed RAM? You can efficiently use process-only concurrency (though shared memory is technically an option for this too), and you have wicked fast open times (but you still must warm, just like Lucene). What else? Oh, maybe the ability to inform the OS *not* to cache eg the reads done when merging segments. That's one I sure wish Lucene could use...

In exchange, you risk the OS making poor choices about what gets swapped out (LRU policy is too simplistic... not all pages are created equal), must downcast all data structures to file-flat, and must share everything through the filesystem (perf hit to NRT). I do love how pure the file-backed RAM approach is, but I worry that down the road it'll result in erratic search performance in certain app profiles.

{quote}
bq. But, also risky is that all important data structures must be "file-flat", though in practice that doesn't seem like an issue so far?

It's a constraint. For instance, to support mmap, string sort caches currently require three "files" each: ords, offsets, and UTF-8 character data.
{quote}

Yeah, that you need 3 files for the string sort cache is a little spooky... that's 3X the chance of a page fault.
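The three-"file" sort cache described above can be sketched concretely. This is a minimal illustration under assumed details, not Lucy's actual on-disk format: the file names (ords, offsets, data), the little-endian int32 packing, and the value_for_doc helper are all hypothetical. The point is the shape of the lookup -- doc id to ord, ord to byte offsets, offsets to a UTF-8 slice -- with each array memory-mapped so the OS, not the process heap, owns residency:

```python
import mmap
import os
import struct
import tempfile

# Hypothetical file-flat layout (not Lucy's real format): "ords" holds
# each doc's position in sorted value order; "offsets" holds the byte
# offset of each unique value inside the concatenated UTF-8 "data" file.
values = ["apple", "banana", "cherry"]   # unique values, already sorted
doc_ords = [2, 0, 1, 0]                  # per-document ord into values

data_bytes = b"".join(v.encode("utf-8") for v in values)
offsets, pos = [], 0
for v in values:
    offsets.append(pos)
    pos += len(v.encode("utf-8"))
offsets.append(pos)                      # sentinel: end of last value

tmp = tempfile.mkdtemp()

def write(name, payload):
    path = os.path.join(tmp, name)
    with open(path, "wb") as f:
        f.write(payload)
    return path

ords_path = write("ords", struct.pack("<%di" % len(doc_ords), *doc_ords))
offs_path = write("offsets", struct.pack("<%di" % len(offsets), *offsets))
data_path = write("data", data_bytes)

def map_file(path):
    # File-backed RAM: the OS may drop these pages whenever it likes.
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

ords_mm, offs_mm, data_mm = (map_file(p) for p in (ords_path, offs_path, data_path))

def value_for_doc(doc_id):
    # Three page touches in the worst case -- one per "file", as noted
    # above. The decode() is the UTF-8 sanity check paid on escape.
    (o,) = struct.unpack_from("<i", ords_mm, doc_id * 4)
    start, end = struct.unpack_from("<2i", offs_mm, o * 4)
    return data_mm[start:end].decode("utf-8")

print(value_for_doc(0))  # cherry
print(value_for_doc(3))  # apple
```

Note that sorting itself never touches the data file: comparing two documents' ords is enough, which is why the compressed file-flat form can stay smaller than materialized string objects.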
{quote}
The compound file system makes the file proliferation bearable, though. And it's actually nice in a way to have data structures as named files, strongly separated from each other and persistent.
{quote}

But the CFS construction must also go through the filesystem (like Lucene), right? So you still incur the IO load of creating the small files, then a 2nd pass to consolidate. I agree there's a certain design purity to having the files clearly separate out the elements of the data structures, but if it means erratic search performance... function over form?

{quote}
If we were willing to ditch portability, we could cast to arrays of structs in Lucy - but so far we've just used primitives. I'd like to keep it that way, since it would be nice if the core Lucy file format was at least theoretically compatible with a pure Java implementation. But Lucy plugins could break that rule and cast to structs if desired.
{quote}

Someday we could make a Lucene codec that interacts with a Lucy index... would be a good exercise to go through, to see if the flex API really is "flex" enough...

{quote}
bq. The RAM resident things Lucene has - norms, deleted docs, terms index, field cache - seem to "cast" just fine to file-flat.

There are often benefits to keeping stuff "file-flat", particularly when the file-flat form is compressed. If we were to expand those sort caches to string objects, they'd take up more RAM than they do now.
{quote}

We're leaving them as UTF-8 by default for Lucene (with the flex changes). Still, the terms index once loaded does have silly RAM overhead... we can cut that back a fair amount though.

{quote}
I think the only significant drawback is security: we can't trust memory mapped data the way we can data which has been read into process RAM and checked on the way in. For instance, we need to perform UTF-8 sanity checking each time a string sort cache value escapes the controlled environment of the cache reader.
If the sort cache value was instead derived from an existing string in process RAM, we wouldn't need to check it.
{quote}

Sigh, that's a curious downside... so term-decode-intensive uses (merging, range queries, I guess maybe term dict lookup) take the brunt of that hit?

{quote}
bq. If we switched to an FST for the terms index I guess that could get tricky...

Hmm, I haven't been following that.
{quote}

There's not much to follow -- it's all just talk at this point. I don't think anyone's built a prototype yet ;)

{quote}
Too much work to keep up with those giganto patches for flex indexing, even though it's a subject I'm intimately acquainted with and deeply interested in. I plan to look it over when you're done and see if we can simplify it.
{quote}

And then we'll borrow back your simplifications ;) Lather, rinse, repeat.

{quote}
bq. Wouldn't shared memory be possible for process-only concurrent models?

IPC is a platform-compatibility nightmare. By restricting ourselves to communicating via the file system, we save ourselves oodles of engineering time. And on really boring, frustrating work, to boot.
{quote}

I had assumed so too, but I was surprised that Python's multiprocessing module exposes a simple API for sharing objects from parent to forked child. It's at least a counterexample (though, in all fairness, I haven't looked at the impl ;)), ie, there seems to be some hope of containing shared memory under a consistent API. I'm just pointing out that "going through the filesystem" isn't the only way to have efficient process-only concurrency. Shared memory is another option, but, yes, it has tradeoffs.

{quote}
bq. Also, what popular systems/environments have this requirement (only process level concurrency) today?

Perl's threads suck. Actually all threads suck. Perl's are just worse than average - and so many Perl binaries are compiled without them.
Java threads suck less, but they still suck - look how much engineering time you folks blow on managing that stuff. Threads are a terrible programming model. I'm not into the idea of forcing Lucy users to use threads. They should be able to use processes as their primary concurrency model if they want.
{quote}

Yes, working with threads is a nightmare (eg have a look at Java's memory model). I think the jury is still out (for our species) on just how, long term, we'll make use of concurrency with the machines. I think we may need to largely take "time" out of our programming languages, eg switch to much more declarative code, or something... wanna port Lucy to Erlang? But I'm not sure process-only concurrency, sharing only via file-backed memory, is the answer either ;)

{quote}
bq. It's wonderful that Lucy can startup really fast, but, for most apps that's not nearly as important as searching/indexing performance, right?

Depends. Total indexing throughput in both Lucene and KinoSearch has been pretty decent for a long time. However, there's been a large gap between average index update performance and worst case index update performance, especially when you factor in sort cache loading. There are plenty of applications that may not have very high throughput requirements, but where it may not be acceptable for an index update to take several seconds or several minutes every once in a while, even if it usually completes faster.

bq. I mean, you start only once, and then you handle many, many searches / index many documents, with that process, usually?

Sometimes the person who just performed the action that updated the index is the only one you care about. For instance, to use a feature request that came in from Slashdot a while back, if someone leaves a comment on your website, it's nice to have it available in the search index right away. Consistently fast index update responsiveness makes personalization of the customer experience easier.
{quote}

Turnaround time for Lucene NRT is already very fast, as is. After an immense merge it'll be at its worst, but if you warm the reader first, that won't be an issue. Using Zoie you can make reopen time insanely fast (much faster than I think necessary for most apps), but at the expense of some expected hit to searching/indexing throughput. I don't think that's the right tradeoff for Lucene. I suspect Lucy is making a similar tradeoff, ie, that search performance will be erratic due to page faults, for a smallish gain in reopen time.

Do you have any hard numbers on how much time it takes Lucene to load from a hot IO cache, populating its RAM resident data structures? I wonder in practice what extra cost we are really talking about... it's RAM-to-RAM "translation" of data structures (if the files are hot). FieldCache we just have to fix to stop doing uninversion... (ie we need CSF).

{quote}
bq. Is reopen even necessary in Lucy?

Probably. If you have a boatload of segments and a boatload of fields, you might start to see file opening and metadata parsing costs come into play. If it turns out that for some indexes reopen() can knock down the time from, say, 100 ms to 10 ms or less, I'd consider that sufficient justification.
{quote}

OK. Then, you are basically pooling your readers ;) Ie, you do allow in-process sharing, but only among readers.

> Refactoring of IndexWriter
> --------------------------
>
> Key: LUCENE-2026
> URL: https://issues.apache.org/jira/browse/LUCENE-2026
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
> I've been thinking for a while about refactoring the IndexWriter into two main components.
> One could be called a SegmentWriter, and as the name says its job would be to write one particular index segment. The default one, just as today, would provide methods to add documents and flush when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments file and merging/deleting segments. It would know about DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would provide hooks that allow users to manage external data structures and keep them in sync with Lucene's data during segment merges.
> API-wise there are things we have to figure out, such as where the updateDocument() method would fit in, because its deletion part affects all segments, whereas the new document is only being added to the new segment.
> Of course these should be lower-level APIs for things like parallel indexing and related use cases. That's why we should still provide easy-to-use APIs like today for people who don't need to care about per-segment ops during indexing. So the current IndexWriter could probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
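The split that the issue description proposes can be sketched in outline. Everything here is hypothetical: the class and method names (SegmentWriter, SegmentManager, SimpleIndexWriter) are illustrations of the described responsibilities, not a proposed Lucene API, and the "merge policy" is a deliberately trivial stand-in:

```python
# Hedged sketch of the proposed IndexWriter split. One component writes
# a single segment; another manages the set of live segments and merges;
# a facade keeps today's easy add_document()-style API on top.

class SegmentWriter:
    """Writes one index segment; flushes when its buffer fills."""
    def __init__(self, max_buffered_docs=2):
        self.buffer = []
        self.max_buffered_docs = max_buffered_docs
        self.flushed_segments = []

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed_segments.append(list(self.buffer))
            self.buffer = []

class SegmentManager:
    """Tracks live segments and applies merges. In the real proposal
    this component would consult DeletionPolicy, MergePolicy, and
    MergeScheduler, and expose hooks so users can keep external data
    structures in sync across segment merges."""
    def __init__(self):
        self.segments = []

    def register(self, segment):
        self.segments.append(segment)

    def maybe_merge(self, max_segments=2):
        # Trivial stand-in for a MergePolicy: collapse everything once
        # the segment count exceeds max_segments.
        if len(self.segments) > max_segments:
            merged = [doc for seg in self.segments for doc in seg]
            self.segments = [merged]

class SimpleIndexWriter:
    """Facade keeping today's easy API, delegating to the new classes."""
    def __init__(self):
        self.writer = SegmentWriter()
        self.manager = SegmentManager()

    def add_document(self, doc):
        self.writer.add_document(doc)
        for seg in self.writer.flushed_segments:
            self.manager.register(seg)
        self.writer.flushed_segments = []
        self.manager.maybe_merge()

w = SimpleIndexWriter()
for i in range(6):
    w.add_document({"id": i})
# Six docs flushed as three 2-doc segments, then merged down to one.
```

The awkward case the description calls out, updateDocument(), shows up clearly in this shape: the add half belongs to the SegmentWriter, while the delete half must be routed through the SegmentManager because it touches every live segment.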