[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793737#action_12793737 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------


{quote}
Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
Making process-only concurrency efficient isn't optional - it's a core
concern.
{quote}

OK

{quote}
Lightweight searchers mean architectural freedom.

Create 2, 10, 100, 1000 Searchers without a second thought - as many as you
need for whatever app architecture you just dreamed up - then destroy them
just as effortlessly. Add another worker thread to your search server without
having to consider the RAM requirements of a heavy searcher object. Create a
command-line app to search a documentation index without worrying about
daemonizing it. Etc.
{quote}

This is definitely neat.

{quote}
The Linux virtual memory system, at least, is not a pure LRU. It utilizes a
page aging algo which prioritizes pages that have historically been accessed
frequently even when they have not been accessed recently:

http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html
{quote}

Very interesting -- thanks.  So it also factors in how much the page
was used in the past, not just how long it's been since the page was
last used.
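
Roughly, as I read that page, the policy amounts to something like the
toy sketch below (Java, purely illustrative -- not the kernel's code, and
the constants are made up):

{code:java}
// Toy model of the page-aging idea (illustrative only -- not the kernel's
// code; MAX_AGE and TOUCH_BONUS are invented constants).
class AgedPage {
    static final int MAX_AGE = 64;
    static final int TOUCH_BONUS = 3;

    int age = TOUCH_BONUS;

    void touch() {               // page accessed: age goes up (capped)
        age = Math.min(MAX_AGE, age + TOUCH_BONUS);
    }

    void decay() {               // periodic sweep: age drifts back down
        if (age > 0) age--;
    }

    boolean swapCandidate() {    // only fully aged-out pages get swapped
        return age == 0;
    }
}
{code}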

{quote}
When will swapping out the term dictionary be a problem?

For indexes where queries are made frequently, no problem.
For systems with plenty of RAM, no problem.
For systems that aren't very busy, no problem.
For small indexes, no problem.
The only situation we're talking about is infrequent queries against large
indexes on busy boxes where RAM isn't abundant. Under those circumstances, it
might be noticeable that Lucy's term dictionary gets paged out somewhat
sooner than Lucene's.
{quote}

Even smallish indexes can see their pages swapped out?  I'd think at
low-to-moderate search traffic, any index could be at risk, depending
on whether other things on the machine are competing for RAM or the IO
cache.

{quote}
But in general, if the term dictionary gets paged out, so what? Nobody was
using it. Maybe nobody will make another query against that index until next
week. Maybe the OS made the right decision.
{quote}

You can't afford many page faults before the latency becomes very
apparent (until we're all on SSDs... at which point this may all be
moot).

Right -- the metric the swapper optimizes for is efficient overall
use of the machine's resources.

But I think that's often a poor metric for search apps... I think
consistency in search latency is more important, though I agree it
depends very much on the app.

I don't like the same behavior in my desktop -- when I switch to my
mail client, I don't want to wait 10 seconds for it to swap the pages
back in.

{quote}
Let me turn your question on its head. What does Lucene gain in return for
the slow index opens and large process memory footprint of its heavy
searchers?
{quote}

Consistency in the search time.  Assuming the OS doesn't swap our
pages out...

And of course Java pretty much forces threads-as-concurrency (JVM
startup time and hotspot compilation are costly).

{quote}
If necessary, there's a straightforward remedy: slurp the relevant files into
RAM at object construction rather than mmap them. The rest of the code won't 
know the difference between malloc'd RAM and mmap'd RAM. The slurped files 
won't take up any more space than the analogous Lucene data structures; more 
likely, they'll take up less.

That's the kind of setting we'd hide away in the IndexManager class rather
than expose as prominent API, and it would be a hint to index components
rather than an edict.
{quote}

Right, this is how Lucy would force warming.
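
In Java terms that knob would be the difference between a
MappedByteBuffer and slurping the file onto the heap -- a rough sketch,
where "terms.dat" is a made-up file name and real code would also handle
files over 2 GB:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class TermDictLoader {

    // mmap: no up-front copy; the OS pages data in and out as it sees fit.
    static ByteBuffer mapTermDict() throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("terms.dat"), StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // slurp: copy the whole file onto the heap at construction time
    // (forced warming).  Callers see the same ByteBuffer API either way.
    static ByteBuffer slurpTermDict() throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("terms.dat"), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) > 0) {
                // keep reading until the buffer is full
            }
            buf.flip();
            return buf;
        }
    }
}
{code}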

{quote}
bq. Yeah, that you need 3 files for the string sort cache is a little spooky... 
that's 3X the chance of a page fault.

Not when using the compound format.
{quote}

But, even within that CFS file, these three sub-files will not be
local?  Ie you'll still have to hit three pages per "lookup" right?
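
To spell out the three hits I mean, assuming a layout roughly like this
(my guess, not Lucy's actual format) -- ords, then offsets, then the
character data:

{code:java}
import java.nio.charset.StandardCharsets;

// Assumed layout: three regions that would each live in a different part
// of the compound file.
class StringSortCache {
    private final int[] ords;      // region 1: docID -> ord
    private final long[] offsets;  // region 2: ord -> start byte (one extra sentinel at the end)
    private final byte[] utf8;     // region 3: concatenated UTF-8 sort keys

    StringSortCache(int[] ords, long[] offsets, byte[] utf8) {
        this.ords = ords;
        this.offsets = offsets;
        this.utf8 = utf8;
    }

    // One key lookup dereferences all three regions -- three chances to fault.
    String sortKey(int docId) {
        int ord = ords[docId];                               // touch region 1
        int start = (int) offsets[ord];                      // touch region 2
        int len = (int) (offsets[ord + 1] - offsets[ord]);
        return new String(utf8, start, len, StandardCharsets.UTF_8);  // touch region 3
    }
}
{code}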

{quote}  
I think relying heavily on file-backed memory is particularly appropriate for
Lucy because the write-once file format works well with MAP_SHARED memory
segments. If files were being modified and had to be protected with
semaphores, it wouldn't be as sweet a match.
{quote}  

Write-once is good for Lucene too.

{quote}
Focusing on process-only concurrency also works well for Lucy because host
threading models differ substantially and so will only be accessible via a
generalized interface from the Lucy C core. It will be difficult to tune
threading performance through that layer of indirection - I'm guessing beyond
the ability of most developers since few will be experts in multiple host
threading models. In contrast, expertise in process level concurrency will be
easier to come by and to nourish.
{quote}

I'm confused by this -- eg Python does a great job presenting a simple
threads interface and implementing it on major OSs.  And it seems like
Lucy would not need anything crazy OS-specific wrt threads?

{quote}
bq. Do you have any hard numbers on how much time it takes Lucene to load from 
a hot IO cache, populating its RAM resident data structures?

Hmm, I don't spend a lot of time working with Lucene directly, so I might not
be the person most likely to have data like that at my fingertips. Maybe that
McCandless dude can help you out, he runs a lot of benchmarks.  
{quote}

Hmm ;) I'd guess that field cache is slowish; deleted docs & norms are
very fast; terms index is somewhere in between.
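
If someone wants harder numbers, a quick-and-dirty timer along these
lines against the current APIs would do it ("title" is a placeholder
field name, and the index path comes from the command line):

{code:java}
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.FSDirectory;

public class WarmTimer {
    public static void main(String[] args) throws Exception {
        long t0 = System.currentTimeMillis();
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true);
        long tOpen = System.currentTimeMillis();   // terms index, norms, deleted docs load here

        TermEnum te = reader.terms();              // also pull the full term dictionary through
        int terms = 0;
        while (te.next()) {
            terms++;
        }
        te.close();
        long tTerms = System.currentTimeMillis();

        FieldCache.DEFAULT.getStrings(reader, "title");   // un-invert a string field
        long tCache = System.currentTimeMillis();

        System.out.println("open: " + (tOpen - t0) + " ms, term scan: " + (tTerms - tOpen)
            + " ms (" + terms + " terms), field cache: " + (tCache - tTerms) + " ms");
        reader.close();
    }
}
{code}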

bq. Or maybe ask the Solr folks? I see them on solr-user all the time talking 
about "MaxWarmingSearchers". 

Hmm -- not sure what's up with that.  Looks like maybe it's the
auto-warming that might happen after a commit.

{quote}
bq. OK. Then, you are basically pooling your readers Ie, you do allow 
in-process sharing, but only among readers.

Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for
each new segment, but they would be private to each parent PolyReader. So if
you reopened two IndexReaders at the same time after e.g. segment "seg_12"
had been added, each would create a new, private SegReader for "seg_12".
{quote}

You're right, you'd get two readers for seg_12 in that case.  By
"pool" I meant you're tapping into all the sub-readers that the
existing reader has opened -- the reader is your pool of sub-readers.
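
Ie, on the Lucene side something like this sketch (2.9/3.x APIs,
assuming the reader is a directory-level reader) shows the sharing I
mean:

{code:java}
import org.apache.lucene.index.IndexReader;

// reopen() gives you a new top-level reader, but sub-readers for
// unchanged segments are the very same objects -- that's the "pool".
public class ReopenSharing {

    static void showSharing(IndexReader before) throws Exception {
        IndexReader after = before.reopen();            // cheap if little has changed
        if (after == before) {
            System.out.println("nothing changed; same reader returned");
            return;
        }
        IndexReader[] oldSubs = before.getSequentialSubReaders();  // null for atomic readers
        IndexReader[] newSubs = after.getSequentialSubReaders();
        int shared = 0;
        for (IndexReader newSub : newSubs) {
            for (IndexReader oldSub : oldSubs) {
                if (newSub == oldSub) {                 // identical object => reused from the pool
                    shared++;
                    break;
                }
            }
        }
        System.out.println(shared + " of " + newSubs.length + " sub-readers reused");
        after.close();
    }
}
{code}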


> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one, just as today, will provide methods to add documents and
> flush when its buffer is full.
> Other SegmentWriter implementations would do things like appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.
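
For illustration only, the split described above could look roughly like
the following interfaces -- none of these names exist in Lucene, they are
hypothetical:

{code:java}
import java.io.IOException;
import org.apache.lucene.document.Document;

// Hypothetical names sketching the proposed division of responsibility.
interface SegmentWriter {
    void addDocument(Document doc) throws IOException;   // buffer docs for a single segment
    String flush() throws IOException;                    // write the segment, return its name
}

interface SegmentManager {
    void registerSegment(String segmentName) throws IOException;  // record it in the segments file
    void maybeMerge() throws IOException;                          // consult MergePolicy / MergeScheduler
    void addMergeListener(MergeListener listener);                 // hook for external data structures
}

interface MergeListener {
    void onMergeComplete(String[] mergedSegments, String newSegment);
}
{code}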

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

