[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ]
Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> I guess my confusion is what are all the other benefits of using
> file-backed RAM? You can efficiently use process only concurrency
> (though shared memory is technically an option for this too), and you
> have wicked fast open times (but, you still must warm, just like
> Lucene).

Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
Making process-only concurrency efficient isn't optional -- it's a *core*
*concern*.

> What else? Oh maybe the ability to inform OS not to cache
> eg the reads done when merging segments. That's one I sure wish
> Lucene could use...

Lightweight searchers mean architectural freedom. Create 2, 10, 100, 1000
Searchers without a second thought -- as many as you need for whatever app
architecture you just dreamed up -- then destroy them just as effortlessly.
Add another worker thread to your search server without having to consider
the RAM requirements of a heavy searcher object. Create a command-line app
to search a documentation index without worrying about daemonizing it. Etc.

If your normal development pattern is a single monolithic Java process, then
that freedom might not mean much to you. But with their low per-object RAM
requirements and fast opens, lightweight searchers are easy to use within a
lot of other development patterns. For example: lightweight searchers work
well for maxing out multiple CPU cores under process-only concurrency.

> In exchange you risk the OS making poor choices about what gets
> swapped out (LRU policy is too simplistic... not all pages are created
> equal),

The Linux virtual memory system, at least, is not a pure LRU.
It utilizes a page-aging algorithm which prioritizes pages that have
historically been accessed frequently, even when they have not been accessed
recently:

{panel}
http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

The default action when a page is first allocated, is to give it an initial
age of 3. Each time it is touched (by the memory management subsystem) it's
age is increased by 3 to a maximum of 20. Each time the Kernel swap daemon
runs it ages pages, decrementing their age by 1.
{panel}

And while that system may not be ideal from our standpoint, it's still
pretty good. In general, the operating system's virtual memory scheme is
going to work fine as designed, for us and everyone else, and minimize
memory availability wait times.

When will swapping out the term dictionary be a problem?

* For indexes where queries are made frequently, no problem.
* For systems with plenty of RAM, no problem.
* For systems that aren't very busy, no problem.
* For small indexes, no problem.

The only situation we're talking about is infrequent queries against large
indexes on busy boxes where RAM isn't abundant. Under those circumstances,
it *might* be noticeable that Lucy's term dictionary gets paged out somewhat
sooner than Lucene's.

But in general, if the term dictionary gets paged out, so what? Nobody was
using it. Maybe nobody will make another query against that index until next
week. Maybe the OS made the right decision.

OK, so there's a vulnerable bubble where the query rate against a large
index is neither too fast nor too slow, on busy machines where RAM isn't
abundant. I don't think that bubble ought to drive major architectural
decisions.

Let me turn your question on its head. What does Lucene gain in return for
the slow index opens and large process memory footprint of its heavy
searchers?

> I do love how pure the file-backed RAM approach is, but I worry that
> down the road it'll result in erratic search performance in certain
> app profiles.
If necessary, there's a straightforward remedy: slurp the relevant files
into RAM at object construction rather than mmap them. The rest of the code
won't know the difference between malloc'd RAM and mmap'd RAM. The slurped
files won't take up any more space than the analogous Lucene data
structures; more likely, they'll take up less.

That's the kind of setting we'd hide away in the IndexManager class rather
than expose as prominent API, and it would be a hint to index components
rather than an edict.

> Yeah, that you need 3 files for the string sort cache is a little
> spooky... that's 3X the chance of a page fault.

Not when using the compound format.

> But the CFS construction must also go through the filesystem (like
> Lucene) right? So you still incur IO load of creating the small
> files, then 2nd pass to consolidate.

Yes.

> I think we may need to largely take "time" out of our programming
> languages, eg switch to much more declarative code, or
> something... wanna port Lucy to Erlang?
>
> But I'm not sure process only concurrency, sharing only via
> file-backed memory, is the answer either

I think relying heavily on file-backed memory is particularly appropriate
for Lucy because the write-once file format works well with MAP_SHARED
memory segments. If files were being modified and had to be protected with
semaphores, it wouldn't be as sweet a match.

Focusing on process-only concurrency also works well for Lucy because host
threading models differ substantially and so will only be accessible via a
generalized interface from the Lucy C core. It will be difficult to tune
threading performance through that layer of indirection -- beyond the
ability of most developers, I'm guessing, since few will be experts in
multiple host threading models. In contrast, expertise in process-level
concurrency will be easier to come by and to nourish.
> Using Zoie you can make reopen time insanely fast (much faster than I
> think necessary for most apps), but at the expense of some expected
> hit to searching/indexing throughput. I don't think that's the right
> tradeoff for Lucene.

But as Jake pointed out early in the thread, Zoie achieves those insanely
fast reopens without tight coupling to IndexWriter and its components. The
auxiliary RAM index approach is well proven.

> Do you have any hard numbers on how much time it takes Lucene to load
> from a hot IO cache, populating its RAM resident data structures?

Hmm, I don't spend a lot of time working with Lucene directly, so I might
not be the person most likely to have data like that at my fingertips.
Maybe that McCandless dude can help you out; he runs a lot of benchmarks. ;)
Or maybe ask the Solr folks? I see them on solr-user all the time talking
about "MaxWarmingSearchers". ;)

> OK. Then, you are basically pooling your readers. Ie, you do allow
> in-process sharing, but only among readers.

Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders
for each new segment, but they would be private to each parent PolyReader.
So if you reopened two IndexReaders at the same time after e.g. segment
"seg_12" had been added, each would create a new, private SegReader for
"seg_12".

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
>
> One could be called a SegmentWriter and as the name says its job would
> be to write one particular index segment. The default one, just as
> today, will provide methods to add documents and flushes when its
> buffer is full. Other SegmentWriter implementations would do things
> like e.g.
> appending or copying external segments [what addIndexes*() currently
> does].
>
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
>
> Of course these should be lower-level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.