[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793918#action_12793918 ]
Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> Very interesting - thanks.  So it also factors in how much the page
> was used in the past, not just how long it's been since the page was
> last used.

In theory, I think that means the term dictionary will tend to be favored
over the posting lists.  In practice... hard to say; it would be difficult
to test. :(

> Even smallish indexes can see the pages swapped out?

Yes, you're right -- the wait time to get at a small term dictionary isn't
necessarily small.  I've amended my previous post, thanks.

> And of course Java pretty much forces threads-as-concurrency (JVM
> startup time, hotspot compilation, are costly).

Yes.  Java does a lot of stuff that most operating systems can also do, but
of course provides a coherent platform-independent interface.  In Lucy we're
going to try to go back to the OS for some of the stuff that Java likes to
take over -- provided that we can develop a sane genericized interface using
configuration probing and #ifdefs.

It's nice that as long as the box is up, our OS-as-JVM is always running, so
we don't have to worry about its (quite lengthy) startup time.

> Right, this is how Lucy would force warming.

I think slurp-instead-of-mmap is orthogonal to warming, because we can warm
file-backed RAM structures by forcing them into the IO cache, using either
the cat-to-dev-null trick or something more sophisticated.  The
slurp-instead-of-mmap setting would cause warming as a side effect, but the
main point would be to attempt to persuade the virtual memory system that
certain data structures should have a higher status and not be paged out as
quickly.

> But, even within that CFS file, these three sub-files will not be
> local?  Ie you'll still have to hit three pages per "lookup" right?

They'll be next to each other in the compound file, because
CompoundFileWriter orders them alphabetically.
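(For the record, the cat-to-dev-null warming trick mentioned above boils
down to sequentially reading a file and throwing the bytes away, so the OS
pulls its pages into the IO cache.  A minimal sketch in plain java.io --
the Warmer class and warm() method are hypothetical names, not Lucy or
Lucene API:)

```java
import java.io.FileInputStream;
import java.io.IOException;

public class Warmer {
    /**
     * Sequentially read the whole file and discard the contents.
     * The reads themselves pull the file's pages into the OS IO cache.
     * Returns the number of bytes touched.
     */
    public static long warm(String path) throws IOException {
        byte[] buf = new byte[1 << 16]; // read in 64 KB chunks
        long total = 0;
        try (FileInputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n; // bytes are discarded; warming is the side effect
            }
        }
        return total;
    }
}
```

Whether the kernel then *keeps* those pages resident is of course up to the
virtual memory system -- which is the whole point of the
slurp-instead-of-mmap discussion.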
For big segments, though, you're right that they won't be right next to each
other, and you could incur as many as three page faults when retrieving a
sort cache value.

But what are the alternatives for variable-width data like strings?  You
need the ords array anyway for efficient comparisons, so what's left are the
offsets array and the character data.  An array of String objects isn't
going to have better locality than one solid block of memory dedicated to
offsets and another solid block of memory dedicated to file data -- and it's
no fewer derefs, even if the String object stores its character data inline;
more if it points to a separate allocation (like Lucy's CharBuf does, since
it's mutable).

For each sort cache value lookup, you're going to need to access two blocks
of memory.

  * With the array of String objects, the first is the memory block
    dedicated to the array, and the second is the memory block dedicated to
    the String object itself, which contains the character data.
  * With the file-backed block sort cache, the first memory block is the
    offsets array, and the second is the character data array.

I think the locality costs should be approximately the same... have I missed
anything?

> Write-once is good for Lucene too.

Hellyeah.

> And it seems like Lucy would not need anything crazy-os-specific wrt
> threads?

It depends on how many classes we want to make thread-safe, and it's not
just the OS -- it's the host.

The bare minimum is simply to make Lucy thread-safe as a library.  That's
pretty close, because Lucy has studiously avoided global variables whenever
possible.  The only problems that have to be addressed are the
VTable_registry Hash, race conditions when creating new subclasses via
dynamic VTable singletons, and refcounts on the VTable objects themselves.
Once those issues are taken care of, you'll be able to use Lucy objects in
separate threads with no problem, e.g. one Searcher per thread.
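(A toy model of the file-backed block sort cache described above -- one flat
block of ords, one of offsets, one of character data, rather than an array
of String objects.  Class and method names here are hypothetical; in the
real thing the three blocks would be file-backed/mmap'd rather than heap
arrays:)

```java
import java.nio.charset.StandardCharsets;

public class BlockSortCache {
    private final int[] ords;       // doc id -> sorted position
    private final int[] offsets;    // sorted position -> start in charData,
                                    // with a trailing sentinel entry
    private final byte[] charData;  // all values concatenated in sort order

    public BlockSortCache(int[] ords, int[] offsets, byte[] charData) {
        this.ords = ords;
        this.offsets = offsets;
        this.charData = charData;
    }

    /** Compare two docs by ord alone -- no string data touched at all. */
    public int compare(int docA, int docB) {
        return Integer.compare(ords[docA], ords[docB]);
    }

    /**
     * Fetch the value for a doc: one hop into the offsets block, then one
     * hop into the character data block -- the "two blocks of memory"
     * per lookup discussed above.
     */
    public String value(int doc) {
        int ord = ords[doc];
        int start = offsets[ord];
        int end = offsets[ord + 1];
        return new String(charData, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Note that sorting only ever touches the ords block; the offsets and
character-data blocks are hit only when the value itself is needed.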
However, if you want to *share* Lucy objects (other than VTables) across
threads, all of a sudden we have to start thinking about "synchronized",
"volatile", etc.  Such constructs may not be efficient, or even possible,
under some threading models.

> Hmm I'd guess that field cache is slowish; deleted docs & norms are
> very fast; terms index is somewhere in between.

That jibes with my own experience.  So maybe consider file-backed sort
caches in Lucene, while keeping the status quo for everything else?

> You're right, you'd get two readers for seg_12 in that case.  By
> "pool" I meant you're tapping into all the sub-readers that the
> existing reader has opened - the reader is your pool of sub-readers.

Each unique SegReader will also have dedicated "sub-reader" objects: two
"seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12"
PostingsReaders, etc.  However, all those sub-readers will share the same
file-backed RAM data, so in that sense they're pooled.

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
>
> One could be called a SegmentWriter, and as the name says its job
> would be to write one particular index segment.  The default one, just
> as today, will provide methods to add documents and flushes when its
> buffer is full.  Other SegmentWriter implementations would do things
> like e.g. appending or copying external segments [what addIndexes*()
> currently does].
>
> The second component's job would be to manage writing the segments
> file and merging/deleting segments.  It would know about
> DeletionPolicy, MergePolicy and MergeScheduler.
> Ideally it would provide hooks that allow users to manage external
> data structures and keep them in sync with Lucene's data during
> segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
>
> Of course these should be lower-level APIs for things like parallel
> indexing and related use cases.  That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing.  So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org