[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

Marvin Humphrey (JIRA) Wed, 23 Dec 2009 10:27:01 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794137#action_12794137
 ]


Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> we can't give hints to the OS to tell it not to cache certain reads/writes
> (ie segment merging), 

For what it's worth, we haven't really solved that problem in Lucy either.
The sliding window abstraction we wrapped around mmap/MapViewOfFile largely
solved the problem of running out of address space on 32-bit operating
systems.  However, there's currently no way to invoke madvise through Lucy's
IO abstraction layer -- it's a little tricky with compound files.  

Linux, at least, requires that the address supplied to madvise be page-aligned.
So, say we're starting off on a posting list, and we want to communicate to
the OS that it should treat the region we're about to read as MADV_SEQUENTIAL.
If the postings file starts in the middle of a 4k page and the file right
before it is a term dictionary, the advice would also cover the tail of the
term dictionary, and we don't want that region treated as sequential.
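
To make the alignment constraint concrete, here is a minimal sketch in Python
(not Lucy code).  It computes the largest page-aligned range that lies entirely
inside one virtual file, so that advice applied to it never touches a
neighboring file.  The function name and the 4096-byte page size are
assumptions for illustration only.

```python
PAGE_SIZE = 4096  # assumed page size; a real caller would query the OS


def aligned_interior(offset, length, page_size=PAGE_SIZE):
    """Return (start, length) of the page-aligned range fully inside
    the virtual file [offset, offset + length), or None if no whole
    page fits inside it."""
    # Round the start up so we skip the page shared with the previous file.
    start = (offset + page_size - 1) // page_size * page_size
    # Round the end down so we skip the page shared with the next file.
    end = (offset + length) // page_size * page_size
    if end <= start:
        return None
    return start, end - start


# A hypothetical postings file that starts mid-page at byte 6000.
print(aligned_interior(6000, 20000))
# A tiny virtual file that never covers a whole page.
print(aligned_interior(100, 200))
```

The cost of this approach is that the partial pages at either end of the
virtual file get no advice at all, which matters more for small files.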

I'm not sure how to solve that problem without violating the encapsulation of
the compound file model.  Hmm, maybe we could store metadata about the virtual
files indicating usage patterns (sequential, random, etc.)?  That could work,
since files are generally part of dedicated data structures whose usage
patterns are known at index time.
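
As a rough sketch of what such metadata might look like (everything here is
hypothetical: the class names, the field names, and the file names are invented
for illustration and do not describe Lucy's actual compound file format), each
virtual file entry could carry an access hint next to its offset and length:

```python
from dataclasses import dataclass
from enum import Enum


class AccessPattern(Enum):
    NORMAL = "normal"
    SEQUENTIAL = "sequential"
    RANDOM = "random"


@dataclass(frozen=True)
class VirtualFile:
    offset: int   # byte offset inside the compound file
    length: int   # length of the virtual file in bytes
    access: AccessPattern = AccessPattern.NORMAL  # hint recorded at index time


def access_for(manifest, name):
    """Look up the recorded hint, falling back to NORMAL for unknown files."""
    entry = manifest.get(name)
    return entry.access if entry is not None else AccessPattern.NORMAL


# Hypothetical manifest written by the indexer alongside the compound file.
manifest = {
    "lexicon.dat": VirtualFile(offset=0, length=6000,
                               access=AccessPattern.RANDOM),
    "postings.dat": VirtualFile(offset=6000, length=20000,
                                access=AccessPattern.SEQUENTIAL),
}

print(access_for(manifest, "postings.dat"))
```

The reader of the compound file would then translate the hint into whatever
advice call the platform offers, without the caller needing to know where the
virtual file sits physically.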

Or maybe we just punt on that use case and worry only about segment merging.
Hmm, wouldn't the act of deleting a file (and releasing all file descriptors)
tell the OS that it's free to recycle any memory pages associated with it?

> Actually why can't ord & offset be one, for the string sort cache?
> Ie, if you write your string data in sort order, then the offsets are
> also in sort order? (I think we may have discussed this already?)

Right, we discussed this on lucy-dev last spring:

    http://markmail.org/message/epc56okapbgit5lw
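
For readers who haven't followed that thread, the idea in the quoted question
can be sketched as below.  This is only an illustration in Python with made-up
names and a made-up length-prefixed layout, not the format used in KS or Lucy:
if the unique values are written in sorted order, each value's byte offset
grows with its sort position, so comparing offsets gives the same result as
comparing ords.

```python
def build_sort_cache(doc_values):
    """Write unique values in sorted order and return (blob, doc_offsets),
    where doc_offsets[i] is the byte offset of document i's value."""
    blob = bytearray()
    offset_of = {}
    for value in sorted(set(doc_values)):
        offset_of[value] = len(blob)
        data = value.encode("utf-8")
        # Length-prefixed record: 4-byte big-endian length, then the bytes.
        blob += len(data).to_bytes(4, "big") + data
    return bytes(blob), [offset_of[v] for v in doc_values]


values = ["pear", "apple", "fig", "apple"]
blob, doc_offsets = build_sort_cache(values)

# Sorting documents by offset matches sorting them by their string value.
by_offset = sorted(range(len(values)), key=doc_offsets.__getitem__)
by_value = sorted(range(len(values)), key=values.__getitem__)
print(by_offset, by_value)
```

The trade-off is that offsets need more bits per document than dense ords
would, since they span the whole blob rather than just the number of unique
values.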

Incidentally, some of this thread replays our exchange at the top of
LUCENE-1458 from a year ago.  It was fun to go back and reread that: in the
interim, we've implemented segment-centric search and memory mapped field
caches and term dictionaries, both of which were first discussed back then.
:)

Ords are great for low cardinality fields of all kinds, but become less
efficient for high cardinality primitive numeric fields.  For simplicity's
sake, the prototype implementation of mmap'd field caches in KS always uses
ords.
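
One way to see the cardinality trade-off is a back-of-the-envelope size
estimate.  The sketch below assumes fixed-width 4-byte values and bit-packed
ords; both are assumptions made for illustration and not a description of the
KS prototype, and the function names are invented.

```python
def ord_cache_bytes(num_docs, cardinality, value_bytes=4):
    """Approximate size of an ord-based cache: one bit-packed ord per
    document plus a table holding each unique value once."""
    bits_per_ord = max(1, (cardinality - 1).bit_length())
    ord_array = (num_docs * bits_per_ord + 7) // 8
    value_table = cardinality * value_bytes
    return ord_array + value_table


def direct_cache_bytes(num_docs, value_bytes=4):
    """Approximate size when each document stores its value directly."""
    return num_docs * value_bytes


docs = 1_000_000
# Low cardinality: ords are far smaller than storing values directly.
print(ord_cache_bytes(docs, 10), direct_cache_bytes(docs))
# Every document unique: the ord array is pure overhead on top of the values.
print(ord_cache_bytes(docs, docs), direct_cache_bytes(docs))
```

Under these assumptions the ord-based layout wins easily with 10 unique values
and loses once nearly every document has its own value, on top of the extra
indirection needed to get from an ord to the value itself.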

> You don't want to have to create Lucy's equivalent of the JMM...

The more I think about making Lucy classes thread safe, the harder it seems.
:(  I'd like to make it possible to share a Schema across threads, for
instance, but that means all its Analyzers, etc have to be thread-safe as
well, which isn't practical when you start getting into contributed
subclasses.  

Even if we succeed in getting Folders and FileHandles thread safe, it will be
hard for the user to keep track of what they can and can't do across threads.
"Don't share anything" is a lot easier to understand.

We reap a big benefit by making Lucy's metaclass infrastructure thread-safe.
Beyond that, seems like there's a lot of pain for little gain.

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one just as today will provide methods to add documents and
> flushes when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would it be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keeps most of its APIs and delegate to the new classes.

