[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

Michael McCandless (JIRA) Fri, 18 Dec 2009 16:11:43 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792713#action_12792713
 ]


Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
bq. Again: NRT is not a "specialized reader". It's a normal read-only 
DirectoryReader, just like you'd get from IndexReader.open, with the only 
difference being that it consulted IW to find which segments to open. Plus, 
it's pooled, so that if IW already has a given segment reader open (say because 
deletes were applied or merges are running), it's reused.

Well, it seems to me that those two features make it special - particularly
the pooling of SegmentReaders. You can't take advantage of that outside the
context of IndexWriter:
{quote}

OK, so maybe a little special ;) But really, that pooling should be
factored out of IW.  It's not writer specific.

{quote}
bq. Yes, Lucene's approach must be in the same JVM. But we get important gains 
from this - reusing a single reader (the pool), carrying over merged deletions 
directly in RAM (and eventually field cache & norms too - LUCENE-1785).

Exactly. In my view, that's what makes that reader "special": unlike ordinary
Lucene IndexReaders, this one springs into being with its caches already
primed rather than in need of lazy loading.

But to achieve those benefits, you have to mod the index writing process.
{quote}

Mod the index writing, and the reader reopen, to use the shared pool.
The pool in itself isn't writer specific.

Really the pool is just like what you tap into when you call reopen --
that method looks at the current "pool" of already opened segments,
sharing what it can.
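
To make the "shared pool" idea concrete, here is a minimal sketch of a reference-counted pool of per-segment readers that both a writer and a reader reopen could consult. It is an illustration only: SegmentPool, PooledSegment, acquire and release are hypothetical names, not Lucene's actual classes or methods, and the "open" step is simulated rather than reading any files.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-segment reader; not Lucene's SegmentReader. */
final class PooledSegment {
    final String name;
    int refCount; // only touched while holding the pool's lock

    PooledSegment(String name) {
        this.name = name;
    }
}

/** Hypothetical pool that a writer and a reader reopen could both share. */
final class SegmentPool {
    private final Map<String, PooledSegment> open = new HashMap<>();
    private int opens; // how many times a segment really had to be opened

    /** Returns the already-open entry for this segment if there is one, else opens it. */
    synchronized PooledSegment acquire(String segmentName) {
        PooledSegment s = open.get(segmentName);
        if (s == null) {
            s = new PooledSegment(segmentName); // simulated "open from disk"
            open.put(segmentName, s);
            opens++;
        }
        s.refCount++;
        return s;
    }

    /** Drops one reference; the entry leaves the pool when nobody uses it. */
    synchronized void release(PooledSegment s) {
        if (--s.refCount == 0) {
            open.remove(s.name);
        }
    }

    synchronized int openCount() {
        return opens;
    }

    synchronized int size() {
        return open.size();
    }
}
```

In this sketch, nothing in SegmentPool refers to a writer: whoever asks for a segment second simply gets the instance that is already open, which is the sense in which the pool is not writer specific.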

bq. Those modifications are not necessary under the Lucy model, because the 
mere act of writing the index stores our data in the system IO cache.

But, that's where Lucy presumably takes a perf hit.  Lucene can share
these in RAM, not using the filesystem as the intermediary (eg we do
that today with deletions; norms/field cache/eventual CSF can do the
same.)  Lucy must go through the filesystem to share.

{quote}
bq. Instead, Lucy (by design) must do all sharing & access all index data 
through the filesystem (a decision, I think, could be dangerous), which will 
necessarily increase your reopen time.

Dangerous in what sense?

Going through the file system is a tradeoff, sure - but it's pretty nice to
design your low-latency search app free from any concern about whether
indexing and search need to be coordinated within a single process.
Furthermore, if separate processes are your primary concurrency model, going
through the file system is actually mandatory to achieve best performance on a
multi-core box. Lucy won't always be used with multi-threaded hosts.

I actually think going through the file system is dangerous in a different
sense: it puts pressure on the file format spec. The easy way to achieve IPC
between writers and readers will be to dump stuff into one of the JSON files
to support the killer-feature-du-jour - such as what I'm proposing with this
"fsync" key in the snapshot file. But then we wind up with a bunch of crap
cluttering up our index metadata files. I'm determined that Lucy will have a
more coherent file format than Lucene, but with this IPC requirement we're
setting our community up to push us in the wrong direction. If we're not
careful, we could end up with a file format that's an unmaintainable jumble.

But you're talking performance, not complexity costs, right?
{quote}

Mostly I was thinking performance, ie, trusting the OS to make good
decisions about what should be RAM resident, when it has limited
information...

But, also risky is that all important data structures must be
"file-flat", though in practice that doesn't seem like an issue so
far?  The RAM resident things Lucene has -- norms, deleted docs, terms
index, field cache -- seem to "cast" just fine to file-flat.  If we
switched to an FST for the terms index I guess that could get
tricky...

Wouldn't shared memory be possible for process-only concurrent models?
Also, what popular systems/environments have this requirement (only
process level concurrency) today?

It's wonderful that Lucy can start up really fast, but, for most apps
that's not nearly as important as searching/indexing performance,
right?  I mean, you start only once, and then you handle many, many
searches / index many documents, with that process, usually?

{quote}
bq. Maybe in practice that cost is small though... the OS write cache should 
keep everything fresh... but you still must serialize.

Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records
and 900 MB worth of sort cache data; opening a fresh searcher and loading all
sort caches takes circa 21 ms.
{quote}

That's fabulously fast!

But you really need to also test search/indexing throughput, reopen time
(I think) once that's online for Lucy...

{quote}
There's room to improve that further - we haven't yet implemented
IndexReader.reopen() - but that was fast enough to achieve what we wanted to
achieve.
{quote}

Is reopen even necessary in Lucy?

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one, just as today, will provide methods to add documents and
> flush when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.
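
One way to picture the two-component split described in the issue is the sketch below. It is only an illustration under stated assumptions: SegmentWriter is the name proposed in the description, but its flush() signature, BufferingSegmentWriter, SegmentManager, register() and the segment naming are all hypothetical, documents are plain strings, and nothing is written to disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Component 1 (name from the issue, signature assumed): writes one segment. */
interface SegmentWriter {
    /** Writes buffered content as one segment; returns its name, or null if empty. */
    String flush();
}

/** Component 2 (hypothetical): tracks the list of segments in the index. */
final class SegmentManager {
    private final List<String> segments = new ArrayList<>();
    private int counter;

    /** Records a newly written segment and returns its illustrative name. */
    synchronized String register(int docCount) {
        String name = "_" + (counter++);
        segments.add(name);
        return name;
    }

    synchronized List<String> segmentNames() {
        return new ArrayList<>(segments);
    }
}

/** Hypothetical default writer: buffers documents, flushes when the buffer is full. */
final class BufferingSegmentWriter implements SegmentWriter {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBufferedDocs;
    private final SegmentManager manager;

    BufferingSegmentWriter(SegmentManager manager, int maxBufferedDocs) {
        this.manager = manager;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush(); // automatic flush once the buffer is full
        }
    }

    @Override
    public String flush() {
        if (buffer.isEmpty()) {
            return null;
        }
        String name = manager.register(buffer.size());
        buffer.clear();
        return name;
    }
}
```

Under this sketch, an IndexWriter-like facade could keep its current API by holding one SegmentManager plus a default SegmentWriter and delegating to them; where the delete half of updateDocument() belongs is left open here, as it is in the description.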

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
