[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773663#action_12773663
 ] 

Michael Busch commented on LUCENE-1879:
---------------------------------------

I realize the current implementation that's attached here is quite
complicated, because it works on top of Lucene's APIs.

However, I really like its flexibility. Right now you can easily
rewrite certain parallel indexes without touching others. I use it in
quite different ways. E.g. you can easily load one parallel index into a
RAMDirectory or onto an SSD and leave the other ones on conventional disk.

LUCENE-2025 only optimizes one use case of parallel indexing: the one
where you want to (re)write a parallel index containing *only* posting
lists. This will especially improve scenarios like the one Yonik pointed
out a while ago on java-dev, where you want to update only a few
documents rather than, e.g., a certain field for all documents.

In other use cases it is certainly desirable to have a parallel index
that contains a store. It really depends on what data you want to
update individually.

I envision the version of parallel indexing that goes into Lucene's
core quite differently from the current patch here. That's why I'd
like to refactor the IndexWriter (LUCENE-2026) into a SegmentWriter and,
let's call it, an IndexManager (the component that controls flushing,
merging, etc.). You can then have a ParallelSegmentWriter, which
partitions the data into parallel segments, and the IndexManager can
behave the same way as before.
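
A minimal sketch of the split described above, in plain Java without any
Lucene dependency. Only the names SegmentWriter, ParallelSegmentWriter and
IndexManager come from the comment; every method, field and the
BufferingSegmentWriter helper are hypothetical and exist only to illustrate
the idea that the manager need not know whether a segment is parallel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical: writes the documents of one segment. Not a Lucene API. */
interface SegmentWriter {
    void addDocument(Map<String, String> doc);
    int docCount();
}

/** Hypothetical in-memory writer that simply buffers documents. */
class BufferingSegmentWriter implements SegmentWriter {
    final List<Map<String, String>> docs = new ArrayList<>();
    public void addDocument(Map<String, String> doc) { docs.add(doc); }
    public int docCount() { return docs.size(); }
}

/** Partitions each document by field into parallel slices with aligned docIDs. */
class ParallelSegmentWriter implements SegmentWriter {
    private final Map<String, Integer> fieldToSlice;
    private final List<BufferingSegmentWriter> slices = new ArrayList<>();

    ParallelSegmentWriter(Map<String, Integer> fieldToSlice, int numSlices) {
        this.fieldToSlice = fieldToSlice;
        for (int i = 0; i < numSlices; i++) {
            slices.add(new BufferingSegmentWriter());
        }
    }

    public void addDocument(Map<String, String> doc) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < slices.size(); i++) {
            parts.add(new LinkedHashMap<>());
        }
        // Unmapped fields fall back to slice 0 (an assumption of this sketch).
        for (Map.Entry<String, String> field : doc.entrySet()) {
            int slice = fieldToSlice.getOrDefault(field.getKey(), 0);
            parts.get(slice).put(field.getKey(), field.getValue());
        }
        // Every slice receives an entry, possibly empty, so docIDs stay in sync.
        for (int i = 0; i < slices.size(); i++) {
            slices.get(i).addDocument(parts.get(i));
        }
    }

    public int docCount() { return slices.get(0).docCount(); }

    BufferingSegmentWriter slice(int i) { return slices.get(i); }
}

/** Hypothetical manager: decides when to flush, unaware of parallelism. */
class IndexManager {
    private final Supplier<SegmentWriter> writerFactory;
    private final int maxBufferedDocs;
    private SegmentWriter current;
    int flushedSegments = 0;

    IndexManager(Supplier<SegmentWriter> writerFactory, int maxBufferedDocs) {
        this.writerFactory = writerFactory;
        this.maxBufferedDocs = maxBufferedDocs;
        this.current = writerFactory.get();
    }

    void addDocument(Map<String, String> doc) {
        current.addDocument(doc);
        if (current.docCount() >= maxBufferedDocs) {
            flushedSegments++;              // stand-in for a real flush
            current = writerFactory.get();  // start the next segment
        }
    }
}
```

Passing `() -> new ParallelSegmentWriter(routing, 2)` or
`BufferingSegmentWriter::new` as the factory changes how a segment is laid
out, while the flush logic in the manager stays the same.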

You can keep thinking about the whole index as a collection of segments;
it will just be a matrix of segments instead of a one-dimensional
list.

E.g. the norms could in the future be a parallel segment with a single
column-stride field that you can update by writing a new generation of
the parallel segment.
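
The matrix view and the per-cell generation can be sketched as follows.
This is an illustrative model only: SegmentMatrix, its methods and the
"_1.norms.gen2" naming are hypothetical and do not reflect Lucene's file
naming or any existing API. Rows are segments (docID ranges), columns are
parallel slices such as postings or norms.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of an index as a matrix of segments. Not a Lucene API. */
class SegmentMatrix {
    private final String[] columns;
    // One row per segment; each entry is the generation of that cell.
    private final List<long[]> generations = new ArrayList<>();
    // All cells of a row cover the same documents, so one count per row.
    private final List<Integer> docCounts = new ArrayList<>();

    SegmentMatrix(String... columns) {
        this.columns = columns.clone();
    }

    /** Adds a row: a new segment with a generation-0 cell in every column. */
    int addSegment(int docCount) {
        generations.add(new long[columns.length]);
        docCounts.add(docCount);
        return generations.size() - 1;
    }

    /** Rewrites one cell by bumping its generation; other cells are untouched. */
    long rewrite(int segment, String column) {
        return ++generations.get(segment)[indexOf(column)];
    }

    int docCount(int segment) {
        return docCounts.get(segment);
    }

    /** Illustrative label for a cell, built from row, column and generation. */
    String cellName(int segment, String column) {
        long gen = generations.get(segment)[indexOf(column)];
        return "_" + segment + "." + column + ".gen" + gen;
    }

    private int indexOf(String column) {
        for (int i = 0; i < columns.length; i++) {
            if (columns[i].equals(column)) {
                return i;
            }
        }
        throw new IllegalArgumentException("unknown column: " + column);
    }
}
```

Updating norms for one segment then amounts to `rewrite(segment, "norms")`,
which leaves the postings cell of the same row at its old generation.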

Things like two-dimensional merge policies will nicely fit into this
model.

Different SegmentWriter implementations will allow you to write single
segments in different ways, e.g. doc-at-a-time (the default one with
addDocument()) or term-at-a-time (like addIndexes*() works).
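
To make the two write modes concrete, here is a simplified sketch in plain
Java. PostingsBuilder and both of its methods are hypothetical; the
term-at-a-time path only mimics the general idea of appending existing
posting lists with shifted docIDs and is not how addIndexes*() is actually
implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical: builds one segment's postings (term -> docIDs) in two ways. */
class PostingsBuilder {
    final Map<String, List<Integer>> postings = new TreeMap<>();
    int docCount = 0;

    /** Doc-at-a-time: consume one document's terms and assign the next docID. */
    void addDocument(String... terms) {
        int docId = docCount++;
        for (String term : terms) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each docID at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    /** Term-at-a-time: append existing posting lists, shifting their docIDs. */
    void addPostings(Map<String, List<Integer>> other, int otherDocCount) {
        int base = docCount;
        for (Map.Entry<String, List<Integer>> entry : other.entrySet()) {
            List<Integer> list =
                postings.computeIfAbsent(entry.getKey(), t -> new ArrayList<>());
            for (int docId : entry.getValue()) {
                list.add(base + docId);
            }
        }
        docCount += otherDocCount;
    }
}
```

Both paths end up with the same postings structure, which is why they can
sit behind a common writer interface.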

So I agree we can achieve updating posting lists the way you describe,
but it will then be limited to posting lists. If we allow (re)writing
*segments* in both dimensions, I think we will create a more flexible
approach that is independent of which data structures we add to Lucene,
as long as they are not global to the index but per-segment, as most
of Lucene's structures are today.

What do you think? Of course I don't want to over-complicate all this,
but if we can get LUCENE-2026 right, I think we can implement parallel
indexing in this segment-oriented way nicely.

> Parallel incremental indexing
> -----------------------------
>
>                 Key: LUCENE-1879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1879
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 3.1
>
>         Attachments: parallel_incremental_indexing.tar
>
>
> A new feature that allows building parallel indexes and keeping them in sync 
> on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


