[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773663#action_12773663 ]
Michael Busch commented on LUCENE-1879:
---------------------------------------

I realize the current implementation attached here is quite complicated, because it works on top of Lucene's APIs. However, I really like its flexibility. Right now you can easily rewrite certain parallel indexes without touching others. I use it in quite different ways. E.g. you can easily load one parallel index into a RAMDirectory or onto an SSD and leave the other ones on conventional disk.

LUCENE-2025 only optimizes a certain use case of parallel indexing, where you want to (re)write a parallel index containing *only* posting lists. This will especially improve scenarios like the one Yonik pointed out a while ago on java-dev, where you want to update only a few documents, not e.g. a certain field for all documents. In other use cases it is certainly desirable to have a parallel index that contains a store. It really depends on what data you want to update individually.

The version of parallel indexing that goes into Lucene's core I envision quite differently from the current patch here. That's why I'd like to refactor the IndexWriter (LUCENE-2026) into a SegmentWriter and, let's call it, an IndexManager (the component that controls flushing, merging, etc.). You can then have a ParallelSegmentWriter, which partitions the data into parallel segments, and the IndexManager can behave the same way as before. You can keep thinking about the whole index as a collection of segments; it will just be a matrix of segments now instead of a one-dimensional list.

E.g. the norms could in the future be a parallel segment with a single column-stride field that you can update by writing a new generation of the parallel segment. Things like two-dimensional merge policies will fit nicely into this model.

Different SegmentWriter implementations will allow you to write single segments in different ways, e.g. doc-at-a-time (the default one, with addDocument()) or term-at-a-time (the way addIndexes*() works).

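To make the "matrix of segments" picture above a bit more concrete, here is a minimal sketch in plain Java. It is not Lucene code and not the attached patch: SegmentMatrix, ParallelSegment, addSegment() and rewrite() are hypothetical names made up for illustration, and the sketch only models the bookkeeping (rows are segments, columns are parallel indexes, each cell carries a generation). It assumes that rewriting one parallel segment simply replaces that cell with a new generation and leaves all other cells alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model, not a Lucene API: the index as a matrix of segments. */
final class SegmentMatrix {

    /** One cell: the slice of one segment (row) that lives in one parallel index (column). */
    static final class ParallelSegment {
        final String name;
        final int generation;

        ParallelSegment(String name, int generation) {
            this.name = name;
            this.generation = generation;
        }

        @Override
        public String toString() {
            return name + "@gen" + generation;
        }
    }

    private final int columns;
    private final List<ParallelSegment[]> rows = new ArrayList<>();

    SegmentMatrix(int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("need at least one column");
        }
        this.columns = columns;
    }

    /** Adds a new segment (row) with one generation-0 cell per column; returns the row index. */
    int addSegment() {
        int row = rows.size();
        ParallelSegment[] cells = new ParallelSegment[columns];
        for (int c = 0; c < columns; c++) {
            cells[c] = new ParallelSegment("_" + row + "_p" + c, 0);
        }
        rows.add(cells);
        return row;
    }

    /** Rewrites a single cell as a new generation; every other cell is left untouched. */
    ParallelSegment rewrite(int row, int column) {
        ParallelSegment old = rows.get(row)[column];
        ParallelSegment next = new ParallelSegment(old.name, old.generation + 1);
        rows.get(row)[column] = next;
        return next;
    }

    ParallelSegment get(int row, int column) {
        return rows.get(row)[column];
    }

    int segmentCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        // Column 0 stands for the main index, column 1 for e.g. a norms-only parallel index.
        SegmentMatrix index = new SegmentMatrix(2);
        index.addSegment();
        index.addSegment();
        index.rewrite(1, 1); // update only the second column of the second segment
        for (int r = 0; r < index.segmentCount(); r++) {
            System.out.println(index.get(r, 0) + " | " + index.get(r, 1));
        }
    }
}
```

In this toy model a one-dimensional index is just the special case of a single column, and a two-dimensional merge policy would be something that decides which rows to combine and which columns to rewrite. How flushing and merging would actually be split between a SegmentWriter and an IndexManager is left open here.
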
So I agree we can achieve updating posting lists the way you describe, but it will then be limited to posting lists. If we allow (re)writing *segments* in both dimensions, I think we will create a more flexible approach that is independent of the data structures we add to Lucene, as long as they are not global to the index but per-segment, as most of Lucene's structures are today.

What do you think? Of course I don't want to over-complicate all this, but if we can get LUCENE-2026 right, I think we can implement parallel indexing nicely in this segment-oriented way.

> Parallel incremental indexing
> -----------------------------
>
>                 Key: LUCENE-1879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1879
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 3.1
>
>         Attachments: parallel_incremental_indexing.tar
>
>
> A new feature that allows building parallel indexes and keeping them in sync
> on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.