The big picture includes what you describe, but also other usages, such as loading different slices into memory, introducing the complementary API to ParallelReader, querying a single slice only, etc.
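
To make that usage concrete, here is a rough sketch with today's read-side API: two slices of the same logical index, one loaded into RAM and one on disk, searched together through ParallelReader or individually. The directory paths and the field name are made up, and keeping the slices row-aligned on the write side is exactly the part that does not exist yet.

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class SliceUsageSketch {
  public static void main(String[] args) throws Exception {
    // Two slices of the same logical index, kept in separate directories.
    Directory diskSlice = FSDirectory.open(new File("/path/to/slice-main"));
    // Load the small, frequently rewritten slice entirely into memory.
    Directory ramSlice = new RAMDirectory(FSDirectory.open(new File("/path/to/slice-fast")));

    // Combined view over all slices; the sub-readers must have aligned doc IDs.
    ParallelReader parallel = new ParallelReader();
    parallel.add(IndexReader.open(diskSlice, true));
    parallel.add(IndexReader.open(ramSlice, true));
    IndexSearcher fullSearcher = new IndexSearcher(parallel);

    // Querying a single slice only: just open that slice on its own.
    IndexReader fastOnly = IndexReader.open(ramSlice, true);
    IndexSearcher sliceSearcher = new IndexSearcher(fastOnly);
    TopDocs hits = sliceSearcher.search(new TermQuery(new Term("category", "books")), 10);
    System.out.println("hits in the fast slice: " + hits.totalHits);

    sliceSearcher.close();
    fastOnly.close();
    fullSearcher.close();
    parallel.close();
  }
}
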
Shai

On Thu, Apr 22, 2010 at 5:04 PM, Michael McCandless <[email protected]> wrote:

> I like the "slice" term, but, can we drop the 'd'? Ie "SliceWriter" and "SliceReader".
>
> BTW, what are the big picture use cases for slices? Is this just an approximation to incremental indexing? EG if you suddenly need to add a new field to all docs in your index, rather than fully reindexing all docs, you can just add a new slice?
>
> Or if you need to change the values across some number of docs for a single field, it's better to rewrite the entire slice for that field than fully reindex those docs?
>
> Mike
>
> On Wed, Apr 21, 2010 at 4:16 PM, Doron Cohen <[email protected]> wrote:
> > It is somewhat confusing that "Parallel" in this discussion refers to two different things - in PI it stands for an index that is sliced into N slices which are in turn accessed in parallel, and in PDW it stands for two document writers which run in parallel and update the same index. Perhaps it would be clearer to rename PW to SlicedWriter and similarly ParallelReader to SlicedReader; then each of them is working on a slice, and parallelism indicates what is done for speed (although slicing is also for speed, but in a different manner). This would also remove the confusion between ParallelReader and ParallelMultiSearcher.
> >
> > (Side comment/thoughts - if one had attempted to implement a SlicedWriter way back when each added document created a segment in memory, and at flush those segments were merged - well, then, a sliced IW would just create two segments - A and B - out of each document (assuming two slices) and at flush merge all A's into A* and all B's into B*. Today added docs are maintained more efficiently, supporting deletions, merge policies, file-deletion-policy, commit points, crash recovery, NRT and more - and a sliced DW is more complex than just having two DWs, each working on its part of the document... The simplicity of the old design was a beauty - reusing the segment concept over and over - though it could not achieve the nice features of today. Mmm... reading this again, I'm not sure that with a segment per doc things would really be simpler - IW would still need to manage both....)
> >
> > Doron
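
On the "suddenly need to add a new field to all docs" case: with today's API this can only be approximated from the outside, by building a second, row-aligned index that holds just the new field and combining it with ParallelReader at search time. The paths, the field name and the lookupPopularity() helper below are invented for illustration; keeping the doc IDs aligned across deletes and merges is precisely what this approach cannot guarantee, and what a slice-aware IW would solve.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NewFieldSliceSketch {
  public static void main(String[] args) throws IOException {
    Directory mainDir = FSDirectory.open(new File("/path/to/main-index"));
    Directory sliceDir = FSDirectory.open(new File("/path/to/new-field-slice"));

    IndexReader main = IndexReader.open(mainDir, true);
    IndexWriter sliceWriter = new IndexWriter(sliceDir, new KeywordAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    // The new slice must contain exactly one document per doc ID of the main
    // index, added in the same order, or ParallelReader cannot line them up.
    for (int docID = 0; docID < main.maxDoc(); docID++) {
      Document d = new Document();
      d.add(new Field("popularity", lookupPopularity(docID),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      sliceWriter.addDocument(d);
    }
    sliceWriter.close();
    main.close();
  }

  // Hypothetical helper: wherever the new per-document value comes from.
  private static String lookupPopularity(int docID) {
    return Integer.toString(docID % 100);
  }
}
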
> > On Wed, Apr 21, 2010 at 8:12 PM, Shai Erera <[email protected]> wrote:
> >>
> >> I don't advocate developing PI as an external entity to Lucene - you've already done that! :)
> >>
> >> We should open up IW enough to develop PI efficiently, but I think we should always allow some freedom and flexibility to using applications. If IW simply created a Parallel DW and handled the merges on its own, as if those were just one big happy bunch of Directories, then apps won't be able to plug in their own custom IWs, such as a FacetedIW maybe (one which handles the facets in the application).
> >>
> >> If that 'openness' of IW is the SegmentsWriter API, then that might be enough. I imagine apps will want to control things like add/update/delete of documents, but it should be IW which controls the MP and MS for all slices (you could still give your own, but it will be one MP and MS for all slices, and not one per slice). Also, methods like addIndexes* probably cannot be supported by PI, unless we add a special method signature which accepts ParallelWriter[] or some such.
> >>
> >> Currently, I view SegmentWriter as DocumentWriter, and so I think I'm operating under such low-level assumptions. But since I work over IW, some things are buried too low. Maybe we should refactor IW first, before PI is developed ... any estimates on when PerThread DW is going to be ready? :)
> >>
> >> Shai
> >>
> >> On Wed, Apr 21, 2010 at 6:48 PM, Michael Busch <[email protected]> wrote:
> >>>
> >>> Yeah, sounds like we have the same things in mind here. In fact, this is pretty similar to what we discussed a while ago on LUCENE-2026, I think.
> >>>
> >>> SegmentWriter could be a higher level interface with more than one implementation. E.g. there could be one SegmentWriter that supports appending documents (i.e. the DocumentsWriter today) and also one that allows adding terms at-a-time, e.g. similar to what IW.addIndexes*() does today. Often when you rewrite entire parallel slices you don't want to use addDocument(). E.g. when you read from a source slice, modify some data and write a new version of that slice, it can be dramatically faster to write posting list after posting list, because you avoid parallel I/O and a lot of seeks. (By dramatically faster I mean e.g. 24 hrs vs. 8 mins - actual numbers from an implementation I had at IBM...)
> >>>
> >>> Further, I imagine we could utilize the slice concept within Lucene. The store could be a separate slice, and so could be the norms and the new flexible scoring data structures. It's then super easy to turn those off or rewrite them individually (see LUCENE-2025). Often parallel indexes don't need a store or norms, so this slice concept makes total sense in my opinion. Norms actually work like this already: you can rewrite them, which bumps up their generation number. We just have to make this concept more abstract, so that it can be used for any kind of slice.
> >>>
> >>> Many people have also asked about allowing Lucene to manage external data structures. I think these changes would allow exactly that: just implement your external data structure as a slice, and Lucene will call your code when merging, deletions, adds happen. Cool! :)
> >>>
> >>> @Shai: If we implement parallel indexing outside of Lucene's core then we have some of the same drawbacks as with the current master-slave approach. I'm especially worried about how that would then work with realtime indexing (both searchable RAM buffer and also NRT). I think PI must be completely segment-aware. Then it should fit very nicely into realtime indexing, which is also very cool!
> >>>
> >>> Michael
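
Here is a rough sketch of the abstraction Michael describes; none of these classes exist in Lucene and every method name is invented. One implementation appends whole documents (what DocumentsWriter does today), the other is fed term-at-a-time, e.g. for rewriting an entire parallel slice posting list by posting list.

import java.io.IOException;

import org.apache.lucene.document.Document;

/** Hypothetical base class; DocumentsWriter / a DWPT would be one implementation. */
abstract class SegmentWriter {
  /** Document-at-a-time appending, as DocumentsWriter works today. */
  public abstract void addDocument(Document doc) throws IOException;

  /** Flush whatever is buffered into new files for this segment/slice. */
  public abstract void flush() throws IOException;
}

/** Hypothetical writer that is fed term-at-a-time, e.g. when a whole parallel slice
 *  is rewritten: read a posting list from the source slice, transform it, and write
 *  it back sequentially - no per-document analysis, far fewer seeks. */
abstract class TermsSliceWriter extends SegmentWriter {
  public abstract void startTerm(String field, String termText) throws IOException;
  public abstract void addPosting(int docID, int freq, int[] positions) throws IOException;
  public abstract void finishTerm() throws IOException;
}
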
> >>> On 4/21/10 8:06 AM, Michael McCandless wrote:
> >>>>
> >>>> I do think the idea of an abstract class (or interface) SegmentWriter is compelling.
> >>>>
> >>>> Each DWPT would be a [single-threaded] SegmentWriter.
> >>>>
> >>>> And then we'd make a MultiThreadedSegmentWriterWrapper (manages a collection of SegmentWriters, delegating to them, aggregating RAM used across all, manages picking which ones to flush, etc.).
> >>>>
> >>>> Then, a SlicedSegmentWriter (say) would write to separate slices, single threaded, and then you could make it multi-threaded by wrapping it with the above class.
> >>>>
> >>>> Though SegmentWriter isn't a great name since it would in general write to multiple segments. Indexer is a little too broad though :)
> >>>>
> >>>> Something like that maybe?
> >>>>
> >>>> Also, allowing an app to directly control the underlying SegmentWriters inside IndexWriter (instead of letting the multi-threaded wrapper decide for you) is compelling for way advanced apps, I think. EG your app may know it's done indexing from source A for a while, so you should right now go and flush it (whereas the default "flush the one using the most RAM" could leave that source unflushed for quite a while, tying up RAM, unless we do some kind of LRU flushing policy or something).
> >>>>
> >>>> Mike
> >>>>
> >>>> On Wed, Apr 21, 2010 at 2:27 AM, Shai Erera <[email protected]> wrote:
> >>>>>
> >>>>> I'm not sure that a Parallel DW would work for PI because DW is too internal to IW. Currently, the approach I've been thinking about for PI is to tackle it from a high level, e.g. allow the application to pass a Directory, or even an IW instance, and PI will play the coordinator role, ensuring that segment merges happen in lockstep across all the slices, implementing two-phase operations etc. A Parallel DW then does not fit nicely with that approach (unless we want to refactor how IW works completely) because DW is not aware of the Directory, and if PI indeed works over IW instances, then each will have its own DW.
> >>>>>
> >>>>> So there are two basic approaches we can take for PI (following the current architecture) - either let PI manage IW, or make PI a sort of IW itself, which handles events at a much lower level. While the latter is more robust (and based on current limitations I'm running into, might even be easier to do), it lacks the flexibility of allowing the app to plug in any IW it wants. That requirement is also important if the application wants to use PI in scenarios where it keeps some slices in RAM and some on disk, or it wants to control more closely which fields go to which slice, so that it can at some point in time "rebuild" a certain slice outside PI and replace the existing slice in PI with the new one ...
> >>>>>
> >>>>> We should probably continue the discussion on PI, so I suggest we either move it to another thread or to the issue directly.
> >>>>>
> >>>>> Mike - I agree with you that we should keep the life of the application developers easy and that having IW itself support concurrency is beneficial. Like I said ... it was just a thought which was aimed at keeping our life (Lucene developers) easier, but that probably comes second compared to app-devs' life :). I'm also not at all sure that it would have made our life easier ...
> >>>>>
> >>>>> So I'm good if you want to drop the discussion.
> >>>>>
> >>>>> Shai
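
To illustrate the "PI manages application-supplied IW instances" approach, here is a minimal sketch of the coordinator role for commits only. The class is hypothetical; prepareCommit()/commit() are IndexWriter's existing two-phase commit API. Coordinating merges and deletes so the slices stay doc-ID-aligned - the genuinely hard part - is left out.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.index.IndexWriter;

/** Hypothetical coordinator: the app plugs in one IndexWriter per slice, and the
 *  coordinator turns commit into a two-phase operation across all of them. */
class ParallelIndexCoordinator {
  private final List<IndexWriter> sliceWriters;

  ParallelIndexCoordinator(IndexWriter... sliceWriters) {
    this.sliceWriters = Arrays.asList(sliceWriters);
  }

  void commitAll() throws IOException {
    // Phase 1: each slice does the expensive work (flush + fsync) without
    // making the new segments visible yet.
    for (IndexWriter w : sliceWriters) {
      w.prepareCommit();
    }
    // Phase 2: publish. Failure handling between the phases (rolling every
    // slice back so doc IDs stay aligned) is where the real complexity lives.
    for (IndexWriter w : sliceWriters) {
      w.commit();
    }
  }
}
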
> >>>>> On Tue, Apr 20, 2010 at 8:16 PM, Michael Busch <[email protected]> wrote:
> >>>>>>
> >>>>>> On 4/19/10 10:25 PM, Shai Erera wrote:
> >>>>>>>
> >>>>>>> It will definitely simplify multi-threaded handling for IW extensions like Parallel Index …
> >>>>>>
> >>>>>> I'm keeping Parallel indexing in mind. After we have separate DWPTs I'd like to introduce parallel DWPTs that write different slices. Synchronization should not be a big worry then, because writing is single-threaded.
> >>>>>>
> >>>>>> We could introduce a new abstract class SegmentWriter, which DWPT would implement. An extension would be ParallelSegmentWriter, which would manage multiple SegmentWriters. Or maybe SegmentSliceWriter would be a better name.
> >>>>>>
> >>>>>> Michael
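
And a sketch of what that ParallelSegmentWriter might look like, building on the hypothetical SegmentWriter base class sketched earlier in this thread. All names are invented; how fields are assigned to slices would be an application choice, and a multi-threaded wrapper (one such writer per thread, as with DWPT) would sit above it.

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

/** Hypothetical: one SegmentWriter per slice, each receiving only its own fields. */
class ParallelSegmentWriter extends SegmentWriter {
  private final Map<String, Integer> fieldToSlice; // field name -> slice ordinal
  private final SegmentWriter[] slices;

  ParallelSegmentWriter(Map<String, Integer> fieldToSlice, SegmentWriter... slices) {
    this.fieldToSlice = fieldToSlice;
    this.slices = slices;
  }

  @Override
  public void addDocument(Document doc) throws IOException {
    // Split the incoming document into one sub-document per slice.
    Document[] perSlice = new Document[slices.length];
    for (int i = 0; i < perSlice.length; i++) {
      perSlice[i] = new Document();
    }
    for (Fieldable f : doc.getFields()) {
      Integer slice = fieldToSlice.get(f.name());
      perSlice[slice == null ? 0 : slice].add(f); // unmapped fields go to slice 0
    }
    // Single-threaded and in lockstep, so every slice assigns the same doc ID.
    for (int i = 0; i < slices.length; i++) {
      slices[i].addDocument(perSlice[i]);
    }
  }

  @Override
  public void flush() throws IOException {
    for (SegmentWriter slice : slices) {
      slice.flush(); // slices must flush together to stay aligned
    }
  }
}
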
