The big picture includes what you describe, but also other usages, such as loading different slices into memory, introducing the complementary API to ParallelReader, querying a single slice only, etc.
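
To make that usage concrete, here is a rough sketch with today's read-side API: two slices of the same logical index, one loaded into RAM and one on disk, searched together through ParallelReader or individually. The directory paths and the field name are made up, and keeping the slices row-aligned on the write side is exactly the part that does not exist yet.

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class SliceUsageSketch {
  public static void main(String[] args) throws Exception {
    // Two slices of the same logical index, kept in separate directories.
    Directory diskSlice = FSDirectory.open(new File("/path/to/slice-main"));
    // Load the small, frequently rewritten slice entirely into memory.
    Directory ramSlice = new RAMDirectory(FSDirectory.open(new File("/path/to/slice-fast")));

    // Combined view over all slices; the sub-readers must have aligned doc IDs.
    ParallelReader parallel = new ParallelReader();
    parallel.add(IndexReader.open(diskSlice, true));
    parallel.add(IndexReader.open(ramSlice, true));
    IndexSearcher fullSearcher = new IndexSearcher(parallel);

    // Querying a single slice only: just open that slice on its own.
    IndexReader fastOnly = IndexReader.open(ramSlice, true);
    IndexSearcher sliceSearcher = new IndexSearcher(fastOnly);
    TopDocs hits = sliceSearcher.search(new TermQuery(new Term("category", "books")), 10);
    System.out.println("hits in the fast slice: " + hits.totalHits);

    sliceSearcher.close();
    fastOnly.close();
    fullSearcher.close();
    parallel.close();
  }
}
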
Shai

On Thu, Apr 22, 2010 at 5:04 PM, Michael McCandless <[email protected]> wrote:

> I like the "slice" term, but, can we drop the 'd'? Ie "SliceWriter" and "SliceReader".
>
> BTW, what are the big picture use cases for slices? Is this just an approximation to incremental indexing? EG if you suddenly need to add a new field to all docs in your index, rather than fully reindexing all docs, you can just add a new slice?
>
> Or if you need to change the values across some number of docs for a single field, it's better to rewrite the entire slice for that field than fully reindex those docs?
>
> Mike
>
> On Wed, Apr 21, 2010 at 4:16 PM, Doron Cohen <[email protected]> wrote:
> > It is somewhat confusing that "Parallel" in this discussion refers to two different things - in PI it stands for an index that is sliced into N slices which are in turn accessed in parallel, and in PDW it stands for two document writers which run in parallel and update the same index. Perhaps it would be clearer to rename PW to SlicedWriter and similarly ParallelReader to SlicedReader; then each of them is working on a slice, and parallelism indicates what is done for speed (although slicing is also for speed, but in a different manner). This would also remove the confusion between ParallelReader and ParallelMultiSearcher.
> >
> > (Side comment/thoughts - if one had attempted to implement a SlicedWriter way back when each added document created a segment in memory, and at flush those segments were merged - well, then, a sliced IW would just create two segments - A and B - out of each document (assuming two slices) and at flush merge all A's into A* and all B's into B*. Today added docs are maintained more efficiently, supporting deletions, merge policies, file-deletion-policy, commit points, crash recovery, NRT and more - and a sliced DW is more complex than just having two DWs, each working on its part of the document... The simplicity of the old design was a beauty - reusing the segment concept over and over - though it could not achieve the nice features of today. Mmm... reading this again, I'm not sure that with a segment per doc things would really be simpler - IW would still need to manage both....)
> >
> > Doron
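
On the "suddenly need to add a new field to all docs" case: with today's API this can only be approximated from the outside, by building a second, row-aligned index that holds just the new field and combining it with ParallelReader at search time. The paths, the field name and the lookupPopularity() helper below are invented for illustration; keeping the doc IDs aligned across deletes and merges is precisely what this approach cannot guarantee, and what a slice-aware IW would solve.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NewFieldSliceSketch {
  public static void main(String[] args) throws IOException {
    Directory mainDir = FSDirectory.open(new File("/path/to/main-index"));
    Directory sliceDir = FSDirectory.open(new File("/path/to/new-field-slice"));

    IndexReader main = IndexReader.open(mainDir, true);
    IndexWriter sliceWriter = new IndexWriter(sliceDir, new KeywordAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    // The new slice must contain exactly one document per doc ID of the main
    // index, added in the same order, or ParallelReader cannot line them up.
    for (int docID = 0; docID < main.maxDoc(); docID++) {
      Document d = new Document();
      d.add(new Field("popularity", lookupPopularity(docID),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      sliceWriter.addDocument(d);
    }
    sliceWriter.close();
    main.close();
  }

  // Hypothetical helper: wherever the new per-document value comes from.
  private static String lookupPopularity(int docID) {
    return Integer.toString(docID % 100);
  }
}
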
> > On Wed, Apr 21, 2010 at 8:12 PM, Shai Erera <[email protected]> wrote:
> >>
> >> I don't advocate developing PI as an external entity to Lucene - you've already done that! :)
> >>
> >> We should open up IW enough to develop PI efficiently, but I think we should always allow some freedom and flexibility to using applications. If IW simply created a Parallel DW and handled the merges on its own, as if those were just one big happy bunch of Directories, then apps won't be able to plug in their own custom IWs, such as a FacetedIW maybe (one which handles the facets in the application).
> >>
> >> If that 'openness' of IW is the SegmentsWriter API, then that might be enough. I imagine apps will want to control things like add/update/delete of documents, but it should be IW which controls the MP and MS for all slices (you could still give your own, but it will be one MP and MS for all slices, and not one per slice). Also, methods like addIndexes* probably cannot be supported by PI, unless we add a special method signature which accepts ParallelWriter[] or some such.
> >>
> >> Currently, I view SegmentWriter as DocumentWriter, and so I think I'm operating under such low-level assumptions. But since I work over IW, some things are buried too low. Maybe we should refactor IW first, before PI is developed ... any estimates on when PerThread DW is going to be ready? :)
> >>
> >> Shai
> >>
> >> On Wed, Apr 21, 2010 at 6:48 PM, Michael Busch <[email protected]> wrote:
> >>>
> >>> Yeah, sounds like we have the same things in mind here. In fact, this is pretty similar to what we discussed a while ago on LUCENE-2026, I think.
> >>>
> >>> SegmentWriter could be a higher level interface with more than one implementation. E.g. there could be one SegmentWriter that supports appending documents (i.e. the DocumentsWriter today) and also one that allows adding terms at-a-time, e.g. similar to what IW.addIndexes*() does today. Often when you rewrite entire parallel slices you don't want to use addDocument(). E.g. when you read from a source slice, modify some data and write a new version of that slice, it can be dramatically faster to write posting list after posting list, because you avoid parallel I/O and a lot of seeks. (By dramatically faster I mean e.g. 24 hrs vs. 8 mins - actual numbers from an implementation I had at IBM...)
> >>>
> >>> Further, I imagine we could utilize the slice concept within Lucene. The store could be a separate slice, and so could be the norms and the new flexible scoring data structures. It's then super easy to turn those off or rewrite them individually (see LUCENE-2025). Often parallel indexes don't need a store or norms, so this slice concept makes total sense in my opinion. Norms actually work like this already: you can rewrite them, which bumps up their generation number. We just have to make this concept more abstract, so that it can be used for any kind of slice.
> >>>
> >>> Many people have also asked about allowing Lucene to manage external data structures. I think these changes would allow exactly that: just implement your external data structure as a slice, and Lucene will call your code when merging, deletions, adds happen. Cool! :)
> >>>
> >>> @Shai: If we implement parallel indexing outside of Lucene's core then we have some of the same drawbacks as with the current master-slave approach. I'm especially worried about how that would then work with realtime indexing (both searchable RAM buffer and also NRT). I think PI must be completely segment-aware. Then it should fit very nicely into realtime indexing, which is also very cool!
> >>>
> >>> Michael
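
Here is a rough sketch of the abstraction Michael describes; none of these classes exist in Lucene and every method name is invented. One implementation appends whole documents (what DocumentsWriter does today), the other is fed term-at-a-time, e.g. for rewriting an entire parallel slice posting list by posting list.

import java.io.IOException;

import org.apache.lucene.document.Document;

/** Hypothetical base class; DocumentsWriter / a DWPT would be one implementation. */
abstract class SegmentWriter {
  /** Document-at-a-time appending, as DocumentsWriter works today. */
  public abstract void addDocument(Document doc) throws IOException;

  /** Flush whatever is buffered into new files for this segment/slice. */
  public abstract void flush() throws IOException;
}

/** Hypothetical writer that is fed term-at-a-time, e.g. when a whole parallel slice
 *  is rewritten: read a posting list from the source slice, transform it, and write
 *  it back sequentially - no per-document analysis, far fewer seeks. */
abstract class TermsSliceWriter extends SegmentWriter {
  public abstract void startTerm(String field, String termText) throws IOException;
  public abstract void addPosting(int docID, int freq, int[] positions) throws IOException;
  public abstract void finishTerm() throws IOException;
}
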
> >>> On 4/21/10 8:06 AM, Michael McCandless wrote:
> >>>>
> >>>> I do think the idea of an abstract class (or interface) SegmentWriter is compelling.
> >>>>
> >>>> Each DWPT would be a [single-threaded] SegmentWriter.
> >>>>
> >>>> And then we'd make a MultiThreadedSegmentWriterWrapper (manages a collection of SegmentWriters, delegating to them, aggregating RAM used across all, manages picking which ones to flush, etc.).
> >>>>
> >>>> Then, a SlicedSegmentWriter (say) would write to separate slices, single threaded, and then you could make it multi-threaded by wrapping it with the above class.
> >>>>
> >>>> Though SegmentWriter isn't a great name since it would in general write to multiple segments. Indexer is a little too broad though :)
> >>>>
> >>>> Something like that maybe?
> >>>>
> >>>> Also, allowing an app to directly control the underlying SegmentWriters inside IndexWriter (instead of letting the multi-threaded wrapper decide for you) is compelling for way advanced apps, I think. EG your app may know it's done indexing from source A for a while, so you should right now go and flush it (whereas the default "flush the one using the most RAM" could leave that source unflushed for quite a while, tying up RAM, unless we do some kind of LRU flushing policy or something).
> >>>>
> >>>> Mike
> >>>>
> >>>> On Wed, Apr 21, 2010 at 2:27 AM, Shai Erera <[email protected]> wrote:
> >>>>>
> >>>>> I'm not sure that a Parallel DW would work for PI because DW is too internal to IW. Currently, the approach I've been thinking about for PI is to tackle it from a high level, e.g. allow the application to pass a Directory, or even an IW instance, and PI will play the coordinator role, ensuring that segment merges happen in lockstep across all the slices, implementing two-phase operations etc. A Parallel DW then does not fit nicely with that approach (unless we want to refactor how IW works completely) because DW is not aware of the Directory, and if PI indeed works over IW instances, then each will have its own DW.
> >>>>>
> >>>>> So there are two basic approaches we can take for PI (following the current architecture) - either let PI manage IW, or make PI a sort of IW itself, which handles events at a much lower level. While the latter is more robust (and based on current limitations I'm running into, might even be easier to do), it lacks the flexibility of allowing the app to plug in any IW it wants. That requirement is also important if the application wants to use PI in scenarios where it keeps some slices in RAM and some on disk, or it wants to control more closely which fields go to which slice, so that it can at some point in time "rebuild" a certain slice outside PI and replace the existing slice in PI with the new one ...
> >>>>>
> >>>>> We should probably continue the discussion on PI, so I suggest we either move it to another thread or to the issue directly.
> >>>>>
> >>>>> Mike - I agree with you that we should keep the life of the application developers easy and that having IW itself support concurrency is beneficial. Like I said ... it was just a thought which was aimed at keeping our life (Lucene developers) easier, but that probably comes second compared to app-devs' life :). I'm also not at all sure that it would have made our life easier ...
> >>>>>
> >>>>> So I'm good if you want to drop the discussion.
> >>>>>
> >>>>> Shai
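
To illustrate the "PI manages application-supplied IW instances" approach, here is a minimal sketch of the coordinator role for commits only. The class is hypothetical; prepareCommit()/commit() are IndexWriter's existing two-phase commit API. Coordinating merges and deletes so the slices stay doc-ID-aligned - the genuinely hard part - is left out.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.index.IndexWriter;

/** Hypothetical coordinator: the app plugs in one IndexWriter per slice, and the
 *  coordinator turns commit into a two-phase operation across all of them. */
class ParallelIndexCoordinator {
  private final List<IndexWriter> sliceWriters;

  ParallelIndexCoordinator(IndexWriter... sliceWriters) {
    this.sliceWriters = Arrays.asList(sliceWriters);
  }

  void commitAll() throws IOException {
    // Phase 1: each slice does the expensive work (flush + fsync) without
    // making the new segments visible yet.
    for (IndexWriter w : sliceWriters) {
      w.prepareCommit();
    }
    // Phase 2: publish. Failure handling between the phases (rolling every
    // slice back so doc IDs stay aligned) is where the real complexity lives.
    for (IndexWriter w : sliceWriters) {
      w.commit();
    }
  }
}
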
> >>>>> On Tue, Apr 20, 2010 at 8:16 PM, Michael Busch <[email protected]> wrote:
> >>>>>>
> >>>>>> On 4/19/10 10:25 PM, Shai Erera wrote:
> >>>>>>>
> >>>>>>> It will definitely simplify multi-threaded handling for IW extensions like Parallel Index …
> >>>>>>
> >>>>>> I'm keeping Parallel indexing in mind. After we have separate DWPTs I'd like to introduce parallel DWPTs that write different slices. Synchronization should not be a big worry then, because writing is single-threaded.
> >>>>>>
> >>>>>> We could introduce a new abstract class SegmentWriter, which DWPT would implement. An extension would be ParallelSegmentWriter, which would manage multiple SegmentWriters. Or maybe SegmentSliceWriter would be a better name.
> >>>>>>
> >>>>>> Michael
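
And a sketch of what that ParallelSegmentWriter might look like, building on the hypothetical SegmentWriter base class sketched earlier in this thread. All names are invented; how fields are assigned to slices would be an application choice, and a multi-threaded wrapper (one such writer per thread, as with DWPT) would sit above it.

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

/** Hypothetical: one SegmentWriter per slice, each receiving only its own fields. */
class ParallelSegmentWriter extends SegmentWriter {
  private final Map<String, Integer> fieldToSlice; // field name -> slice ordinal
  private final SegmentWriter[] slices;

  ParallelSegmentWriter(Map<String, Integer> fieldToSlice, SegmentWriter... slices) {
    this.fieldToSlice = fieldToSlice;
    this.slices = slices;
  }

  @Override
  public void addDocument(Document doc) throws IOException {
    // Split the incoming document into one sub-document per slice.
    Document[] perSlice = new Document[slices.length];
    for (int i = 0; i < perSlice.length; i++) {
      perSlice[i] = new Document();
    }
    for (Fieldable f : doc.getFields()) {
      Integer slice = fieldToSlice.get(f.name());
      perSlice[slice == null ? 0 : slice].add(f); // unmapped fields go to slice 0
    }
    // Single-threaded and in lockstep, so every slice assigns the same doc ID.
    for (int i = 0; i < slices.length; i++) {
      slices[i].addDocument(perSlice[i]);
    }
  }

  @Override
  public void flush() throws IOException {
    for (SegmentWriter slice : slices) {
      slice.flush(); // slices must flush together to stay aligned
    }
  }
}
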
