Re: Per-Thread DW and IW

Doron Cohen Wed, 21 Apr 2010 13:17:20 -0700

It is somewhat confusing that "Parallel" in this discussion refers to two
different things - in PI it stands for an index who is sliced into N slices
which are in turn accessed in Parallel, and in PDW it stands for two
document writers which run in parallel, and update the same index. Perhaps
it would be more clear to rename PW to SlicedWriter and similarly
ParallelReader to SlicedReader, then each of them is working on a slice, and
parallelism indicates what is done for speed (although slicing also is for
speed, but in a different manner). This would also remove the confusion
between ParallelReader and ParallelMultiSearcher.


(Side comment/thoughts - If one would have attempted to implement a
SlicedWriter way back when each added document was creating a segment in
memory, and at flush those segments were merged - well, then, a sliced IW
would just create two segments - A and B - out of each document (assuming
two slices) and at flush merge all A's into A* and all B's into B*. Today
added docs are maintained more efficiently, supporting deletions, merge
polices, file-deletion-policy, commit points, crash recovery, NRT and more -
and a sliced DW is more complex than just having two DW's each working on
its part of the document... The simplicity of the old design was a beauty -
reusing the segment concept over and over - though it could not achieve the
nice features of today. Mmm... reading this again not sure that with a
segment per doc things would be really simpler - IW would still need to
manage both....)

Doron

On Wed, Apr 21, 2010 at 8:12 PM, Shai Erera <[email protected]> wrote:

> I don't advocate to develop PI as an external entity to Lucene, you've
> already done that ! :)
>
> We should open up IW enough to develop PI efficiently, but I think we
> should always allow some freedom and flexibility to using applications. If
> IW simply created a Parallel DW, handle the merges on its own as if those
> are just one big happy bunch of Directories, then apps won't be able to plug
> in their own custom IWs, such as a FacetedIW maybe (one which handles the
> facets in the application).
>
> If that 'openness' of IW is the SegmentsWriter API, then that might be
> enough. I imagine apps will want to control things like add/update/delete of
> documents, but it should be IW which controls the MP and MS for all slices
> (you should give your own, but it will be one MP and MS for all slices, and
> not one per slice). Also, methods like addIndexes* probably cannot be
> supported by PI, unless we add a special method signature which accept
> ParallelWriter[] or some such.
>
> Currently, I view SegmentWriter as DocumentWriter, and so I think I'm
> operating under such low-level assumptions. But since I work over IW, some
> things are buried too low. Maybe we should refactor IW first, before PI is
> developed ... any estimates on when PerThread DW is going to be ready? :)
>
> Shai
>
>
> On Wed, Apr 21, 2010 at 6:48 PM, Michael Busch <[email protected]> wrote:
>
>> Yeah, sounds like we have the same things in mind here.  In fact, this is
>> pretty similar to what we discussed a while ago on LUCENE-2026 I think.
>>
>> SegmentWriter could be a higher level interface with more than one
>> implementation.  E.g. there could be one SegmentWriter that supports
>> appending documents (i.e. the DocumentsWriter today) and also one that
>> allows adding terms at-a-time, e.g. similar to what IW.addIndexes*() does
>> today.  Often when you rewrite entire parallel slices you don't want to use
>> addDocument().  E.g. when you read from a source slice, modify some data and
>> write a new version of that slice it can be dramatically faster to write
>> postinglist after postinglist,  because you avoid parallel I/O and a lot of
>> seeks. (with dramatically faster I mean e.g. 24 hrs vs. 8 mins, actual
>> numbers from an implementation I had at IBM...)
>>
>> Further, I imagine to utilize the slice concept within Lucene.  The store
>> could be a separate slice, and so could be the norms and the new flexible
>> scoring data structures.  It's then super easy to turn those off or rewrite
>> them individually (see LUCENE-2025).  Often parallel indexes don't need a
>> store or norms, so this slice concept makes total sense in my opinion.
>>  Norms actually works like this already, you can rewrite them which bumps up
>> their generation number.  We just have to make this concept more abstract,
>> so that it can be used for any kind of slice.
>> Many people have also asked about allowing Lucene to manage external data
>> structures.  I think these changes would allow exactly that:  just implement
>> your external data structure as a slice, and Lucene will call your code when
>> merging, deletions, adds happen. Cool! :)
>>
>> @Shai: If we implement Parallel indexing outside of Lucene's core then we
>> have some of the same drawbacks as with the current master-slave approach.
>>  I'm especially worried about how that would work then with realtime
>> indexing (both searchable RAM buffer and also NRT).  I think PI must be
>> completely segment-aware.  Then it should fit very nicely into realtime
>> indexing, which is also very cool!
>>
>>  Michael
>>
>>
>>
>> On 4/21/10 8:06 AM, Michael McCandless wrote:
>>
>>> I do think the idea of an abstract class (or interface) SegmentWriter
>>> is compelling.
>>>
>>> Each DWPT would be a [single-threaded] SegmentWriter.
>>>
>>> And then we'd make a MultiThreadedSegmentWriterWrapper (manages a
>>> collection of SegmentWriters, deleting to them, aggregating RAM used
>>> across all, manages picking which ones to flush, etc.).
>>>
>>> Then, a SlicedSegmentWriter (say) would write to separate slices,
>>> single threaded, and then you could make it multi-threaded by wrapping
>>> w/ the above class.
>>>
>>> Though SegmentWriter isn't a great name since it would in general
>>> write to multiple segments.  Indexer is a little too broad though :)
>>>
>>> Something like that maybe?
>>>
>>> Also, allowing an app to directly control the underlying
>>> SegmentWriters inside IndexWriter (instead of letting the
>>> multi-threaded wrapper decide for you) is compelling for way advanced
>>> apps, I think.  EG your app may know it's done indexing from source A
>>> for a while, so, you should right now go and flush it (whereas the
>>> default "flush the one using the most RAM" could leave that source
>>> unflushed for a quite a while, tying up RAM, unless we do some kind of
>>> LRU flushing policy or something).
>>>
>>> Mike
>>>
>>> On Wed, Apr 21, 2010 at 2:27 AM, Shai Erera<[email protected]>  wrote:
>>>
>>>
>>>> I'm not sure that a Parallel DW would work for PI because DW is too
>>>> internal
>>>> to IW. Currently, the approach I've been thinking about for PI is to
>>>> tackle
>>>> it from a high level, e.g. allow the application to pass a Directory, or
>>>> even an IW instance, and PI will play the coordinator role, ensuring
>>>> that
>>>> merge of segments happens across all the slices in accordance,
>>>> implementing
>>>> two-phase operations etc. A Parallel DW then does not fit nicely w/ that
>>>> approach (unless we want to refactor how IW works completely) because DW
>>>> is
>>>> not aware of the Directory, and if PI indeed works over IW instances,
>>>> then
>>>> each will have its own DW.
>>>>
>>>> So there are two basic approaches we can take for PI (following current
>>>> architecture) - either let PI manage IW, or have PI a sort of IW itself,
>>>> which handles events at a much lower level. While the latter is more
>>>> robust
>>>> (and based on current limitations I'm running into, might be even easier
>>>> to
>>>> do), it lacks the flexibility of allowing the app to plug any IW it
>>>> wants.
>>>> That requirement is also important, if the application wants to use PI
>>>> in
>>>> scenarios where it keeps some slices in RAM and some on disk, or it
>>>> wants to
>>>> control more closely which fields go to which slice, so that it can at
>>>> some
>>>> point in time "rebuild" a certain slice outside PI and replace the
>>>> existing
>>>> slice in PI w/ the new one ...
>>>>
>>>> We should probably continue the discussion on PI, so I suggest we either
>>>> move it to another thread or on the issue directly.
>>>>
>>>> Mike - I agree w/ you that we should keep the life of the application
>>>> developers easy and that having IW itself support concurrency is
>>>> beneficial.
>>>> Like I said ... it was just a thought which was aimed at keeping our
>>>> life
>>>> (Lucene developers) easier, but that probably comes second compared to
>>>> app-devs life :). I'm not at all sure also that that would have make our
>>>> life easier ...
>>>>
>>>> So I'm good if you want to drop the discussion.
>>>>
>>>> Shai
>>>>
>>>> On Tue, Apr 20, 2010 at 8:16 PM, Michael Busch<[email protected]>
>>>>  wrote:
>>>>
>>>>
>>>>> On 4/19/10 10:25 PM, Shai Erera wrote:
>>>>>
>>>>>
>>>>>> It will definitely simplify multi-threaded handling for IW extensions
>>>>>> like Parallel Index …
>>>>>>
>>>>>>
>>>>>>
>>>>> I'm keeping Parallel indexing in mind.  After we have separate DWPT I'd
>>>>> like to introduce parallel DWPTs, that write different slices.
>>>>>  Synchronization should not be a big worry then, because writing is
>>>>> single-threaded.
>>>>>
>>>>> We could introduce a new abstract class SegmentWriter, which DWPT would
>>>>> implement.  An extension would be ParallelSegmentWriter, which would
>>>>> manage
>>>>> multiple SegmentWriters.   Or maybe SegmentSliceWriter would be a
>>>>> better
>>>>> name.
>>>>>
>>>>>  Michael
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>

Re: Per-Thread DW and IW

Reply via email to