Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable?

Britske Wed, 04 Nov 2009 09:29:53 -0800

Yeah, I understand. Thanks anyway, it cleared my head a bit.

Geert-Jan



Erick Erickson wrote:
> 
> You're right, my comment was irrelevant. Mostly, I try to make sure
> that people aren't asking an "XY problem", that is, asking how
> to do X when what they really want is Y. And most of the posts
> I've seen wondering about doc IDs were exactly that, but yours
> clearly isn't.
> 
> And I'm going to have to defer any other comments to people who
> know more about it than I do....
> 
> Erick
> 
> On Wed, Nov 4, 2009 at 11:08 AM, Britske <[email protected]> wrote:
> 
>>
>> please ignore the garbage at the end ;-)
>>
>>
>> Britske wrote:
>> >
>> > This issue is related to post: "merging Parallel indexes (can
>> > indexWriter.addIndexesNoOptimize be used?)"
>> >
>> > Among other things described in the post above, I'm experimenting with
>> > a combination of sharding and vertical partitioning, which I feel will
>> > increase my indexing performance a lot; at the moment that is a real
>> > problem. More than 99% of indexing time goes to a bunch of indexed
>> > fields (+/- 20.000 of them, I know that's a lot), which are all pretty
>> > much related.
>> >
>> > For this I'm considering the following setup:
>> > N boxes will create 2 indexes each: index A containing the 20.000
>> > indexed fields, and index B containing the rest.
>> >
>> > Index B is created using the normal route: indexWriter.addDocument().
>> > But index A will be created using a custom (yet to be written) indexer.
>> > Since the indexing client knows a lot about the documents and these
>> > particular fields (basically it can very efficiently calculate the
>> > inverted indexes for all these fields and thus more or less directly
>> > construct the .frq, .tii and .tis files), I'm pretty sure a lot of time
>> > can be gained. That is, once I figure out the nitty-gritty low-level
>> > details of writing to these files.
>> > Any help here much appreciated ;-).
>> >
>> > At some point all of these indexes on these boxes have to be merged.
>> > There would be 2 routes (hypothetical methods):
>> >
>> > 1.
>> > TotalA = mergeShards(box1.A,...boxN.A)
>> > TotalB = mergeShards(box1.B,...boxN.B)
>> > Total  = mergeVertical(TotalA, TotalB)
>> >
>> > 2.
>> > Total1 = mergeVertical(box1.A,box1.B)
>> > Total2 = mergeVertical(box2.A,box2.B)
>> > ...
>> > TotalN = mergeVertical(boxN.A,boxN.B)
>> > Total  = mergeShards(Total1,...TotalN)
>> >
>> >
>> > My question stems from option 1.
>> >
>> > After merging shards, TotalA and TotalB should have the same
>> > docid order, because that's a prerequisite for doing something like:
>> > docwriter.addIndexesNoOptimize(new ParallelReader(TotalA,TotalB))
>> >
>> > Sadly, your suggestion doesn't work in this situation, I think.
>> >
>> > However, after having written this I feel option 2 might be better
>> > anyway performance-wise, because I have N boxes around which could
>> > parallelize:
>> > Total1 = mergeVertical(box1.A,box1.B)
>> > Total2 = mergeVertical(box2.A,box2.B)
>> > ...
>> > TotalN = mergeVertical(boxN.A,boxN.B)
>> >
>> > In this situation I don't have to rely on mergeShards to produce a
>> > calculable order of docids, because I do all vertical merges before
>> > merging the shards. Of course, for each individual vertical merge the
>> > docids still have to be in order, but this could be achieved using
>> > your suggestion.
>> >
>> > Any advice or thoughts on whether this route would be worth the effort
>> > are much appreciated!
>> >
>> > Thanks for clearing my head a bit.
>> >
>> > Geert-Jan
>> >
>> >
>> >
>> >
>> >
>> > Total 1  = mergeVertical(box1.A,box1.B)
>> > TotalB  = mergeShards(box1.B,...boxN.B)
>> > Total = MergeVertical(TotalA, TotalB)
>> >
>> >
>> >
>> >
>> > At some time I want to merge these parallel indexes but need to ensure
>> > that docids are in order.
>> >
>> > I could indeed wait for the first index (which contains all other
>> fields
>> > but the 20.000) to be constructed and optimized and use your suggested
>> > method to go from key --> docid and thus know the order in which I
>> should
>> > add the documents to the second index.
>> > However this requires me to wait for the first
>> >
>> >
>> >
>> > Erick Erickson wrote:
>> >>
>> >> Hmmmm, why do you care? That is, what is it you're trying to do
>> >> that makes this question necessary? There might be a better
>> >> solution than trying to depend on doc IDs.
>> >>
>> >> Because I don't think you can assume that, even if it is deterministic
>> >> with the version you're using now, it would be in some other version.
>> >> Lucene makes no promises here.
>> >>
>> >> All the advice I've ever seen says that if you want to keep track of
>> >> documents, you assign and index your own ID. You can get the
>> >> doc ID from your unique term quite efficiently if you need to.
>> >>
>> >> HTH
>> >> Erick
>> >>
>> >> On Wed, Nov 4, 2009 at 9:23 AM, Britske <[email protected]> wrote:
>> >>
>> >>>
>> >>> Hi,
>> >>>
>> >>> say I have:
>> >>> - IndexReader[] readers = {reader1, reader2, reader3} // all containing
>> >>> different docs
>> >>> - I know the internal docids of documents in reader1, reader2 and
>> >>> reader3 separately
>> >>>
>> >>> Does doing IndexWriter.addIndexesNoOptimize(IndexReader[] readers) on
>> >>> these readers give me a deterministic and calculable set of docids for
>> >>> the documents in the resulting IndexWriter?
>> >>>
>> >>> i.e. from http://lucene.apache.org/java/2_4_1/fileformats.html:
>> >>> "The numbers stored in each segment are unique only within the segment,
>> >>> and must be converted before they can be used in a larger context. The
>> >>> standard technique is to allocate each segment a range of values, based
>> >>> on the range of numbers used in that segment. To convert a document
>> >>> number from a segment to an external value, the segment's base document
>> >>> number is added."
>> >>>
>> >>> Does assigning docids in addIndexesNoOptimize work like this?
>> >>> In other words:
>> >>> - docids of docs in reader1 stay the same in the IndexWriter
>> >>> - docids of docs in reader2 are incremented by reader1.docs.size()
>> >>> - docids of docs in reader3 are incremented by reader1.docs.size() +
>> >>> reader2.docs.size()
>> >>>
>> >>> Thanks,
>> >>> Geert-Jan
>> >>> --
>> >>> View this message in context:
>> >>> http://old.nabble.com/addIndexesNoOptimize-on-shards---%3E-is-docid-deterministic-and-calculable--tp26197146p26197146.html
>> >>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [email protected]
>> >>> For additional commands, e-mail: [email protected]
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://old.nabble.com/addIndexesNoOptimize-on-shards---%3E-is-docid-deterministic-and-calculable--tp26197146p26199239.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
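
The docid arithmetic asked about in the original question quoted above can
be written out as a small sketch. This is only the hypothesis from the
question (each reader's docids shifted by the sizes of the readers before
it), not confirmed Lucene behaviour; as Erick says, Lucene makes no promises
here. The class and method names are hypothetical, plain Java is used
instead of Lucene calls, and the sketch assumes the readers contain no
deleted documents.

```java
import java.util.Arrays;

class DocIdOffsets {
    /**
     * Computes the base docid for each reader, assuming the readers are
     * concatenated in array order and contain no deleted documents
     * (an assumption, not something Lucene guarantees).
     */
    static int[] bases(int[] docCounts) {
        int[] bases = new int[docCounts.length];
        int next = 0;
        for (int i = 0; i < docCounts.length; i++) {
            bases[i] = next;       // first docid this reader would get
            next += docCounts[i];  // the next reader starts after this one
        }
        return bases;
    }

    /** Maps a reader-local docid to the hypothesised docid after merging. */
    static int mergedDocId(int[] bases, int readerIndex, int localDocId) {
        return bases[readerIndex] + localDocId;
    }

    public static void main(String[] args) {
        int[] docCounts = {3, 5, 2}; // hypothetical sizes of reader1..reader3
        int[] b = bases(docCounts);
        System.out.println(Arrays.toString(b));   // prints [0, 3, 8]
        System.out.println(mergedDocId(b, 2, 1)); // prints 9
    }
}
```

If the hypothesis held, local docid 1 in reader3 would end up as docid 9.
Whether a given Lucene version actually behaves this way is exactly the
open question of this thread.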

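Erick's suggestion quoted above (index your own unique key and look up the
docid from it) is what the thread proposes for keeping each vertical merge
in order: find the docid order of index B, then add documents to index A in
that same order. A minimal sketch of that ordering step follows. It uses an
in-memory list of keys as a stand-in for the real index lookup, and all
names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ParallelOrder {
    /**
     * Builds key -> docid from index B's keys listed in docid order.
     * Stand-in for looking up each key in the real index.
     */
    static Map<String, Integer> keyToDocId(List<String> keysInDocIdOrderB) {
        Map<String, Integer> map = new HashMap<>();
        for (int docId = 0; docId < keysInDocIdOrderB.size(); docId++) {
            map.put(keysInDocIdOrderB.get(docId), docId);
        }
        return map;
    }

    /**
     * Returns the keys for index A sorted into the docid order of index B.
     * Every key must be present in the map, otherwise the lookup fails.
     */
    static List<String> orderForIndexA(List<String> keysForA,
                                       Map<String, Integer> keyToDocIdB) {
        List<String> ordered = new ArrayList<>(keysForA);
        ordered.sort(Comparator.comparingInt(key -> keyToDocIdB.get(key)));
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical keys: index B happened to store them in this order.
        List<String> indexB = Arrays.asList("doc-c", "doc-a", "doc-b");
        List<String> pendingA = Arrays.asList("doc-a", "doc-b", "doc-c");
        // prints [doc-c, doc-a, doc-b]
        System.out.println(orderForIndexA(pendingA, keyToDocId(indexB)));
    }
}
```

The catch, as noted in the thread, is that index B has to be finished before
the order for index A is known.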
-- 
View this message in context: 
http://old.nabble.com/addIndexesNoOptimize-on-shards---%3E-is-docid-deterministic-and-calculable--tp26197146p26200836.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

