Yeah I understand. Thanks anyway, it cleared my head a bit, Geert-Jan
Erick Erickson wrote: > > You're right, my comment was irrelevant. Mostly, I try to make sure > that people aren't asking an "XY problem", That is, asking for how > to do X when what they really want is Y. And most of the posts > I've seen wondering about doc IDs were exactly that, but yours > clearly isn't. > > And I'm going to have to defer any other comments to people who > know more about it than I do.... > > Erick > > On Wed, Nov 4, 2009 at 11:08 AM, Britske <gbr...@gmail.com> wrote: > >> >> please ignore the garbage at the end ;-) >> >> >> Britske wrote: >> > >> > This issue is related to post: "merging Parallel indexes (can >> > indexWriter.addIndexesNoOptimize be used?)" >> > >> > Among another thing described in the post above, I'm experimenting with >> a >> > combination of sharding and vertical partitioning which I feel will >> > increase my indexing performance a lot, which at the moment is a real >> > problem. Indexing time is for more than 99% related to a bunch of >> indexed >> > fields (+/- 20.000 of them, I know that's a lot) which are all pretty >> much >> > related. >> > >> > For this I'm considering the following setup: >> > N boxes will create 2 indexes each: index A containing the 20.000 >> indexed >> > fields, and index B contains the rest. >> > >> > Index B is created using the normal route: indexWriter.addDocument(). >> > But index A will be created using a custom (yet to write) indexer. >> Since >> > the indexing client knows a lot of the documents and these particular >> > fields (basically it can very effciently calculate the inverse indexes >> for >> > all these fields and thus more or less directly construct .frq, .tii, >> > .tis files) I'm pretty sure a lot of time can be gained. That is, once >> I >> > figure out the nitty-gritty low level details of writing to these >> files. >> > Any help here much appreciated ;-). >> > >> > At some point all of these indexes over these boxes have to be merged. >> > there would be 2 routes: (hypothetical methods) >> > >> > 1. >> > TotalA = mergeShards(box1.A,...boxN.A) >> > TotalB = mergeShards(box1.B,...boxN.B) >> > Total = MergeVertical(TotalA, TotalB) >> > >> > 2. >> > Total 1 = mergeVertical(box1.A,box1.B) >> > Total 2 = mergeVertical(box2.A,box2.B) >> > ... >> > Total N = mergeVertical(boxN.A,boxN.B) >> > Total = mergeShards(Total1,...TotalN) >> > >> > >> > My question stems from option 1. >> > >> > After merging shards TotalA and Total2 should have the same >> docid-order, >> > because that's a prereq for doing something like: >> > docwriter.addIndexesNoOptimize(new ParallelReader(TotalA,TotalB)) >> > >> > Sadly your suggestion doesn't work in this situation I think. >> > >> > However, After having written this I feel option 2 might be better >> anyway >> > performance wise, because I have N boxes around which could >> parallelize: >> > Total 1 = mergeVertical(box1.A,box1.B) >> > Total 2 = mergeVertical(box2.A,box2.B) >> > ... >> > Total N = mergeVertical(boxN.A,boxN.B) >> > >> > In this situation I don't have to rely on mergeShards to produce a >> > calculable order of docids, because I do all vertical merges before >> > merging the shards. Of course for all individual vertical merges docids >> > have to still be in order but this could be achieved using your >> > suggestion. >> > >> > And advice or thought on if this route would be worth the effort or not >> is >> > much appreciated! >> > >> > Thanks for clearing my head a bit. >> > >> > Geert-Jan >> > >> > >> > >> > >> > >> > Total 1 = mergeVertical(box1.A,box1.B) >> > TotalB = mergeShards(box1.B,...boxN.B) >> > Total = MergeVertical(TotalA, TotalB) >> > >> > >> > >> > >> > At some time I want to merge these parallel indexes but need to ensure >> > that docids are in order. >> > >> > I could indeed wait for the first index (which contains all other >> fields >> > but the 20.000) to be constructed and optimized and use your suggested >> > method to go from key --> docid and thus know the order in which I >> should >> > add the documents to the second index. >> > However this requires me to wait for the first >> > >> > >> > >> > Erick Erickson wrote: >> >> >> >> Hmmmm, why do you care? That is, what is it you're trying to do >> >> that makes this question necessary? There might be a better >> >> solution than trying to depend on doc IDs. >> >> >> >> Because I don't think you can assume that, even if it is deterministic >> >> with the version you're using now that it would be in some other >> version, >> >> Lucene makes no promises here. >> >> >> >> All the advice I've ever seen says that if you want to keep track of >> >> documents, you assign and index your own ID. You can get the >> >> doc ID from your unique term quite efficiently if you need to. >> >> >> >> HTH >> >> Erick >> >> >> >> On Wed, Nov 4, 2009 at 9:23 AM, Britske <gbr...@gmail.com> wrote: >> >> >> >>> >> >>> Hi, >> >>> >> >>> say I have: >> >>> - Indexreader[] readers = {reader1, reader2, reader3} //containing >> all >> >>> different docs >> >>> - I know the internal docids of documents in reader1, reader2, >> reader3 >> >>> seperately >> >>> >> >>> Does doing IndexWriter.addIndexesNoOptimize(Indexreader[] readers) on >> >>> these >> >>> readers give me a determinstic and calculable set of docids on the >> >>> documents >> >>> in the resulting documentWriter? >> >>> >> >>> i.e: from http://lucene.apache.org/java/2_4_1/fileformats.html: >> >>> "The numbers stored in each segment are unique only within the >> segment, >> >>> and >> >>> must be converted before they can be used in a larger context. The >> >>> standard >> >>> technique is to allocate each segment a range of values, based on the >> >>> range >> >>> of numbers used in that segment. To convert a document number from a >> >>> segment >> >>> to an external value, the segment's base document number is added." >> >>> >> >>> Does assinging docids in addIndexesNoOptimize work like this? >> >>> in other words: >> >>> - docids of docs in reader1 stay the same in indexwriter >> >>> - docids of docs in reader2 are incremented by reader1.docs.size(); >> >>> - docids of docs in reader3 are incremented by reader1.docs.size() + >> >>> reader2.docs.size() >> >>> >> >>> Thanks, >> >>> Geert-Jan >> >>> -- >> >>> View this message in context: >> >>> >> http://old.nabble.com/addIndexesNoOptimize-on-shards---%3E-is-docid-deterministic-and-calculable--tp26197146p26197146.html >> >>> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >>> >> >>> >> >>> --------------------------------------------------------------------- >> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >>> >> >>> >> >> >> >> >> > >> > >> >> -- >> View this message in context: >> http://old.nabble.com/addIndexesNoOptimize-on-shards---%3E-is-docid-deterministic-and-calculable--tp26197146p26199239.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > -- View this message in context: http://old.nabble.com/addIndexesNoOptimize-on-shards---%3E-is-docid-deterministic-and-calculable--tp26197146p26200836.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org