Re: Fields, Index segments and docIds

Olivier Binda Tue, 29 Apr 2014 00:32:30 -0700

This really helps! I didn't know about MultiReader. It looks like exactly what I need for 1 & 2.


For 3: remapping docIds would allow me to use them as ids for my data,
instead of having a stored field with my ids (which is usually the
officially recommended way to do this in Lucene).

It may not be a good idea, but as my index is read-only, that might be
interesting (a trade-off between less speed/complexity ... and maintainability)?

Now, with MultiReader, maybe I don't need docId remapping.

Thanks for the great answer!
Olivier




On 04/29/2014 08:46 AM, Uwe Schindler wrote:

Hi Olivier,

To me it looks like you are making this much more complicated than it needs to be. It also
seems that you misunderstood join queries, which seems to be your problem. Comments inline:

My Lucene index is built and stored in a zip file (uncompressed) which is used
as a read-only Directory.

1) At Lucene indexing time, is it possible to rewrite the index so that some
fields are only found in some segments? Say:

EnglishWords, EnglishVerbs go to Segment 1
GermanWords, GermanSentences go to Segment 2
French, frenchWines go to Segment 3 ...

You can create exactly the same index structure manually, without dealing with Lucene 
internals. Just index every language into a separate index with a separate 
IndexWriter. As those indexes are read-only, you can call forceMerge(1) after 
indexing, so that each index has exactly 1 segment, i.e. every language has one 
single segment.

The only difference is: You would need a separate ZIP file for every language (which is 
what you probably need, because you want to ship "language packs"). Or you have 
to rewrite your ZIP-Directory implementation, to work on subdirectories inside the ZIP 
file.

2) In what file is the index structure written (number of index,
docValues...)? And is it possible to tamper with this in some way? Say, in a
Directory implementation, at the start of my application, to tell the Lucene index
whether to use this segment or not?

If every language is a separate index, just use "new MultiReader(indexReader1, 
indexReader2, indexReader3)" to combine them and query the MultiReader. This has the 
same structure as a single DirectoryReader (which is also handled as a MultiReader 
internally) and therefore has no speed impact.

If 1, 2 were possible, I think that it would allow me to ship my index
in a modular way in my apps (with language packs)
and do join queries as regular queries, with no speed penalty

The "join" keyword seems to be your main misunderstanding. There is no relation between 
join queries and multiple indexes. In Lucene, "join" queries join documents of different 
types within the same index! Querying multiple indexes together is not joining; it is simple 
and very fast (because this is how Lucene was made): just use the MultiReader approach from above 
to query all indexes at the same time. As a MultiReader over many single-segment DirectoryReaders is 
identical to one large DirectoryReader with n segments, there is no difference at all.
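
To illustrate why querying several single-segment indexes through one MultiReader
behaves like one index with n segments: a composite reader addresses documents by
giving each sub-reader a starting offset (its docBase), equal to the total maxDoc of
the sub-readers before it. The sketch below is a simplified plain-Java model of that
addressing, not Lucene's own code; the class name DocBaseSketch, its methods, and the
per-language document counts are all made up for the example.

```java
/** Simplified model of composite-reader docId addressing (hypothetical, not Lucene code). */
public class DocBaseSketch {

    /** starts[i] is the docBase of sub-reader i; the last entry is the total doc count. */
    public static int[] docBases(int[] maxDocs) {
        int[] starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        return starts;
    }

    /** Index of the sub-reader that holds the given global docId. */
    public static int subIndex(int globalDocId, int[] starts) {
        int total = starts[starts.length - 1];
        if (globalDocId < 0 || globalDocId >= total) {
            throw new IndexOutOfBoundsException(
                "docId " + globalDocId + " not in [0, " + total + ")");
        }
        int i = 0;
        // Skip every sub-reader whose docId range ends at or before globalDocId.
        while (globalDocId >= starts[i + 1]) {
            i++;
        }
        return i;
    }

    /** docId relative to the sub-reader that holds the given global docId. */
    public static int localDocId(int globalDocId, int[] starts) {
        return globalDocId - starts[subIndex(globalDocId, starts)];
    }

    public static void main(String[] args) {
        // Assumed example: three language packs with 3, 2 and 4 documents.
        String[] languages = {"english", "german", "french"};
        int[] starts = docBases(new int[] {3, 2, 4});
        for (int docId = 0; docId < starts[starts.length - 1]; docId++) {
            int sub = subIndex(docId, starts);
            System.out.println("global " + docId + " -> " + languages[sub]
                + ", local " + localDocId(docId, starts));
        }
    }
}
```

One consequence under this model: the global docIds of a language pack depend on which
packs come before it, so they shift whenever a pack is added or removed in front of it.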

This is something different:

3) At Lucene indexing time, is it possible to remap the docId values? (I saw
some MergeState.mapDocId method...) Say:
   0 -> 4
   1 -> 3
   2 -> 1
   3 -> 0
   4 -> 2

If 3 is possible, it would allow me to have some sort of
forward/backward compatibility with my shipped language packs
and also to have fast implementations for some id-related methods.

What do you want to do? Why do you want to do this? (please refer to XY-Problem: 
<https://people.apache.org/~hossman/#xyproblem>).

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



