Re: Comparing two indexes for equality - Finding non stored fieldNames per document

Dawid Weiss Tue, 02 Jan 2018 00:41:39 -0800

Only stored fields are kept for each document. If you need to dump
internal data structures (terms, positions, offsets, payloads, you
name it) you'll need to dive into the API and traverse all segments,
then dump the above (and note that document IDs are per-segment and
will have to be somehow consolidated back to your document IDs).
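
A minimal sketch of that last consolidation step only, written in plain Java with no Lucene calls. It assumes the per-segment data has already been extracted through the API, and that the stored ':path' field from the original post serves as the stable key. `IndexDumpSketch`, `SegmentDump`, `fieldNamesByPath` and `toLines` are hypothetical names made up for illustration, not Lucene classes; the sketch uses Java 8 library calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class IndexDumpSketch {

    /** Hypothetical per-segment result of walking the postings of every field. */
    static final class SegmentDump {
        /** Segment-local document ID -> value of the stored ':path' field. */
        final Map<Integer, String> pathByLocalDocId;
        /** Field name -> segment-local IDs of documents with a term in that field. */
        final Map<String, List<Integer>> localDocIdsByField;

        SegmentDump(Map<Integer, String> pathByLocalDocId,
                    Map<String, List<Integer>> localDocIdsByField) {
            this.pathByLocalDocId = pathByLocalDocId;
            this.localDocIdsByField = localDocIdsByField;
        }
    }

    /** Re-keys every (field, local doc ID) pair by path, so segment layout no longer matters. */
    static SortedMap<String, SortedSet<String>> fieldNamesByPath(List<SegmentDump> segments) {
        SortedMap<String, SortedSet<String>> result = new TreeMap<>();
        for (SegmentDump segment : segments) {
            // Documents without any indexed field still get an (empty) entry.
            for (String path : segment.pathByLocalDocId.values()) {
                result.computeIfAbsent(path, p -> new TreeSet<>());
            }
            for (Map.Entry<String, List<Integer>> e : segment.localDocIdsByField.entrySet()) {
                for (int localDocId : e.getValue()) {
                    String path = segment.pathByLocalDocId.get(localDocId);
                    if (path == null) {
                        continue; // local ID with no dumped path, e.g. a deleted document
                    }
                    result.get(path).add(e.getKey());
                }
            }
        }
        return result;
    }

    /** One line per document, sorted by path, ready for a plain text diff. */
    static List<String> toLines(SortedMap<String, SortedSet<String>> fieldNamesByPath) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, SortedSet<String>> e : fieldNamesByPath.entrySet()) {
            lines.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
        return lines;
    }
}
```

Because both maps are sorted, two indexes that hold the same documents in different segments and with different document IDs should produce identical line lists under these assumptions.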


I don't quite understand the motive here -- the indexes should behave
identically regardless of the order of input documents; what's the
point of dumping all this information?

Dawid

On Tue, Jan 2, 2018 at 9:36 AM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
>> How about the quickest solution: dump the content of both indexes to a
>> document-per-line text
>
> That would work (and is the plan), but so far I can only get the stored
> fields per document and no other per-document data. What other
> per-document data can we get using the Lucene API?
> Chetan Mehrotra
>
>
> On Tue, Jan 2, 2018 at 1:03 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>> How about the quickest solution: dump the content of both indexes to a
>> document-per-line text
>> file, sort, diff?
>>
>> Even if your indexes are large, if you have plenty of spare disk space,
>> this will be super fast.
>>
>> Dawid
>>
>> On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra
>> <chetan.mehro...@gmail.com> wrote:
>>> Hi,
>>>
>>> We use Lucene for indexing in Jackrabbit Oak [2]. Recently we
>>> implemented a new indexing approach [1] which traverses the data to be
>>> indexed in a different way compared to the traversal approach we have
>>> been using so far. The new approach is faster and produces an index
>>> with the same number of documents.
>>>
>>> Some notes about the index
>>> ------------------------------------
>>>
>>> - The Lucene index has only one stored field, ':path', for the path of
>>> the node in the repository.
>>> - Content being indexed is unstructured, so the presence of fields may
>>> differ
>>> - Lucene version 4.7.x
>>> - Both approaches would index a given node in the same way. It's just
>>> the traversal order which differs
>>>
>>> Now we need to compare the index produced by the earlier approach
>>> with the newer one to determine if the generated index is the "same".
>>> As the indexed data is traversed in a different order, the documentId
>>> would differ between the two indexes, and hence the final size differs
>>> to some extent.
>>>
>>> So I would like to implement logic which can logically compare 2
>>> indexes. One way could be to check whether a document with a given
>>> path has the same fieldNames associated with it in both indexes.
>>> However, as the fields are not stored, it's not possible to determine
>>> the fieldNames per document.
>>>
>>> Questions
>>> --------------
>>>
>>> 1. Is there any way to map the field names (not the values) associated
>>> with a given document?
>>> 2. Is there any other way to logically compare the index data between 2
>>> indexes which are generated using different approaches but index the
>>> same content?
>>>
>>> Chetan Mehrotra
>>> [1] https://issues.apache.org/jira/browse/OAK-6353
>>> [2] http://jackrabbit.apache.org/oak/docs/query/lucene.html
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
