Re: Comparing two indexes for equality - Finding non stored fieldNames per document

Chetan Mehrotra Wed, 03 Jan 2018 01:14:28 -0800

>> So unless you "store" that value
>> with the document as a stored field, you'll have to "uninvert" the
>> index yourself.


That helps, and explains why there is no support for this in the standard API.
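
To make the "uninvert" step concrete, here is a minimal sketch on a toy
in-memory model. It does not use the Lucene API. The names uninvert,
dump_lines, postings and stored_path are hypothetical, chosen only for
illustration. A real implementation would have to walk the terms of every
segment and map per-segment document IDs back to the stored path, as noted
earlier in the thread.

```python
from collections import defaultdict


def uninvert(postings):
    """Turn {field: {term: [doc ids]}} into {doc id: sorted field names}.

    `postings` is a toy stand-in for an inverted index: it points from a
    value to the documents containing it, so we walk it and flip the mapping.
    """
    fields_by_doc = defaultdict(set)
    for field, terms in postings.items():
        for doc_ids in terms.values():
            for doc_id in doc_ids:
                fields_by_doc[doc_id].add(field)
    return {doc_id: sorted(fields) for doc_id, fields in fields_by_doc.items()}


def dump_lines(postings, stored_path):
    """Emit 'path|field, field' lines, sorted so two dumps can be diffed."""
    fields_by_doc = uninvert(postings)
    return sorted(
        f"{stored_path[doc_id]}|{', '.join(fields)}"
        for doc_id, fields in fields_by_doc.items()
    )


# Toy data (assumed, not taken from a real index): two documents whose
# stored "path" field acts as the primary key.
postings = {
    "status": {"published": [0, 1]},
    "lastModified": {"2009-10-09": [0]},
    "type": {"image": [1]},
}
stored_path = {0: "/foo/bar", 1: "/foo/baz"}

for line in dump_lines(postings, stored_path):
    print(line)
```

Generating such a dump for each index and diffing the two sorted files
would show any document for which a field name is missing in one index.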

> Luke has some capabilities to look at the index at a low level,
> perhaps that could give you some pointers. I think you can pull
> the older branch from here:
> https://github.com/DmitryKey/luke

Thanks for the pointer. Luke supports reconstructing a Document, so it
should contain the logic for retrieving non-stored field names. I will
have a look.

Chetan Mehrotra


On Tue, Jan 2, 2018 at 8:14 PM, Erick Erickson <[email protected]> wrote:
> Luke has some capabilities to look at the index at a low level,
> perhaps that could give you some pointers. I think you can pull
> the older branch from here:
> https://github.com/DmitryKey/luke
>
> or:
> https://code.google.com/archive/p/luke/
>
> NOTE: This is not a part of Lucene, but an independent project
> so it won't have the same labels.
>
> Best,
> Erick
>
> On Tue, Jan 2, 2018 at 2:06 AM, Dawid Weiss <[email protected]> wrote:
>> Ok. I think you should look at the Java API -- this will give you more
>> clarity of what is actually stored in the index
>> and how to extract it. The thing (I think) you're missing is that an
>> inverted index points in the "other" direction (from a given value to
>> all documents that contained it). So unless you "store" that value
>> with the document as a stored field, you'll have to "uninvert" the
>> index yourself.
>>
>> Dawid
>>
>> On Tue, Jan 2, 2018 at 10:05 AM, Chetan Mehrotra
>> <[email protected]> wrote:
>>>> Only stored fields are kept for each document. If you need to dump
>>>> internal data structures (terms, positions, offsets, payloads, you
>>>> name it) you'll need to dive into the API and traverse all segments,
>>>> then dump the above (and note that document IDs are per-segment and
>>>> will have to be somehow consolidated back to your document IDs).
>>>
>>> OK. So this would require a deeper understanding of the index format.
>>> I will have a look. To start with, I was just looking for a way to dump
>>> the indexed field names per document and nothing more:
>>>
>>> /foo/bar|status, lastModified
>>> /foo/baz|status, type
>>>
>>> Here the path is a stored field (the primary key) and the rest are
>>> sorted field names. Such a file can be generated for both indexes,
>>> and after sorting the two files can be diffed.
>>>
>>>> I don't quite understand the motive here -- the indexes should behave
>>>> identically regardless of the order of input documents; what's the
>>>> point of dumping all this information?
>>>
>>> This is because of the way the indexing logic is given access to the Node
>>> hierarchy. I will try to provide a brief explanation.
>>>
>>> Jackrabbit Oak provides hierarchical storage in tree form, where
>>> subtrees can be of a specific type.
>>>
>>> /content/dam/assets/december/banner.png
>>>   - jcr:primaryType = "app:Asset"
>>>   + jcr:content
>>>     - jcr:primaryType = "app:AssetContent"
>>>     + metadata
>>>       - status = "published"
>>>       - jcr:lastModified = "2009-10-9T21:52:31"
>>>       - app:tags = ["properties:orientation/landscape",
>>>                     "marketing:interest/product"]
>>>       - comment = "Image for december launch"
>>>       - jcr:title = "December Banner"
>>>       + xmpMM:History
>>>         + 1
>>>           - softwareAgent = "Adobe Photoshop"
>>>           - author = "David"
>>>     + renditions (nt:folder)
>>>       + original (nt:file)
>>>         + jcr:content
>>>           - jcr:data = ...
>>>
>>> To access this content, Oak provides a NodeStore/NodeState API [1]
>>> that gives access to a node's children. The default indexing logic
>>> uses this API to read the content to be indexed, along with index rules
>>> that allow content to be indexed via a relative path. For example, it
>>> would create a Lucene field "status" that maps to
>>> jcr:content/metadata/@status (for an index rule for nodes of type
>>> app:Asset).
>>>
>>> This mode of access proved to be slow over remote storage like Mongo,
>>> especially for full reindexing. So we implemented a newer approach
>>> in which all content is dumped to a flat file (one node per line), the
>>> file is sorted, and a NodeState implementation is layered over it. This
>>> changes how relative paths work, so there may be some bugs in the
>>> newer implementation.
>>>
>>> Hence we need to validate that indexing with the new API produces the
>>> same index as indexing with the stable API. For example, both indexes
>>> would have a document for "/content/dam/assets/december/banner.png",
>>> but if the newer implementation had a bug, it might not have indexed
>>> the "status" field.
>>>
>>> So I am looking for a way to list all field names for a given
>>> document. The actual indexed content would be the same if both indexes
>>> have the "status" field indexed, so we only need to validate field
>>> names per document.
>>>
>>> Thanks for reading all this if you have read this far :)
>>>
>>> Chetan Mehrotra
>>> [1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java
>>>
>>>
>>> On Tue, Jan 2, 2018 at 2:10 PM, Dawid Weiss <[email protected]> wrote:
>>>> Only stored fields are kept for each document. If you need to dump
>>>> internal data structures (terms, positions, offsets, payloads, you
>>>> name it) you'll need to dive into the API and traverse all segments,
>>>> then dump the above (and note that document IDs are per-segment and
>>>> will have to be somehow consolidated back to your document IDs).
>>>>
>>>> I don't quite understand the motive here -- the indexes should behave
>>>> identically regardless of the order of input documents; what's the
>>>> point of dumping all this information?
>>>>
>>>> Dawid
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>>
>
>

