> Only stored fields are kept for each document. If you need to dump
> internal data structures (terms, positions, offsets, payloads, you
> name it) you'll need to dive into the API and traverse all segments,
> then dump the above (and note that document IDs are per-segment and
> will have to be somehow consolidated back to your document IDs).

Okay. So this would require a deeper understanding of the index format;
I'll have a look. To start with, I was just looking for a way to dump
the indexed field names per document and nothing more, e.g.:

/foo/bar|status, lastModified
/foo/baz|status, type

Here the path is a stored field (the primary key) and the rest are the
sorted field names. Such a file can then be generated for both indexes,
and a diff can be done after sorting.
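The dump-and-diff step could be sketched roughly like this (plain Java; the `FieldDump` class and the in-memory path-to-fields map are hypothetical stand-ins for whatever actually extracts the field names from the index):

```java
import java.util.*;
import java.util.stream.Collectors;

public class FieldDump {
    // Render one "path|field1, field2" line with the field names sorted,
    // so two dumps can be compared with a plain textual diff.
    static String dumpLine(String path, Collection<String> fieldNames) {
        return path + "|" + fieldNames.stream().sorted().collect(Collectors.joining(", "));
    }

    // Emit one line per document, sorted by path.
    static List<String> dump(Map<String, Set<String>> docs) {
        return docs.entrySet().stream()
                .sorted(Map.Entry.comparingByKey())
                .map(e -> dumpLine(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }
}
```

Two such dumps, one per index, could then be compared with an ordinary `diff`.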

> I don't quite understand the motive here -- the indexes should behave
> identically regardless of the order of input documents; what's the
> point of dumping all this information?

This is because of the way the indexing logic is given access to the
node hierarchy. Let me try to provide a brief explanation.

Jackrabbit Oak provides hierarchical storage in tree form, where
subtrees can be of a specific type.

/content/dam/assets/december/banner.png
  - jcr:primaryType = "app:Asset"
  + jcr:content
    - jcr:primaryType = "app:AssetContent"
    + metadata
      - status = "published"
      - jcr:lastModified = "2009-10-9T21:52:31"
      - app:tags = ["properties:orientation/landscape",
"marketing:interest/product"]
      - comment = "Image for december launch"
      - jcr:title = "December Banner"
      + xmpMM:History
        + 1
          - softwareAgent = "Adobe Photoshop"
          - author = "David"
    + renditions (nt:folder)
      + original (nt:file)
        + jcr:content
          - jcr:data = ...

To access this content Oak provides a NodeStore/NodeState API [1]
which provides a way to access the children. The default indexing
logic uses this API to read the content to be indexed and applies
index rules which allow content to be indexed via relative paths. For
example, it would create a Lucene field "status" which maps to
jcr:content/metadata/@status (for an index rule for nodes of type
app:Asset).
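As an illustration of how such an index rule resolves a relative property path, here is a toy sketch (the `Node` class is a made-up, simplified stand-in for Oak's NodeState, not the real API):

```java
import java.util.*;

public class RelativePathLookup {
    // Toy stand-in for Oak's NodeState: a node has properties plus named children.
    static class Node {
        final Map<String, Object> properties = new HashMap<>();
        final Map<String, Node> children = new HashMap<>();
    }

    // Resolve a relative property path like "jcr:content/metadata/status"
    // against a node, the way an index rule flattens it into a Lucene field.
    static Object resolve(Node node, String relativePath) {
        String[] parts = relativePath.split("/");
        for (int i = 0; i < parts.length - 1; i++) {
            node = node.children.get(parts[i]);
            if (node == null) return null; // intermediate node missing
        }
        return node.properties.get(parts[parts.length - 1]);
    }
}
```

For the asset tree above, resolving "jcr:content/metadata/status" on the app:Asset node would yield "published", which is what ends up in the flat "status" field.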

This mode of access proved to be slow over remote storage like
MongoDB, especially for the full-reindexing case. So we implemented a
newer approach where all content is dumped into a flat file (one node
per line), the file is sorted, and a NodeState implementation is
layered over this flat file. This changes how relative paths are
resolved, and thus there may be some potential bugs in the newer
implementation.
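The flat-file layout can be sketched like this (hypothetical code: a sorted in-memory list stands in for the sorted file, and `childrenOf` mimics how direct children would be located by scanning the sorted paths):

```java
import java.util.*;
import java.util.stream.Collectors;

public class FlatFileStore {
    // Given the sorted "one node per line" paths, return the direct
    // children of a parent path. A real store would stream the sorted
    // file; here a sorted in-memory list stands in for it.
    static List<String> childrenOf(List<String> sortedPaths, String parent) {
        String prefix = parent.endsWith("/") ? parent : parent + "/";
        return sortedPaths.stream()
                .filter(p -> p.startsWith(prefix))
                .filter(p -> p.indexOf('/', prefix.length()) < 0) // direct children only
                .collect(Collectors.toList());
    }
}
```

Child access (and hence relative-path resolution) becomes a scan over sorted lines rather than a pointer chase, which is where subtle behavioral differences could creep in.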

Hence we need to validate that indexing using the new API produces the
same index as the stable API. For example, both indexes would have a
document for "/content/dam/assets/december/banner.png", but if the
newer implementation had a bug then it may not have indexed the
"status" field.

So I am looking for a way to map all field names for a given document.
The actual indexed content would be the same as long as both indexes
have the "status" field indexed, so we only need to validate the field
names per document.

Thanks for reading if you have made it this far :)

Chetan Mehrotra
[1] 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java


On Tue, Jan 2, 2018 at 2:10 PM, Dawid Weiss <dawid.we...@gmail.com> wrote: