[ 
https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480330#comment-16480330
 ] 

Thomas Mueller edited comment on OAK-7353 at 5/22/18 9:46 AM:
--------------------------------------------------------------

Worked on the idea a bit more with [[email protected]], and here are a few 
thoughts.
h5. Constraints
h6. Consistency between csv and index

The CSV provides us with a mapping of blob id to path. The dumped index would 
contain extracted binary text for some paths.

Combining these 2 implies that we're assuming a binary at a given path hasn't 
changed after it was indexed.

One way to make sure that this assumption holds is to tie CSV generation and 
the dump of indexed data into a single command (hence using the same state of 
the repository for both ends). But that poses at least 2 problems:
 * while generating the CSV we would often want to use a fake DS to avoid 
reaching out to a remote DS just to read blob ids. BUT, the index dump requires 
the real blob. So, we'd need to improve the fake DS impl to fall back to the 
real DS for index paths.
 * coupling these 2 steps means that CSV generation needs to happen whenever 
we want to dump the index for this usage (CSV generation requires a repository 
traversal and hence can be slow)

On the other hand, binaries are not updated very often in real-world cases - 
so, we can simply add a disclaimer such as "Please ensure that no binaries are 
updated between the CSV generation and index dump steps". ([[email protected]] 
and I tend towards this option.)
h6. Which index is suitable for such optimization

Extracted text for a binary is indexed as the stored field {{:fulltext}}. 
Aggregate rules or {{nodeScopeIndex}}-ed property definitions would be the 
ones that get this prepared in most cases. It's possible to have an index 
definition with multiple aggregate rules (combined with different nodetypes as 
well) which would extract a binary from a relative path under the indexed 
node. There's no way to distinguish which part of the {{:fulltext}} data is 
coming from which relative path.

So, to simplify things, we'd only support indexes extracting the binary at the 
same path where the binary is stored. In other words, we'd only pull extracted 
text from the index if the path in the CSV fetches a stored {{:fulltext}} 
field for the same path in the index.

A couple of examples of indexes which could be used would look something like:
{noformat}
+ /oak:index/usableIndex1
  ...
  + indexRules
    ...
    + nt:resource
      + properties
        ...
        + binary
          - name="jcr:data"
          - nodeScopeIndex=true

+ /oak:index/usableIndex2
  ...
  + aggregates
    ...
    + nt:resource
      + include0
        - path="*"
{noformat}
[~chetanm], can you please double-check whether the aggregate rule in 
{{usableIndex2}} is indeed what we'd expect?
h5. Steps
 # Prepare the CSV using Step 2 in [0]
 # Dump a compatible index (as described above) using [1]. Use the 
{{--index-paths}} option to dump only the required index.
 # Use the feature from this issue to prepare the text store by pulling in 
data from the index for the blobs pointed to in the CSV
 # Run classic Tika-based text extraction for binaries which might not be part 
of the index (currently Step 3 in [0])
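
Steps 3 and 4 combined could be sketched along these lines (the 
{{tika_extract}} callable stands in for the existing classic extraction step; 
everything here is illustrative):

```python
# Sketch of steps 3 and 4: reuse text from the index dump where available,
# and fall back to classic Tika extraction for the remaining blobs.
def build_text_store(csv_rows, index_dump, tika_extract):
    store = {}
    for blob_id, path in csv_rows:
        fields = index_dump.get(path, {})
        if ":fulltext" in fields:
            store[blob_id] = fields[":fulltext"]   # step 3: indexed text
        else:
            store[blob_id] = tika_extract(blob_id)  # step 4: classic path
    return store
```

In practice step 4 runs as a separate oak-run invocation; the point is only 
that index-supplied text short-circuits extraction per blob.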

h5. Extra notes

Classic Tika-based text extraction prepares an FDS-like structure to store 
extracted data. Along with that, it also outputs 2 metadata files - 
{{blobs_error.txt}} and {{blobs_empty.txt}} - marking which blobs threw an 
error or produced empty output during extraction, respectively. This is done 
to save time when the text extraction store is prepared incrementally.
 In the approach used by this issue, we would populate {{blobs_empty.txt}} 
along the same lines as classic extraction BUT we'd avoid populating 
{{blobs_error.txt}}, because it could be that a given binary is simply not 
indexed by the index feeding in the extracted text, OR that the index being 
used doesn't quite comply with the constraints outlined above. Populating 
{{blobs_error.txt}} would prevent even classic text extraction from processing 
genuine binaries not present in the provided indexed data.
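
A minimal sketch of that metadata handling (file names follow the ones 
mentioned above; the function shape is purely illustrative):

```python
# Sketch: blobs whose index-supplied text is empty go into blobs_empty.txt,
# but blobs missing from the store are deliberately NOT written to
# blobs_error.txt, so a later classic extraction pass still attempts them.
def metadata_files(store):
    """store: mapping of blob_id -> extracted text pulled from the index."""
    empty = sorted(b for b, text in store.items() if text == "")
    return {"blobs_empty.txt": empty, "blobs_error.txt": []}
```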

[[email protected]], [~tmueller], [~chetanm], please share your thoughts.

[0]: [https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html]
[1]: [https://jackrabbit.apache.org/oak/docs/query/oak-run-indexing.html#async-index-data]



> oak-run tika extraction should support getting assistance from stored indexed 
> data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-7353
>                 URL: https://issues.apache.org/jira/browse/OAK-7353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, oak-run
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>
> oak-run supports pre-text-extraction \[0] which does a great job of doing 
> text extraction in parallel so that it can be ingested later during 
> indexing.
> But:
> * it still reaches out to the datastore, which, in the case of S3, could be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want pre-extracted text is reindexing - say, on an 
> update of the index definition which won't impact the extracted data from 
> binaries (basically, updates which don't change the Tika configuration)
> In those cases, it's often possible that there is a version of indexed data 
> from an older version of the index definition that can supply extracted text 
> (as its binary properties are indexed as stored fields)
> So, essentially, it would be nice to have Tika-based pre-text-extraction be 
> able to consult an index and pick extracted text from there to fill up the 
> text extraction store. Of course, if the index doesn't have data for a given 
> binary, it should still fall back to extracting it.
> \[0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
