[jira] [Commented] (OAK-7353) oak-run tika extraction should support getting assistance from stored indexed data from a lucene index

Thomas Mueller (JIRA) Tue, 22 May 2018 04:53:22 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483828#comment-16483828
 ]


Thomas Mueller commented on OAK-7353:
-------------------------------------

> "Please ensure that no binaries are updated between csv generation and index 
>dump steps"

The big risk is that someone ignores this message.

Some ideas:

* Could we store the blob id in the index as well? Then, the "blob id to text" 
file can be generated from the index alone. It increases the index size. But if 
we only extract text for binaries larger than (for example) 4 KB, then the 
added space shouldn't be too large.

* If not: I assume in the normal case, accessing the nodes is needed anyway and 
not that slow. What about two options: (a) combine the two steps, (b) just do 
step 1, but not step 2. I wouldn't officially document running step 2 and 
combining the files, even though it could be implemented as a hidden feature 
(for testing).
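
To illustrate the first idea, here is a minimal sketch in plain Java. It is 
not Oak or Lucene code: the class BlobIdTextSketch, the IndexedBinary type, 
and the methods index() and blobIdToText() are hypothetical stand-ins, and the 
4 KB threshold is just the example value from above. It assumes that binaries 
at or below the threshold get neither stored text nor a stored blob id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlobIdTextSketch {
    // Example threshold from the comment above (assumption, not an Oak setting).
    static final long MIN_BLOB_SIZE = 4 * 1024;

    // Hypothetical stand-in for an index document with stored fields.
    static final class IndexedBinary {
        final String blobId;        // null if the blob id was not stored
        final String extractedText; // null if no text was stored

        IndexedBinary(String blobId, String extractedText) {
            this.blobId = blobId;
            this.extractedText = extractedText;
        }
    }

    // Index time: store blob id and text only for binaries above the threshold.
    static IndexedBinary index(String blobId, long size, String text) {
        if (size > MIN_BLOB_SIZE) {
            return new IndexedBinary(blobId, text);
        }
        return new IndexedBinary(null, null);
    }

    // Later: build the "blob id to text" mapping from the index alone,
    // without touching the node store or the datastore.
    static Map<String, String> blobIdToText(List<IndexedBinary> docs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (IndexedBinary doc : docs) {
            if (doc.blobId != null && doc.extractedText != null) {
                // The same blob may be referenced by several nodes; keep the first.
                result.putIfAbsent(doc.blobId, doc.extractedText);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<IndexedBinary> docs = new ArrayList<>();
        docs.add(index("blob-a", 10_000, "text of a large binary"));
        docs.add(index("blob-b", 100, "text of a tiny binary"));
        docs.add(index("blob-a", 10_000, "text of a large binary"));
        System.out.println(blobIdToText(docs));
    }
}
```

Because the mapping is read from the same index snapshot that holds the text, 
there is no window in which a binary can change between two separate steps.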

 

> oak-run tika extraction should support getting assistance from stored indexed 
> data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-7353
>                 URL: https://issues.apache.org/jira/browse/OAK-7353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, oak-run
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>
> oak-run supports pre-text-extraction [0], which does a great job at doing 
> text extraction in parallel so that it can be ingested later during 
> indexing.
> But:
> * it still reaches out to the datastore, which, in the case of S3, could be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want to get pre-extracted text is reindexing - say on 
> update of an index definition which won't impact extracted data from binaries 
> (basically updates which don't change the tika configuration).
> In those cases, it's often possible that there is a version of indexed data 
> from an older version of the index def that can supply extracted text (as its 
> binary properties are indexed as stored fields).
> So, essentially, it would be nice to have tika based pre-text-extraction be 
> able to consult an index and pick extracted text from there to fill up the text 
> extraction store. Of course, if the index doesn't have data for a given 
> binary, it should still fall back to extracting it.
> [0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
