[
https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julian Reschke closed OAK-7353.
-------------------------------
> oak-run tika extraction should support getting assistance from stored indexed
> data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
> Key: OAK-7353
> URL: https://issues.apache.org/jira/browse/OAK-7353
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene, oak-run
> Reporter: Vikas Saurabh
> Assignee: Vikas Saurabh
> Priority: Major
> Fix For: 1.10, 1.9.3
>
>
> oak-run supports pre-text-extraction [0], which does a great job of doing
> text extraction in parallel so that the extracted text can be ingested later
> during indexing.
> But:
> * it still reaches out to the datastore, which, in the case of S3, could be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want to get pre-extracted text is reindexing - say, on
> an update of the index definition that won't impact extracted data from
> binaries (basically, updates that don't change the Tika configuration).
> In those cases, it's often possible that there is a version of the indexed
> data from an older version of the index def that can supply extracted text
> (as its binary properties are indexed as stored fields).
> So, essentially, it would be nice to have Tika-based pre-text-extraction be
> able to consult an index and pick extracted text from there to fill up the
> text extraction store. Of course, if the index doesn't have data for a given
> binary, it should still fall back to extracting it.
> [0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html
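
The lookup-then-fall-back behaviour requested in the description can be sketched roughly as follows. This is only an illustration in plain Java, not the oak-run implementation: the class name, the two function parameters, and the in-memory map standing in for the text extraction store are all hypothetical, and keying by a plain string identifier is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: fill a text store from an index first, extract only on a miss. */
class IndexAssistedTextExtractor {
    // Stands in for reading extracted text from a stored field of an existing index (assumption).
    private final Function<String, Optional<String>> indexLookup;
    // Stands in for the expensive path: fetch the binary from the datastore and run extraction.
    private final Function<String, String> extractor;
    // Stands in for the text extraction store that indexing later reads from.
    private final Map<String, String> textStore = new HashMap<>();
    private int indexHits;
    private int extractions;

    IndexAssistedTextExtractor(Function<String, Optional<String>> indexLookup,
                               Function<String, String> extractor) {
        this.indexLookup = indexLookup;
        this.extractor = extractor;
    }

    /** Returns the text for the given id, preferring the store, then the index, then extraction. */
    String textFor(String binaryId) {
        String cached = textStore.get(binaryId);
        if (cached != null) {
            return cached; // already in the store, nothing to do
        }
        Optional<String> stored = indexLookup.apply(binaryId);
        String text;
        if (stored.isPresent()) {
            text = stored.get(); // cheap path: no datastore access, no extraction
            indexHits++;
        } else {
            text = extractor.apply(binaryId); // index has no data: fall back to extraction
            extractions++;
        }
        textStore.put(binaryId, text);
        return text;
    }

    int indexHits() {
        return indexHits;
    }

    int extractions() {
        return extractions;
    }
}
```

The two counters are there only to show how often the cheap path is taken. In a real run the lookup would have to be backed by whatever stored data the older index version actually holds.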