[ 
https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian Reschke closed OAK-7353.
-------------------------------

> oak-run tika extraction should support getting assistance from stored indexed 
> data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-7353
>                 URL: https://issues.apache.org/jira/browse/OAK-7353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, oak-run
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>             Fix For: 1.10, 1.9.3
>
>
> oak-run supports pre-text-extraction [0], which does a great job of doing 
> text extraction in parallel so that the text can be ingested later during 
> indexing.
> But:
> * it still reaches out to the datastore, which, in the case of S3, can be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want to get pre-extracted text is reindexing - say, on 
> an update of the index definition which won't impact extracted data from 
> binaries (basically, updates which don't change the tika configuration).
> In those cases, it's often possible that there is a version of the indexed 
> data from an older version of the index def that can supply the extracted text 
> (as its binary properties are indexed as stored fields).
> So, essentially, it would be nice to have tika-based pre-text-extraction be 
> able to consult an index and pick extracted text from there to fill up the 
> text extraction store. Of course, if the index doesn't have data for a given 
> binary, it should still fall back to extracting it.
> [0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html
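
The flow requested in the description (consult an existing index first, fall back to extraction only on a miss) can be sketched roughly as below. This is only an illustration of the idea, not oak-run's actual code: the class name, `fillStore`, and the two function parameters are hypothetical stand-ins. In Oak the lookup would presumably read stored fields from a Lucene index and the extractor would call Tika against the datastore; neither is modelled here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class PreExtractSketch {

    // Fills a (hypothetical) text-extraction store keyed by blob id.
    // indexLookup: assumed to return text already stored in an older index, if any.
    // extractor:   assumed to do the expensive extraction from the binary.
    static Map<String, String> fillStore(
            List<String> blobIds,
            Function<String, Optional<String>> indexLookup,
            Function<String, String> extractor) {
        Map<String, String> store = new LinkedHashMap<>();
        for (String blobId : blobIds) {
            // Prefer the indexed copy; only invoke the extractor on a miss.
            String text = indexLookup.apply(blobId)
                    .orElseGet(() -> extractor.apply(blobId));
            store.put(blobId, text);
        }
        return store;
    }

    public static void main(String[] args) {
        // Stand-in for an old index that has stored text for one binary only.
        Map<String, String> oldIndex = Map.of("blob-1", "text stored in the old index");

        Map<String, String> store = fillStore(
                List.of("blob-1", "blob-2"),
                id -> Optional.ofNullable(oldIndex.get(id)),
                id -> "extracted(" + id + ")");

        store.forEach((id, text) -> System.out.println(id + " -> " + text));
    }
}
```

Because the fallback is wrapped in `orElseGet`, the extractor is never called for binaries the index can answer for, which is the point of the request: those binaries need neither a datastore read nor a Tika run.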



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
