[ https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483888#comment-16483888 ]

Vikas Saurabh edited comment on OAK-7353 at 5/22/18 12:44 PM:
--------------------------------------------------------------

[~tmueller], 
bq. The big risk is that someone ignores this message.
indeed that's a risk - but not a "big" one afaict... it's certainly possible, 
but binaries don't usually get updated at the same path (usually the path has 
the uploaded filename in the hierarchy)
bq. Could we store the blob id in the index as well? Then, the "blob id to 
text" file can be generated from the index alone. It increases the index size. 
But if we only extract text for binaries larger than (for example) 4 KB, then 
the added space shouldn't be too large.
I'd prefer not to go this route for a few reasons:
* it pretty much implies that the Oak version being used at the customer's end 
indexes blobs - thus requiring "some" part to be backported, maybe to the 1.2 
branch (not sure how interested we would be in 1.0 though)
* even if the right Oak version is in use - for this feature to be useful, at 
least one reindex of a compatible index is required
* it imposes a somewhat "magic" constraint which we can, at best, document but 
can't really enforce

bq. I assume in the normal case, accessing the nodes is needed anyway and not 
that slow. What about two options: (a) combine the two steps, (b) just do step 
1, but not step 2. I wouldn't officially document running step 2 and combining 
the files, even though it could be implemented as a hidden feature (for 
testing).
The only big issue, imo, with combining blob id extraction with the index data 
dump is that blob id dump generation is currently done with a fake DS. The 
fake DS ensures that simply reading a blob id doesn't reach out to a 
potentially remote DS (usually S3). BUT, a combined call would then require 
some way to tell the process "use the fake DS generally BUT use the real DS 
for a particular hierarchy" (the particular hierarchy being the paths under 
the index def).
Surely, that's doable - but it felt to me that it complicates the usage and 
requires more engineering effort for what is practically an edge-case scenario 
(a binary getting updated at the same path)
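To make the "fake DS generally BUT real DS for a particular hierarchy" idea 
concrete, here is a minimal sketch in Python (Oak itself is Java, and its 
actual BlobStore/DataStore APIs differ - all class and method names below are 
hypothetical, purely to illustrate the dispatch logic being discussed):

```python
# Hypothetical sketch: a selective blob store that only reaches out to the
# real (possibly remote, e.g. S3) datastore when the blob is referenced from
# under an allowed path prefix (e.g. the index definition's hierarchy), and
# otherwise answers from a fake store that never touches the remote DS.

class RealBlobStore:
    """Stands in for a real, potentially remote datastore."""
    def __init__(self, blobs):
        self._blobs = blobs  # blob id -> bytes

    def read(self, blob_id):
        return self._blobs[blob_id]

class FakeBlobStore:
    """Never contacts the remote store; returns a placeholder."""
    def read(self, blob_id):
        return b""

class SelectiveBlobStore:
    """Dispatch to real or fake store based on the referring path."""
    def __init__(self, real, fake, allowed_prefix):
        self.real = real
        self.fake = fake
        self.allowed_prefix = allowed_prefix

    def read(self, blob_id, referring_path):
        if referring_path.startswith(self.allowed_prefix):
            return self.real.read(blob_id)   # e.g. index data dump
        return self.fake.read(blob_id)       # e.g. plain blob id walk

real = RealBlobStore({"b1": b"stored index data"})
store = SelectiveBlobStore(real, FakeBlobStore(), "/oak:index/lucene")

print(store.read("b1", "/oak:index/lucene/:data"))  # real DS is consulted
print(store.read("b1", "/content/dam/file.pdf"))    # fake DS, no remote call
```

The sketch also shows why this complicates usage: every read now needs the 
referring path threaded through, which plain blob id enumeration doesn't have 
today.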



> oak-run tika extraction should support getting assistance from stored indexed 
> data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-7353
>                 URL: https://issues.apache.org/jira/browse/OAK-7353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, oak-run
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>
> oak-run supports pre-text-extraction \[0] which does a great job of doing 
> text extraction in parallel so that it can be ingested later during 
> indexing.
> But:
> * it still reaches out to the datastore which, in the case of S3, could be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want pre-extracted text is reindexing - say on an 
> update of the index definition which won't impact the extracted data from 
> binaries (basically updates which don't change the Tika configuration).
> In those cases, it's often possible that there is a version of indexed data 
> from an older version of the index def that can supply the extracted text 
> (as its binary properties are indexed as stored fields).
> So, essentially, it would be nice to have Tika-based pre-text-extraction be 
> able to consult an index and pick the extracted text from there to fill up 
> the text extraction store. Of course, if the index doesn't have data for a 
> given binary, it should still fall back to extracting it.
> \[0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html
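The consult-the-index-first behaviour asked for above boils down to a lookup 
with fallback. A minimal Python sketch of that logic (all names here - 
`index_store`, `text_store`, `tika_extract` - are illustrative stand-ins, not 
oak-run's actual classes or APIs):

```python
# Sketch: fill the text extraction store for a blob, preferring text already
# stored in an older index, and only falling back to (expensive) Tika
# extraction when the index has no data for that binary.

def pre_extract(blob_id, index_store, text_store, tika_extract):
    if blob_id in text_store:           # already extracted in an earlier run
        return text_store[blob_id]
    text = index_store.get(blob_id)     # stored field from the old index
    if text is None:
        text = tika_extract(blob_id)    # fallback: real extraction
    text_store[blob_id] = text
    return text

index_store = {"blob-a": "text from old index"}  # old index has blob-a only
text_store = {}
extracted = []                                   # track real Tika calls

def tika_extract(blob_id):
    extracted.append(blob_id)
    return "tika text for " + blob_id

print(pre_extract("blob-a", index_store, text_store, tika_extract))
print(pre_extract("blob-b", index_store, text_store, tika_extract))
print(extracted)  # only the index miss (blob-b) triggered real extraction
```

Under this scheme, the expensive path (datastore access plus Tika) is hit 
only for binaries the old index knows nothing about, which is the whole point 
of the improvement.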



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)