[jira] [Comment Edited] (OAK-2953) Implement text extractor as part of oak-run

Chetan Mehrotra (JIRA) Wed, 03 Jun 2015 02:50:59 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570551#comment-14570551
 ]


Chetan Mehrotra edited comment on OAK-2953 at 6/3/15 9:49 AM:
--------------------------------------------------------------

bq.  It's a bit of a hack, re-using the directory structure of the data store 
for example, or depending on a CSV file

I am running indexing on a setup and will gather stats on what the extracted 
text size looks like. Given the requirement, I do not think I can just use FDS 
for reading; an implementation is needed on the reading side as well. For 
example, the writer keeps a note of the blobIds which resulted in empty text or 
an error, and these are stored in a file. So upon reading, the logic has to 
check there as well and return the correct response.
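
As a rough illustration of that read-side check (this is not the code from the 
attached patch; the file names `blobs_empty.txt` and `blobs_error.txt`, the 
`<blobId>.txt` layout and all class and function names are assumptions made up 
for the sketch), the lookup could consult the recorded empty/error ids before 
falling back to the stored text:

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple


class ExtractionResult(Enum):
    SUCCESS = "success"      # extracted text is available
    EMPTY = "empty"          # extraction ran but produced no text
    ERROR = "error"          # extraction failed for this blob
    NOT_FOUND = "not_found"  # blob was never processed by the writer


def load_ids(path: Path) -> set:
    """Read one blobId per line; a missing file means nothing was recorded."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


class PreExtractedTextReader:
    """Hypothetical reader: <root>/<blobId>.txt plus two marker files."""

    def __init__(self, root: Path):
        self.root = root
        self.empty_ids = load_ids(root / "blobs_empty.txt")
        self.error_ids = load_ids(root / "blobs_error.txt")

    def get_text(self, blob_id: str) -> Tuple[ExtractionResult, Optional[str]]:
        # Check the marker files first so that known failures and known-empty
        # blobs get a definite answer instead of looking like "not extracted".
        if blob_id in self.error_ids:
            return ExtractionResult.ERROR, None
        if blob_id in self.empty_ids:
            return ExtractionResult.EMPTY, ""
        text_file = self.root / f"{blob_id}.txt"
        if text_file.exists():
            return ExtractionResult.SUCCESS, text_file.read_text(encoding="utf-8")
        return ExtractionResult.NOT_FOUND, None
```

The point of the separate result type is that a plain file lookup cannot tell 
"extraction gave no text" apart from "blob was never processed", which is why 
reading through FDS alone would not be enough.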

As for CSV that is a feature :)

The tool can also connect directly to the NodeStore as of now. However, the CSV 
feature would provide more benefit going forward:
# If the tool uses the NodeStore then the actual application cannot run (for 
Segment), as multiple processes cannot share the same NodeStore
# The CSV file can be split into multiple parts and the tool can then be run on 
multiple servers to speed up extraction. Going forward we can enhance it to 
merge the results. This avoids coupling to the NodeStore and hence the 
extraction can be run in parallel, as BlobStores are typically easily 
accessible by multiple processes
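
A minimal sketch of the splitting step, assuming the CSV has no header row and 
that each row can be processed independently (the function name, the part file 
naming and the example columns are hypothetical, not something the tool 
provides):

```python
import csv
from pathlib import Path
from typing import List


def split_csv(source: Path, out_dir: Path, parts: int) -> List[Path]:
    """Distribute the rows of `source` round-robin over `parts` CSV files."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{source.stem}-part{i}.csv" for i in range(parts)]
    handles = [p.open("w", newline="", encoding="utf-8") for p in paths]
    try:
        writers = [csv.writer(h) for h in handles]
        with source.open(newline="", encoding="utf-8") as src:
            for index, row in enumerate(csv.reader(src)):
                # Rows are treated as opaque; only their position decides
                # which part file they land in.
                writers[index % parts].writerow(row)
    finally:
        for h in handles:
            h.close()
    return paths
```

Each part file could then be handed to a separate tool instance on a different 
server, since every instance only needs access to the BlobStore.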


> Implement text extractor as part of oak-run
> -------------------------------------------
>
>                 Key: OAK-2953
>                 URL: https://issues.apache.org/jira/browse/OAK-2953
>             Project: Jackrabbit Oak
>          Issue Type: Sub-task
>          Components: run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.0
>
>         Attachments: OAK-2953.patch
>
>
> Implement a crawler and indexer which can find all binary content in the 
> repository under a certain path, extract text from it, and store that text 
> somewhere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
