[
https://issues.apache.org/jira/browse/OAK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570551#comment-14570551
]
Chetan Mehrotra edited comment on OAK-2953 at 6/3/15 9:49 AM:
--------------------------------------------------------------
bq. It's a bit of a hack, re-using the directory structure of the data store
for example, or depending on a CSV file
I am performing indexing on a setup and will gather stats on what the extracted
text size looks like. Given the requirement, I do not think I can just use FDS
(FileDataStore) for reading; an implementation is needed on the reading side.
For example, the writer keeps a note of the blobIds that resulted in empty text
or an error, and stores them in a file. So upon reading, the logic has to check
there as well and return the correct response.
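To illustrate the read-side check described above, here is a minimal sketch in Python. It is not the code from the patch: the file names `blobs_empty.txt` and `blobs_error.txt`, the one-file-per-blobId layout and the `ExtractedTextReader` class are all assumptions made for the example.

```python
from enum import Enum
from pathlib import Path


class Result(Enum):
    SUCCESS = "success"
    EMPTY = "empty"
    ERROR = "error"
    NOT_FOUND = "not_found"


class ExtractedTextReader:
    """Looks up pre-extracted text for a blobId (illustrative only)."""

    def __init__(self, root):
        self.root = Path(root)
        # Hypothetical marker files kept by the writer, one blobId per line.
        self.empty_ids = self._load_ids("blobs_empty.txt")
        self.error_ids = self._load_ids("blobs_error.txt")

    def _load_ids(self, name):
        path = self.root / name
        if not path.exists():
            return set()
        lines = path.read_text(encoding="utf-8").splitlines()
        return {line.strip() for line in lines if line.strip()}

    def get_text(self, blob_id):
        # Consult the marker files first: such blobs have no text file on disk,
        # so a plain file lookup alone would wrongly report them as missing.
        if blob_id in self.empty_ids:
            return Result.EMPTY, ""
        if blob_id in self.error_ids:
            return Result.ERROR, None
        # Assumed layout: the extracted text is stored in a file named after the blobId.
        text_file = self.root / blob_id
        if text_file.exists():
            return Result.SUCCESS, text_file.read_text(encoding="utf-8")
        return Result.NOT_FOUND, None
```

The point of the sketch is only that a reader which lists the store directory cannot tell "not extracted yet" apart from "extracted, but empty or failed"; the extra lookup makes that distinction.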
As for CSV, that is a feature :)
The tool can also connect directly to the NodeStore as of now. However, the CSV
feature would provide more benefit going forward:
# If the tool uses the NodeStore then the actual application cannot run (for
Segment), as multiple processes cannot share the same NodeStore
# The CSV file can be split into multiple parts and the tool can then be run on
multiple servers to speed up extraction. Going forward we can enhance it to
merge the results. It avoids coupling to the NodeStore and hence can be run in
parallel, as BlobStores are typically easily accessible by multiple processes
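The splitting step mentioned in the second point could look roughly like the Python sketch below. It is an assumption-laden illustration, not part of the tool: the `split_csv` helper and the `part-N.csv` naming are made up here, rows are dealt out round-robin, and the input is assumed to have no header row.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, parts):
    """Distribute the rows of `source` round-robin over `parts` CSV files."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme for the per-server input files.
    paths = [out_dir / f"part-{i}.csv" for i in range(parts)]
    handles = [p.open("w", newline="", encoding="utf-8") for p in paths]
    try:
        writers = [csv.writer(h) for h in handles]
        with open(source, newline="", encoding="utf-8") as src:
            for index, row in enumerate(csv.reader(src)):
                # Row i goes to file i mod parts, so the parts stay balanced.
                writers[index % parts].writerow(row)
    finally:
        for h in handles:
            h.close()
    return paths
```

Each resulting file could then be handed to a separate run of the tool against the same BlobStore, with the outputs merged afterwards.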
> Implement text extractor as part of oak-run
> -------------------------------------------
>
> Key: OAK-2953
> URL: https://issues.apache.org/jira/browse/OAK-2953
> Project: Jackrabbit Oak
> Issue Type: Sub-task
> Components: run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Fix For: 1.3.0
>
> Attachments: OAK-2953.patch
>
>
> Implement a crawler and indexer which can find all binary content in the
> repository under a certain path, extract text from it, and store the text
> somewhere