[jira] [Updated] (OAK-2953) Implement text extractor as part of oak-run

Chetan Mehrotra (JIRA) Wed, 03 Jun 2015 02:20:07 -0700

     [ 
https://issues.apache.org/jira/browse/OAK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chetan Mehrotra updated OAK-2953:
---------------------------------
    Attachment: OAK-2953.patch

Attached a [patch|^OAK-2953.patch] implementing the required feature in oak-run.
It exposes a {{tika}} mode with three sub-commands: {{extract}}, {{report}} and
{{generate}}.

Key points

# Tika parsers are not embedded within oak-run (as they bloat the jar and change
frequently). Instead, users need to download the tika-app jar, which embeds
all the required parsers
# The tool relies on a [csv|#csv] file to read the binary metadata and does not
require direct access to the repository. The csv can be generated programmatically,
and we provide a Groovy script for Sling-based apps
# Extracted text is, for now, stored on the file system, similar to how
{{FileDataStore}} saves files. This has a downside when the extracted text is
small. This aspect needs to be looked into
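
To illustrate the storage layout mentioned in the last point, here is a minimal Python sketch of a {{FileDataStore}}-style path mapping: three nested directories taken from the first three character pairs of the content identifier. The function name {{extracted_text_path}} is hypothetical, and both the stripping of the {{#<length>}} suffix and the assumption that the patch uses exactly this layout are guesses for illustration, not taken from the patch.

```python
from pathlib import Path


def extracted_text_path(store_root: str, blob_id: str) -> Path:
    """Map a blob id to a file location under store_root (illustrative only)."""
    # Assumption: the '#<length>' suffix seen in the csv blob ids is not part
    # of the on-disk name, so it is dropped here.
    content_id = blob_id.split("#", 1)[0]
    if len(content_id) < 6:
        raise ValueError("blob id too short for a three-level layout: %r" % blob_id)
    # Three directory levels built from the first three pairs of characters,
    # followed by the full identifier as the file name.
    return Path(store_root, content_id[0:2], content_id[2:4], content_id[4:6], content_id)
```

With this layout every binary gets its own file, which is where the overhead for very small extracted text would come from.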

/cc [~tmueller] [~alex.parvulescu] Can you review? The patch is a bit large. The
key parts to review are:
* {{DataStoreTextWriter}} - Manages how the extracted text is stored. Has some
optimizations for binaries where the extracted text is empty or an error occurred
during parsing
* {{TextExtractor}} - Performs the extraction using a thread pool
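
The division of work described above can be sketched roughly as follows. This is a Python illustration of the general pattern, not the code in the patch: {{extract_all}} and {{extract_fn}} are hypothetical names, and the three result buckets (text, empty, failed) are only an assumption about how empty and failed extractions might be tracked separately.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_all(blob_ids, extract_fn, pool_size=None):
    """Run extract_fn over blob_ids on a thread pool and classify the outcomes."""
    # Mirrors the documented --pool-size default: number of cores on the system.
    workers = pool_size or os.cpu_count() or 1
    texts, empty, failed = {}, set(), set()

    def run(blob_id):
        # Catch errors per blob so one bad binary does not stop the whole run.
        try:
            return blob_id, extract_fn(blob_id), None
        except Exception as exc:
            return blob_id, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for blob_id, text, error in pool.map(run, blob_ids):
            if error is not None:
                failed.add(blob_id)   # parsing failed: remember the id only
            elif not text or not text.strip():
                empty.add(blob_id)    # nothing extracted: remember the id only
            else:
                texts[blob_id] = text
    return texts, empty, failed
```

Keeping only the ids for the empty and failed cases avoids writing a file per binary when there is no text worth storing.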

h2. Tika Command
{noformat}
Apache Jackrabbit Oak 1.4-SNAPSHOT
Non-option arguments:                                                         
tika [extract|report|generate]                                                
report   : Generates a summary report related to binary data                  
extract  : Performs the text extraction                                       
generate : Generates the csv data file based on configured NodeStore/BlobStore

Option                 Description                            
------                 -----------                            
-?, -h, --help         show help                              
--data-file <File>     Data file in csv format containing the 
                         binary metadata                      
--fds-path <File>      Path of directory used by FileDataStore
--nodestore            NodeStore detail                       
                         /path/to/oak/repository | mongodb:   
                         //host:port/database                 
--path                 Path in repository under which the     
                         binaries would be searched           
--pool-size <Integer>  Size of the thread pool used to        
                         perform text extraction. Defaults to 
                         number of cores on the system        
--store-path <File>    Path of directory used to store        
                         extracted text content               
--tika-config <File>   Tika config file path   
{noformat}

h3. Report
The tool can generate a summary report from a [csv|#csv] file:

bq. java -jar target/oak-run.jar tika --data-file /path/to/binary-stats.csv 
report

The report provides a summary like:
{noformat}
14:39:05.402 [main] INFO  o.a.j.o.p.tika.TextExtractorMain - MimeType Stats
        Total size         : 89.3 MB
        Total indexed size : 3.4 MB
        Total count        : 1048

               Type                 Indexed   Supported    Count       Size   
___________________________________________________________________________________
application/epub+zip              |      true|      true|  1       |    3.4 MB
image/png                         |     false|      true|  544     |   40.2 MB
image/jpeg                        |     false|      true|  444     |   34.0 MB
image/tiff                        |     false|      true|  11      |    6.1 MB
application/x-indesign            |     false|     false|  1       |    3.7 MB
application/octet-stream          |     false|     false|  39      |    1.2 MB
application/x-shockwave-flash     |     false|     false|  4       |  372.2 kB
application/pdf                   |     false|     false|  3       |  168.3 kB
video/quicktime                   |     false|     false|  1       |   95.9 kB
{noformat}
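
For readers who want a similar per-type breakdown outside the tool, the aggregation behind such a report can be approximated in a few lines of Python. This is only a sketch: {{summarize}} is a hypothetical helper, it counts every csv row (it does not de-duplicate repeated blob ids), and it does not compute the Indexed or Supported columns, which depend on the Tika and index configuration.

```python
from collections import defaultdict


def summarize(entries):
    """Aggregate (mime_type, length) pairs into (mime_type, count, total_size)."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for mime_type, length in entries:
        key = mime_type or "unknown"  # rows without jcr:mimeType are grouped together
        counts[key] += 1
        sizes[key] += length
    # Largest total size first; ties are broken by type name for stable output.
    return sorted(
        ((key, counts[key], sizes[key]) for key in counts),
        key=lambda row: (-row[2], row[0]),
    )
```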

h3. Extraction

Extraction can be performed via the following command:

bq. java -cp oak-run-1.4-SNAPSHOT.jar:tika-app-1.8.jar 
org.apache.jackrabbit.oak.run.Main tika --data-file binary-stats.csv 
--store-path ./store --fds-path /path/to/datastore  extract

You need to provide the tika-app jar, which contains all the parsers. It
can be downloaded from [here|https://tika.apache.org/download.html]

{anchor:csv}
h3. CSV File Format
The user can provide a CSV file that contains the details about each binary:
{noformat}
43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/dam/geometrixx-outdoors/activities/jcr:content/folderThumbnail/jcr:content"
43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/dam/geometrixx-outdoors/activities/snowboarding/jcr:content/folderThumbnail/jcr:content"
...
{noformat}

The columns are in the following order:
# BlobId - Value of [Jackrabbit ContentIdentity|http://jackrabbit.apache.org/api/2.0/org/apache/jackrabbit/api/JackrabbitValue.html]
# Length
# jcr:mimeType
# jcr:encoding
# path of parent node

If you are using Sling, the csv can be generated from [1]

[1] https://gist.github.com/chetanmeh/be66363172532e09ee7d

> Implement text extractor as part of oak-run
> -------------------------------------------
>
>                 Key: OAK-2953
>                 URL: https://issues.apache.org/jira/browse/OAK-2953
>             Project: Jackrabbit Oak
>          Issue Type: Sub-task
>          Components: run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.0
>
>         Attachments: OAK-2953.patch
>
>
> Implement a crawler and indexer which can find all binary content in the
> repository under a certain path, extract text from it, and store it
> somewhere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
