Chetan Mehrotra updated OAK-2953:
---------------------------------
Attachment: OAK-2953.patch
The attached [patch|^OAK-2953.patch] implements the required feature in oak-run. It
exposes a {{tika}} mode with three commands: {{extract}}, {{report}} and
{{generate}}.
Key points
# Tika parsers are not embedded in oak-run, as they bloat the jar and change
frequently. Instead, users need to download the tika-app jar, which embeds
all the required parsers
# The tool relies on a [csv|#csv] file to read the binary metadata and does not
require direct access to the repository. The csv can be generated programmatically,
and we provide a Groovy script for Sling based apps
# Extracted text is for now stored on the file system, similar to how
{{FileDataStore}} saves files. This has a downside when the extracted text is
small, so this aspect needs to be looked into
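To illustrate the third point, here is a minimal sketch of a {{FileDataStore}}-style layout, where the leading characters of the identifier become nested directories. The function name {{text_file_path}} is hypothetical, and the exact layout used by the patch is an assumption, not something confirmed here.

```python
from pathlib import Path


def text_file_path(store_root: Path, blob_id: str) -> Path:
    """Map a blob id to a nested file path (assumed FileDataStore-like layout)."""
    # The csv BlobId column carries a '#<length>' suffix; drop it for the file name.
    identifier = blob_id.split("#", 1)[0]
    if len(identifier) < 6:
        raise ValueError(f"identifier too short for a 3-level layout: {identifier!r}")
    # Three directory levels taken from the first six characters keep any
    # single directory from growing too large.
    return store_root / identifier[0:2] / identifier[2:4] / identifier[4:6] / identifier


if __name__ == "__main__":
    print(text_file_path(Path("store"), "43844ed22d640a114134e5a25550244e8836c00c#28705"))
```

With one file per binary, a very short extracted text still costs a whole file plus its directory entries, which is the downside mentioned above.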
/cc [~tmueller] [~alex.parvulescu] Can you review? The patch is a bit large. The key
parts to review are
* {{DataStoreTextWriter}} - Manages how the extracted text is stored. Has some
optimizations for binaries where the extracted text is empty or an error occurred in
parsing
* {{TextExtractor}} - Performs the extraction using a thread pool
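The general shape of the two parts listed above can be sketched as follows. This is not the code from the patch: {{extract_all}} and the {{parse}} callback are hypothetical names, and the status values are an assumption about how empty and failed extractions might be told apart.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple


def extract_all(
    blob_ids: Iterable[str],
    parse: Callable[[str], str],
    pool_size: Optional[int] = None,
) -> Dict[str, Tuple[str, str]]:
    """Run parse() for every blob id on a thread pool.

    Returns blob id -> (status, text), where status is 'ok', 'empty' or 'error'.
    """
    # Default to the number of cores, as the --pool-size option describes.
    workers = pool_size or os.cpu_count() or 1

    def run(blob_id: str) -> Tuple[str, str]:
        try:
            text = parse(blob_id)
        except Exception:
            # Record the failure so the binary is not parsed again later.
            return ("error", "")
        if not text.strip():
            # Nothing worth storing; remember only that the result was empty.
            return ("empty", "")
        return ("ok", text)

    ids = list(blob_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each id with its result.
        return dict(zip(ids, pool.map(run, ids)))
```

Keeping only a status for empty and failed binaries avoids writing a file for them, which is one way to read the optimization mentioned for {{DataStoreTextWriter}}.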
h2. Tika Command
{noformat}
Apache Jackrabbit Oak 1.4-SNAPSHOT
Non-option arguments:
tika [extract|report|generate]
  report   : Generates a summary report related to binary data
  extract  : Performs the text extraction
  generate : Generates the csv data file based on configured NodeStore/BlobStore

Option                  Description
------                  -----------
-?, -h, --help          show help
--data-file <File>      Data file in csv format containing the
                          binary metadata
--fds-path <File>       Path of directory used by FileDataStore
--nodestore             NodeStore detail
                          /path/to/oak/repository |
                          mongodb://host:port/database
--path                  Path in repository under which the
                          binaries would be searched
--pool-size <Integer>   Size of the thread pool used to
                          perform text extraction. Defaults to
                          number of cores on the system
--store-path <File>     Path of directory used to store
                          extracted text content
--tika-config <File>    Tika config file path
{noformat}
h3. Report
The tool can generate a summary report from a [csv|#csv] file
bq. java -jar target/oak-run.jar tika --data-file /path/to/binary-stats.csv
report
The report provides a summary like
{noformat}
14:39:05.402 [main] INFO o.a.j.o.p.tika.TextExtractorMain - MimeType Stats
Total size : 89.3 MB
Total indexed size : 3.4 MB
Total count : 1048
Type                          | Indexed | Supported | Count | Size
___________________________________________________________________________________
application/epub+zip          | true    | true      | 1     | 3.4 MB
image/png                     | false   | true      | 544   | 40.2 MB
image/jpeg                    | false   | true      | 444   | 34.0 MB
image/tiff                    | false   | true      | 11    | 6.1 MB
application/x-indesign        | false   | false     | 1     | 3.7 MB
application/octet-stream      | false   | false     | 39    | 1.2 MB
application/x-shockwave-flash | false   | false     | 4     | 372.2 kB
application/pdf               | false   | false     | 3     | 168.3 kB
video/quicktime               | false   | false     | 1     | 95.9 kB
{noformat}
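The Count and Size columns of such a report are a per-mime-type aggregation over the csv. A rough sketch of that aggregation is shown here; {{mime_type_stats}} is a hypothetical name, sorting by size is an arbitrary choice, and the Indexed and Supported columns are left out because they depend on the Tika configuration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def mime_type_stats(csv_text: str) -> List[Tuple[str, Dict[str, int]]]:
    """Aggregate count and total size per mime type, largest total size first."""
    stats: Dict[str, Dict[str, int]] = defaultdict(lambda: {"count": 0, "size": 0})
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or truncated lines
        # Column 2 is the length in bytes, column 3 is jcr:mimeType.
        length, mime_type = int(row[1]), row[2]
        stats[mime_type]["count"] += 1
        stats[mime_type]["size"] += length
    return sorted(stats.items(), key=lambda item: item[1]["size"], reverse=True)


if __name__ == "__main__":
    sample = 'id1#100,100,"image/png",,"/a"\nid2#50,50,"application/pdf",,"/b"\n'
    for mime_type, totals in mime_type_stats(sample):
        print(mime_type, totals["count"], totals["size"])
```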
h3. Extraction
Extraction can be performed via the following command
bq. java -cp oak-run-1.4-SNAPSHOT.jar:tika-app-1.8.jar
org.apache.jackrabbit.oak.run.Main tika --data-file binary-stats.csv
--store-path ./store --fds-path /path/to/datastore extract
You would need to provide the tika-app jar which contains all the parsers. It
can be downloaded from [here|https://tika.apache.org/download.html]
{anchor:csv}
h3. CSV File Format
Users can provide a CSV file which contains the details about the binaries
{noformat}
43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/dam/geometrixx-outdoors/activities/jcr:content/folderThumbnail/jcr:content"
43844ed22d640a114134e5a25550244e8836c00c#28705,28705,"application/octet-stream",,"/content/dam/geometrixx-outdoors/activities/snowboarding/jcr:content/folderThumbnail/jcr:content"
...
{noformat}
The columns are in the following order
# BlobId - Value of [Jackrabbit
ContentIdentity|http://jackrabbit.apache.org/api/2.0/org/apache/jackrabbit/api/JackrabbitValue.html]
# Length
# jcr:mimeType
# jcr:encoding
# path of parent node
If you are using Sling then the csv can be generated with the script at [1]
[1] https://gist.github.com/chetanmeh/be66363172532e09ee7d
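For anyone consuming the file from their own scripts, the column order listed above can be read roughly like this. {{BinaryRecord}} and {{read_records}} are hypothetical names for illustration and are not part of oak-run.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class BinaryRecord:
    blob_id: str             # Jackrabbit ContentIdentity value
    length: int              # size of the binary in bytes
    mime_type: str           # jcr:mimeType
    encoding: Optional[str]  # jcr:encoding, None when the column is empty
    path: str                # path of the parent node


def read_records(lines: Iterable[str]) -> Iterator[BinaryRecord]:
    """Yield one BinaryRecord per well-formed csv line."""
    for row in csv.reader(lines):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        blob_id, length, mime_type, encoding, path = row
        yield BinaryRecord(blob_id, int(length), mime_type, encoding or None, path)
```

The csv module handles the quoted fields, so a path containing a comma would still parse as a single column.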
> Implement text extractor as part of oak-run
> -------------------------------------------
>
> Key: OAK-2953
> URL: https://issues.apache.org/jira/browse/OAK-2953
> Project: Jackrabbit Oak
> Issue Type: Sub-task
> Components: run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Fix For: 1.3.0
>
> Attachments: OAK-2953.patch
>
>
> Implement a crawler and indexer which can find out all binary content in
> repository under certain path and extracts text from them and store them
> somewhere