[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219371#comment-14219371
]
Tim Allison edited comment on TIKA-1302 at 11/20/14 1:59 PM:
-------------------------------------------------------------
HPC is way beyond the current status of tika-batch, which is initially aimed at
conventional/single-box computing. I heartily welcome tika-batch-hadoop and
any other tika-batch-HPC packages!
If you do want to join in the effort on tika-batch, please do! I need plenty
of help with code review, unit tests, usability and edge-case (i.e., bug)
discovery. I'd also love to halve the amount of code while keeping the
robustness, extensibility and logging.
You can grab my dev version of tika-batch from my GitHub
[fork|https://github.com/tballison/tika/tree/TIKA-1302]. See some background
on the [wiki|https://wiki.apache.org/tika/TikaBatchOverview]. I finished an
initial integration with tika-app, and you should be able to run tika-app with:
{noformat}
java -jar tika-app.jar <srcDirectory>
{noformat}
That will iterate through the srcDirectory and write the output files to a
directory named "output".
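As a rough sanity check after such a run, one could compare the number of files in the source directory with the number written to "output". The sketch below is only an illustration: `SRC_DIR`, its `./docs` default, and the `count_files` helper are hypothetical names, not part of tika-batch, and the two counts need not match if some files fail or are skipped.

```shell
# SRC_DIR is a hypothetical placeholder for <srcDirectory>;
# "output" is the output directory name mentioned above.
SRC_DIR="${SRC_DIR:-./docs}"

# Hypothetical helper: count regular files under a directory.
# Prints 0 if the directory does not exist.
count_files() {
  if [ -d "$1" ]; then
    find "$1" -type f | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Run the batch only if the jar is present in the current directory.
if [ -f tika-app.jar ]; then
  java -jar tika-app.jar "$SRC_DIR"
fi

echo "input files:  $(count_files "$SRC_DIR")"
echo "output files: $(count_files output)"
```

A large gap between the two numbers would be a hint to look at the logs for files that were not processed.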
There are lots of command-line arguments available. -I'm going to update the
usage [wiki|http://wiki.apache.org/tika/TikaBatchUsage] shortly, but the usual
-? from the app will give you some of the options.- I've updated the
TikaBatchUsage wiki just now. Let me know if you have questions.
> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Components: cli, general, server
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)