[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

Tim Allison (JIRA) Wed, 26 Nov 2014 08:29:46 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226393#comment-14226393 ]


Tim Allison edited comment on TIKA-1302 at 11/26/14 4:28 PM:
-------------------------------------------------------------

Looks like I'll need to rm govdocs1 zips to clear some space or link another 
drive! :)

[~jnioche], in the near term, would you be willing to scp some files to the VM 
we're building for this?  Longer term, once we get the process running in a 
conventional environment, it'd be great to move to Hadoop.

[~chrismattmann], same with you?  

Is a 300 GB-ish sample for both corpora reasonable?



was (Author: [email protected]):
Looks like I'll need to rm govdocs1 zips to clear some space or link another 
drive! :)

[~jnioche], in the near term, would you be willing to scp some files to the VM 
we're building for this?  Longer term, once we get the process running in a 
conventional environment, it'd be great to move to Hadoop.

[~chrismattmann], same with you?  




> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>         Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
