[jira] [Commented] (TIKA-1331) Find/configure a vm and gather initial corpus

Tim Allison (JIRA) Tue, 03 Feb 2015 11:24:56 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303801#comment-14303801
 ]


Tim Allison commented on TIKA-1331:
-----------------------------------

The uncompressed output of tika-batch for govdocs1 is 113G.  A database for 
each batch comparison will probably be on the order of 0.5G.

I now think we might want to split like so:
{noformat}
archived_corpora/
    govdocs1/
    commoncrawl1/

unzipped_corpora/
    govdocs1/
    commoncrawl1/

batch_runs/
    govdocs1/
        tika_1_5/
            files/
            summary_stats/
        tika_1_6/
        tika_1_7/
        comparisons/
            tika_1_5Vtika_1_6/
            ...
    commoncrawl1/
        tika_1_5/
....
{noformat}

Something along those lines, so we can put the batch runs on the new drive and 
either keep the archived corpora where they are or swap the two locations.
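
As a rough sketch of the split above, the skeleton could be created with a 
short Python script. The function name build_layout, giving every version the 
same files/ and summary_stats/ subdirectories, and pairing only adjacent 
versions under comparisons/ are assumptions for illustration; none of that is 
settled in this issue.

```python
from pathlib import Path


def build_layout(root, corpora, versions):
    """Create the proposed directory skeleton under root.

    Returns the path to the batch_runs directory, which is the part
    that would live on the new drive.
    """
    root = Path(root)
    for corpus in corpora:
        # Archived and unzipped copies of each corpus sit side by side.
        (root / "archived_corpora" / corpus).mkdir(parents=True, exist_ok=True)
        (root / "unzipped_corpora" / corpus).mkdir(parents=True, exist_ok=True)

        runs = root / "batch_runs" / corpus
        for version in versions:
            # Assumption: every version gets the same two subdirectories.
            for sub in ("files", "summary_stats"):
                (runs / version / sub).mkdir(parents=True, exist_ok=True)

        # Assumption: comparisons are made between adjacent versions only,
        # named <older>V<newer> as in the tika_1_5Vtika_1_6 example.
        for older, newer in zip(versions, versions[1:]):
            (runs / "comparisons" / f"{older}V{newer}").mkdir(
                parents=True, exist_ok=True
            )
    return root / "batch_runs"
```

For example, build_layout("/data", ["govdocs1", "commoncrawl1"], 
["tika_1_5", "tika_1_6", "tika_1_7"]) would create the tree shown above, where 
"/data" is a placeholder root. Because batch_runs is a separate top-level 
directory, it can be moved to or mounted on a different drive without touching 
the corpora.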

As for navigating the results, I'll attach a current static dump to TIKA-1334. 
 I have a rudimentary side-by-side file viewer, but much more remains to be done.

> Find/configure a vm and gather initial corpus
> ---------------------------------------------
>
>                 Key: TIKA-1331
>                 URL: https://issues.apache.org/jira/browse/TIKA-1331
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>         Attachments: tika-eval-vm-setup.tar.bz2
>
>
> Let's start with govdocs1 for this issue unless there are other easy options. 
>  Going forward, we'll want and need to add a more diverse set of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
