[jira] [Commented] (TIKA-1331) Find/configure a vm and gather initial corpus

Tim Allison (JIRA) Wed, 26 Nov 2014 04:28:31 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226121#comment-14226121
 ]


Tim Allison commented on TIKA-1331:
-----------------------------------

All,

  With much gratitude to Rackspace, we now have a vm for this effort: 8 CPU, 
30GB RAM, 1 TB.  I was impressed with how easy it was to set this up.
 
  I've loaded Redhat 1.6, downloaded govdocs1 with wget last night, and am 
currently unzipping that corpus.  

  I'm thinking that the public directory structure will look something like 
this:

{noformat}
corpora/
   govdocs1/
         archive/ #use this for zips
         docs/ #use this for unzipped docs
         extracts/ #use this for tika runs
              tika_1_5/ #e.g.
                    logs/ #logs for that run
                    docs/ #the .json files that were generated by the run of Tika 1.5
                    results/ #for now, static html with statistics on the run, number of exceptions, etc.
                    notes.txt #any notes for the run
   othercorpus/
             ....
{noformat}
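
A minimal sketch of how that layout could be created for a corpus and a run, assuming Python's pathlib is available on the vm. The function name make_run_dirs and the root argument are hypothetical and only illustrate the proposed structure; nothing like this is set up yet.

```python
from pathlib import Path


def make_run_dirs(root, corpus, run):
    """Create the proposed skeleton for one corpus and one Tika run.

    root, corpus and run are illustrative parameters, e.g.
    root="/data", corpus="govdocs1", run="tika_1_5".
    """
    corpus_dir = Path(root) / "corpora" / corpus
    run_dir = corpus_dir / "extracts" / run

    dirs = [
        corpus_dir / "archive",  # zips
        corpus_dir / "docs",     # unzipped docs
        run_dir / "logs",        # logs for that run
        run_dir / "docs",        # .json files generated by the run
        run_dir / "results",     # static html with statistics on the run
    ]
    for d in dirs:
        # parents=True also creates corpora/, the corpus dir and extracts/
        d.mkdir(parents=True, exist_ok=True)

    # Empty placeholder for any notes on the run; left untouched if present
    (run_dir / "notes.txt").touch(exist_ok=True)
    return run_dir
```

Calling make_run_dirs(root, "govdocs1", "tika_1_5") a second time is harmless, so the same call could be reused when a new run (a different run name) is added under extracts/.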

I probably won't have a chance to go live for a few weeks.  And, by "going 
live", I just mean setting up some basic static html file viewers.

At least initially, this service will be unreliable; we may need to take it 
down to reconfigure, etc.

What should we call the vm?  tika-eval.apache.org?

> Find/configure a vm and gather initial corpus
> ---------------------------------------------
>
>                 Key: TIKA-1331
>                 URL: https://issues.apache.org/jira/browse/TIKA-1331
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>
> Let's start with govdocs1 for this issue unless there are other easy options. 
>  Going forward, we'll want and need to add a more diverse set of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
