[
https://issues.apache.org/jira/browse/TIKA-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226121#comment-14226121
]
Tim Allison commented on TIKA-1331:
-----------------------------------
All,
With much gratitude to Rackspace, we now have a VM for this effort: 8 CPUs,
30 GB RAM, and 1 TB of storage. I was impressed with how easy it was to set this up.
I've loaded Red Hat 1.6, I downloaded govdocs1 with wget last night, and I'm
currently unzipping that corpus.
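For anyone who wants to repeat the unzip step on another machine, here is a minimal sketch. It assumes the downloaded zips all sit in one directory and should all be extracted into one target directory; the function name `unzip_all` and both path parameters are hypothetical, not something already on the VM.

```python
import zipfile
from pathlib import Path


def unzip_all(archive_dir, docs_dir):
    """Extract every .zip found in archive_dir into docs_dir.

    Returns the number of archives that were extracted.
    Both directory arguments are illustrative; pass whatever paths you use.
    """
    archive_dir = Path(archive_dir)
    docs_dir = Path(docs_dir)
    # Create the target directory if it does not exist yet.
    docs_dir.mkdir(parents=True, exist_ok=True)

    count = 0
    # Sort so the archives are processed in a predictable order.
    for zip_path in sorted(archive_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(docs_dir)
        count += 1
    return count
```

Calling `unzip_all("archive", "docs")` would extract each archive in turn, keeping whatever internal folder structure the zips carry.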
I'm thinking that the public directory structure will look something like
this:
{noformat}
corpora/
    govdocs1/
        archive/              #use this for zips
        docs/                 #use this for unzipped docs
        extracts/             #use this for tika runs
            tika_1_5/         #e.g.
                logs/         #logs for that run
                docs/         #the .json files that were generated by the run of Tika 1.5
                results/      #for now, static html with statistics on the run, number of exceptions, etc.
                notes.txt     #any notes for the run
    othercorpus/
        ....
{noformat}
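A rough sketch of how the skeleton for one corpus and one Tika run could be created. The helper name `make_layout` and its parameters are hypothetical, and I'm assuming that logs/, docs/, results/ and notes.txt all live under the per-run directory inside extracts/, which is one reading of the listing above.

```python
from pathlib import Path


def make_layout(root, corpus, run):
    """Create the proposed directory skeleton for one corpus and one run.

    root, corpus and run are illustrative parameters, e.g.
    make_layout("/data", "govdocs1", "tika_1_5").
    Returns the path of the per-run directory.
    """
    corpus_dir = Path(root) / "corpora" / corpus
    run_dir = corpus_dir / "extracts" / run

    dirs = [
        corpus_dir / "archive",  # zips
        corpus_dir / "docs",     # unzipped docs
        run_dir / "logs",        # logs for that run
        run_dir / "docs",        # .json files generated by the run
        run_dir / "results",     # static html with statistics
    ]
    for d in dirs:
        # exist_ok makes the call safe to repeat for additional runs.
        d.mkdir(parents=True, exist_ok=True)

    # Empty placeholder for any notes on the run.
    (run_dir / "notes.txt").touch()
    return run_dir
```

Because existing directories are left alone, calling it again with a different run name would add a sibling run directory under extracts/ without touching the corpus itself.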
I probably won't have a chance to go live for a few weeks. And, by "going
live", I just mean setting up some basic static html file viewers.
At least initially, this service will be unreliable; I may need to take it down to
reconfigure, etc.
What should we call the VM? tika-eval.apache.org?
> Find/configure a vm and gather initial corpus
> ---------------------------------------------
>
> Key: TIKA-1331
> URL: https://issues.apache.org/jira/browse/TIKA-1331
> Project: Tika
> Issue Type: Sub-task
> Components: cli, general, server
> Reporter: Tim Allison
> Assignee: Tim Allison
>
> Let's start with govdocs1 for this issue unless there are other easy options.
> Going forward, we'll want and need to add a more diverse set of documents.