Re: Resource Sharing Tika Corpus with Any23

Lewis John McGibbney Fri, 30 Nov 2018 17:14:37 -0800

Hi Tim,
Thanks for the reply... answers inline.

On 2018/11/30 19:22:23, Tim Allison <talli...@apache.org> wrote: 
> I think that'd be great.  Some questions:
> 
> 1) Would you use the same input docs that we're using or would you
> need/want a new TB drive for your input/output?


The same docs, I suspect. We *could* contribute the documents we use in our test 
suite as well:
https://github.com/apache/any23/tree/master/test-resources/src/test/resources
However, this is not really necessary for us to run Any23. Any23 will only 
attempt extractions on a small subset of the documents in the corpus.

> How much space will
> you need for your eval framework including outputs?

I wouldn't imagine any more than maybe 5GB disk space in all. Any23 has the 
ability to run Open Information Extraction (smart relationship extraction from 
text) and this tends to generate more triples. If we decided to turn this on, 
then it would probably get towards the 5GB mark. I wouldn't imagine any more 
than that though, Tim.

> 2) Would you be willing to coordinate with us and PDFBox and POI
> around release times?

I think so, yes. If anything, this would be an excellent thing for Any23. I think 
improved coordination and communication between the communities would be a very 
positive step.

> 3) Would you be running your processing every so often (around your
> releases) or would it be constant aside from our releases? 

Most likely the former. I am aware that the service is billed to someone's 
(your) card, so we would be looking to do only what is polite and acceptable. 
Running prior to releases, e.g. during review of a release candidate, would be 
really cool.

>  I ask
> because I'd like @Tobias Ospelt to have cycles for his fuzzing work
> when we're not getting ready for a release.
> 

That sounds fine to me. 
Thank you for the response.
