Re: Any interest in running Apache Tika as part of CommonCrawl?

John Hewson Sun, 05 Apr 2015 23:05:57 -0700

Yes, this would be great. If you need any PDFBox assistance, count me in.


-- John

> On 3 Apr 2015, at 05:35, [email protected] wrote:
> 
> All,
>   What do we think?
> 
>> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
>> CommonCrawl currently has the WET format that extracts plain text from web 
>> pages.  My guess is that this is text stripping from text-y formats.  Let me 
>> know if I'm wrong!
>> 
>> Would there be any interest in adding another format, WETT (WET-Tika), or 
>> supplementing the current WET by using Tika to extract contents from binary 
>> formats too (PDF, MSWord, etc.)?
>> 
>> Julien Nioche kindly carved out 220 GB for us to experiment with on 
>> TIKA-1302 on a Rackspace VM.  But I'm wondering now if it would make more 
>> sense to have CommonCrawl run Tika as part of its regular process and make 
>> the output available in one of your standard formats. 
>> 
>> CommonCrawl consumers would get Tika output, and the Tika dev community 
>> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
>> help prioritize bug fixes.
>> 
>> Cheers,
>> 
>>           Tim 
> 
