Yes, this would be great. If you need any PDFBox assistance, count me in.
-- John

> On 3 Apr 2015, at 05:35, [email protected] wrote:
>
> All,
> What do we think?
>
>> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
>> CommonCrawl currently has the WET format that extracts plain text from web
>> pages. My guess is that this is text stripping from text-y formats. Let me
>> know if I'm wrong!
>>
>> Would there be any interest in adding another format: WETT (WET-Tika) or
>> supplementing the current WET by using Tika to extract contents from binary
>> formats too: PDF, MSWord, etc.
>>
>> Julien Nioche kindly carved out 220 GB for us to experiment with on
>> TIKA-1302 on a Rackspace vm. But, I'm wondering now if it would make more
>> sense to have CommonCrawl run Tika as part of its regular process and make
>> the output available in one of your standard formats.
>>
>> CommonCrawl consumers would get Tika output, and the Tika dev community
>> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to
>> help prioritize bug fixes.
>>
>> Cheers,
>>
>> Tim
