Tim, this seems interesting because it provides a big test dataset. As far as I can see, they store PDFs/docs in WARC files, so there's source data for parsing.
--
Best regards,
Konstantin Gribov

Fri, 3 Apr 2015 at 17:29, Allison, Timothy B. <[email protected]>:

> All,
>
> What do you think?
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
>
> CommonCrawl currently has the WET format that extracts plain text from web
> pages. My guess is that this is text stripping from text-y formats. Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika), or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> VM. But I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stack traces
> to help prioritize bug fixes.
>
> Cheers,
>
> Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
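For reference, here is a minimal sketch of pulling payloads out of a WARC stream so they can be handed to a parser like Tika. This assumes an uncompressed stream and a well-formed record layout; Common Crawl's files are per-record gzipped, so in practice a library such as warcio (Python) or jwarc (Java) is the right tool — the function and record names below are illustrative only:

```python
import io

def iter_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream.

    Minimal sketch: each record is a "WARC/1.0" version line, header
    lines, a blank line, then Content-Length bytes of payload.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        line = line.strip()
        if not line:
            continue  # skip blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line)
        headers = {}
        while True:
            hline = stream.readline().strip()
            if not hline:
                break  # blank line ends the header block
            key, _, value = hline.partition(b":")
            headers[key.decode("ascii").lower()] = value.strip().decode("ascii")
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload

def make_record(rtype, payload):
    """Build one uncompressed WARC record for demonstration."""
    return (b"WARC/1.0\r\n"
            b"WARC-Type: " + rtype + b"\r\n"
            b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
            b"\r\n" + payload + b"\r\n\r\n")

# Two fake records standing in for crawled documents; the payload bytes
# are what you would feed to Tika for content extraction.
blob = (make_record(b"response", b"%PDF-1.4 ...")
        + make_record(b"response", b"plain text"))
records = list(iter_warc_records(io.BytesIO(blob)))
```

Each `payload` here is the raw document bytes, which is exactly the input Tika would need for the proposed WETT-style extraction.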
