Tim,
this seems interesting because it provides a big test dataset.
As I see it, they store PDFs/docs in WARC files, so there's source data for
parsing.
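For anyone unfamiliar with the format: a WARC record is just a small block of
headers followed by a Content-Length-delimited payload, so pulling the raw
PDF/DOC bytes out to feed to Tika is straightforward. A minimal stdlib-only
sketch (my own illustration, not a full WARC reader — real CommonCrawl WARC
files are gzip-compressed per record, which this skips; the function name
read_warc_record is mine):

```python
import io

def read_warc_record(stream):
    """Read one WARC record from an uncompressed binary stream.

    Returns (headers, payload), or None at end of stream.
    Minimal sketch: no per-record gzip handling, no error recovery.
    """
    # Skip the blank lines that separate records
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None
    assert line.strip().startswith(b"WARC/"), "not a WARC record"

    # Header block ends at the first empty line
    headers = {}
    for line in iter(stream.readline, b"\r\n"):
        if not line or line == b"\n":
            break
        key, _, value = line.decode("utf-8").partition(":")
        headers[key.strip()] = value.strip()

    # Payload is exactly Content-Length bytes: e.g. raw PDF/DOC
    # bytes that could then be handed to a Tika parser.
    length = int(headers["Content-Length"])
    payload = stream.read(length)
    return headers, payload

# Tiny fabricated record for illustration
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"Content-Type: application/pdf\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"%PDF-1.4\n"
    b"\r\n\r\n"
)
headers, payload = read_warc_record(io.BytesIO(record))
```

In the real pipeline one would loop read_warc_record over the stream and
dispatch each payload to Tika based on the Content-Type header.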

-- 
Best regards,
Konstantin Gribov

Fri, 3 Apr 2015 at 17:29, Allison, Timothy B. <[email protected]>:

> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages.  My guess is that this is text stripping from text-y formats.  Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format, WETT (WET-Tika), or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
>
> Cheers,
>
>           Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
