Hi,

I have now published a first version of a tool to download binary data
of certain file types from the Common Crawl URL Index. Currently it
only supports the previous index format, so the data is from around
2012/2013, but even that provides tons of files for mass-testing our
frameworks.

I used a small part of the files to run some integration testing
locally and immediately found a few issues where specially formatted
files broke Apache POI.
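
In case anyone wants to reproduce this kind of mass-testing, a loop
along the following lines is enough. This is a rough, hypothetical
sketch written just for this mail (not my actual test code); it only
feeds spreadsheets to POI's WorkbookFactory and prints a stacktrace
whenever a file breaks POI:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class MassTestSketch {
    public static void main(String[] args) throws IOException {
        // Directory holding the downloaded files (placeholder path).
        Path dir = Paths.get("download");
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(p -> p.toString().endsWith(".xls")
                           || p.toString().endsWith(".xlsx"))
                 .forEach(MassTestSketch::tryParse);
        }
    }

    private static void tryParse(Path file) {
        // Any exception here points at a document that breaks POI.
        try (Workbook wb = WorkbookFactory.create(file.toFile())) {
            System.out.println(file + ": " + wb.getNumberOfSheets() + " sheets");
        } catch (Exception e) {
            System.err.println("Failed on " + file);
            e.printStackTrace();
        }
    }
}

The same pattern works for the other POI formats (HWPF, HSLF, ...) or,
more generally, with Tika's AutoDetectParser.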

The project is available at
https://github.com/centic9/CommonCrawlDocumentDownload. It has options
for downloading files directly as well as for first retrieving a list
of all interesting files and downloading them later. It should also be
easy to change it to process the files on-the-fly, e.g. if you want to
save the estimated >300 GB of disk space it would otherwise need to
store all the files that are interesting for Apache POI testing.
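
For illustration, here is a minimal, hypothetical sketch of the kind of
download step involved (the class name and base URL are made up for
this mail and are not the project's actual API): the URL index tells
you which archive file a document lives in plus its byte offset and
length, and you then fetch just that range via an HTTP Range request:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class RangeDownloadSketch {
    // Assumed base URL of the public crawl data bucket; adjust as needed.
    private static final String BASE_URL =
            "https://aws-publicdatasets.s3.amazonaws.com/";

    // Fetch only the byte range of one document from a (usually
    // gzip-compressed) archive file, using the offset/length that the
    // URL index reports for it.
    public static void download(String segmentPath, long offset, long length,
            Path target) throws Exception {
        URL url = new URL(BASE_URL + segmentPath);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range",
                "bytes=" + offset + "-" + (offset + length - 1));
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical values; real path/offset/length come from the
        // URL index lookup.
        download("common-crawl/example-segment.arc.gz", 123456L, 65536L,
                Paths.get("downloaded-record.gz"));
    }
}

Processing on-the-fly would then simply mean handing the stream to
POI/Tika instead of writing it to disk.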

Naturally, running this on Amazon EC2 machines can speed up the
downloading a lot, as the network access to Amazon S3 is much faster
from there.

Please give it a try if you are interested and let me know what you think.

Dominik.

On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
> All,
>
>   We just heard back from a very active member of Common Crawl.  I don’t want 
> to clog up our dev lists with this discussion (more than I have!), but I do 
> want to invite all to participate in the discussion, planning and potential 
> patches.
>
>   If you’d like to participate, please join us here: 
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>   I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the 
> Subject line.  Please invite others who might have an interest in this work.
>
>          Best,
>
>                      Tim
>
> From: Allison, Timothy B.
> Sent: Tuesday, April 07, 2015 8:39 AM
> To: 'Stephen Merity'; common-cr...@googlegroups.com
> Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>
> Stephen,
>
>   Thank you very much for responding so quickly and for all of your work on 
> Common Crawl.  I don’t want to speak for all of us, but given the feedback 
> I’ve gotten so far from some of the dev communities, I think we would very 
> much appreciate the chance to be tested on a monthly basis as part of the 
> regular Common Crawl process.
>
>    I think we’ll still want to run more often in our own sandbox(es) on the 
> slice of CommonCrawl we have, but the monthly testing against new data, from 
> my perspective at least, would be a huge win for all of us.
>
>    In addition to parsing binaries and extracting text, Tika (via PDFBox, POI 
> and many others) can also offer metadata (e.g. exif from images), which users 
> of CommonCrawl might find of use.
>
>   I’ll forward this to some of the relevant dev lists to invite others to 
> participate in the discussion on the common-crawl list.
>
>
>   Thank you, again.  I very much look forward to collaborating.
>
>              Best,
>
>                          Tim
>
> From: Stephen Merity [mailto:step...@commoncrawl.org]
> Sent: Tuesday, April 07, 2015 3:57 AM
> To: common-cr...@googlegroups.com
> Cc: mattm...@apache.org; talli...@apache.org; dmei...@apache.org; 
> til...@apache.org; n...@apache.org
> Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>
> Hi Tika team!
>
> We'd certainly be interested in working with Apache Tika on such an 
> undertaking. At the very least, we're glad that Julien has provided you with 
> content to battle test Tika with!
>
> As you've noted, the text extraction performed to produce WET files is 
> focused primarily on HTML files, leaving many other file types uncovered. 
> The existing text extraction is quite efficient and part of the same process 
> that generates the WAT file, meaning there's next to no overhead. Performing 
> extraction with Tika at the scale of Common Crawl would be an interesting 
> challenge. Running it as a one-off likely wouldn't be too much of a 
> challenge and would also give Tika the benefit of a wider variety of 
> documents (both well formed and malformed) to test against. Running it on a 
> frequent basis or as part of the crawl pipeline would be more challenging, 
> but it's something we can certainly discuss, especially if there's strong 
> community desire for it!
>
> On Fri, Apr 3, 2015 at 5:23 AM, 
> <tallison314...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web 
> pages.  My guess is that this is text stripping from text-y formats.  Let me 
> know if I'm wrong!
>
> Would there be any interest in adding another format, WETT (WET-Tika), or 
> supplementing the current WET by using Tika to extract contents from binary 
> formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on 
> TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace 
> VM.  But I'm wondering now whether it would make more sense to have 
> CommonCrawl run Tika as part of its regular process and make the output 
> available in one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community 
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
> help prioritize bug fixes.
>
> Cheers,
>
>           Tim
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl
