Hi, I have now published a first version of a tool to download binary data of certain file types from the Common Crawl URL Index. Currently it only supports the previous index format, so the data is from around 2012/2013, but that still provides tons of files for mass-testing our frameworks.
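To illustrate the kind of filtering involved in building a list of "interesting" files, here is a minimal Python sketch. The extension set, entry layout, and function names are my own illustration, not the project's actual API; the real tool works against the Common Crawl URL index format itself.

```python
# Sketch: pick out URLs of Office/PDF documents from Common Crawl
# index entries, so their payloads can be downloaded later.
from urllib.parse import urlparse

# Extensions of interest for Apache POI / PDFBox style testing (illustrative).
INTERESTING_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf"}

def is_interesting(url: str) -> bool:
    """Return True if the URL's path ends with one of the extensions."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in INTERESTING_EXTENSIONS)

def filter_index_entries(entries):
    """Yield only entries whose URL looks like a document we want.

    Each entry is assumed to be a dict with at least a 'url' key, plus
    whatever location fields are needed to fetch the payload later."""
    for entry in entries:
        if is_interesting(entry["url"]):
            yield entry
```

Building the list first and downloading later, as the tool does, lets you resume or parallelize the (large) download step separately from the cheap index scan.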
I used a small part of the files to run some integration tests locally and immediately found a few issues where specially formatted files broke Apache POI.

The project is available at https://github.com/centic9/CommonCrawlDocumentDownload. It has options for downloading files directly, as well as for first retrieving a list of all interesting files and downloading them later. It should also be easy to change it to process the files on the fly, if you want to save the estimated >300 GB of disk space that storing all files of interest for Apache POI testing would need.

Naturally, running this on Amazon EC2 machines can speed up the downloading a lot, as network access to Amazon S3 is much faster from there.

Please give it a try if you are interested and let me know what you think.

Dominik.

On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
> All,
>
> We just heard back from a very active member of Common Crawl. I don’t want
> to clog up our dev lists with this discussion (more than I have!), but I do
> want to invite all to participate in the discussion, planning and potential
> patches.
>
> If you’d like to participate, please join us here:
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
> I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the
> Subject line. Please invite others who might have an interest in this work.
>
> Best,
>
> Tim
>
> From: Allison, Timothy B.
> Sent: Tuesday, April 07, 2015 8:39 AM
> To: 'Stephen Merity'; common-cr...@googlegroups.com
> Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>
> Stephen,
>
> Thank you very much for responding so quickly and for all of your work on
> Common Crawl.
> I don’t want to speak for all of us, but given the feedback I’ve gotten so
> far from some of the dev communities, I think we would very much appreciate
> the chance to be tested on a monthly basis as part of the regular Common
> Crawl process.
>
> I think we’ll still want to run more often in our own sandbox(es) on the
> slice of Common Crawl we have, but the monthly testing against new data,
> from my perspective at least, would be a huge win for all of us.
>
> In addition to parsing binaries and extracting text, Tika (via PDFBox, POI
> and many others) can also offer metadata (e.g. EXIF from images), which
> users of Common Crawl might find of use.
>
> I’ll forward this to some of the relevant dev lists to invite others to
> participate in the discussion on the common-crawl list.
>
> Thank you, again. I very much look forward to collaborating.
>
> Best,
>
> Tim
>
> From: Stephen Merity [mailto:step...@commoncrawl.org]
> Sent: Tuesday, April 07, 2015 3:57 AM
> To: common-cr...@googlegroups.com
> Cc: mattm...@apache.org; talli...@apache.org; dmei...@apache.org;
> til...@apache.org; n...@apache.org
> Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>
> Hi Tika team!
>
> We'd certainly be interested in working with Apache Tika on such an
> undertaking. At the very least, we're glad that Julien has provided you
> with content to battle-test Tika with!
>
> As you've noted, the text extraction performed to produce WET files is
> focused primarily on HTML files, leaving many other file types uncovered.
> The existing text extraction is quite efficient and part of the same
> process that generates the WAT file, meaning there's next to no overhead.
> Performing extraction with Tika at the scale of Common Crawl would be an
> interesting challenge.
> Running it as a one-off likely wouldn't be too much of a challenge and
> would also give Tika the benefit of a wider variety of documents (both
> well-formed and malformed) to test against. Running it on a frequent basis
> or as part of the crawl pipeline would be more challenging, but something
> we can certainly discuss, especially if there's strong community desire
> for it!
>
> On Fri, Apr 3, 2015 at 5:23 AM, <tallison314...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages. My guess is that this is text stripping from text-y formats. Let me
> know if I'm wrong!
>
> Would there be any interest in adding another format, WETT (WET-Tika), or
> supplementing the current WET by using Tika to extract contents from
> binary formats too: PDF, MS Word, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> VM. But I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stack traces
> to help prioritize bug fixes.
>
> Cheers,
>
> Tim
>
> --
> You received this message because you are subscribed to the Google Groups
> "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to common-crawl+unsubscr...@googlegroups.com.
> To post to this group, send email to common-cr...@googlegroups.com.
> Visit this group at http://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org