[COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

Allison, Timothy B. Tue, 07 Apr 2015 05:50:14 -0700

All,

  We just heard back from a very active member of Common Crawl.  I don’t want 
to clog up our dev lists with this discussion (more than I have!), but I do 
want to invite all to participate in the discussion, planning and potential 
patches.


  If you’d like to participate, please join us here: 
https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

  I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the 
Subject line.  Please invite others who might have an interest in this work.

         Best,

                     Tim

From: Allison, Timothy B.
Sent: Tuesday, April 07, 2015 8:39 AM
To: 'Stephen Merity'; [email protected]
Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?

Stephen,

  Thank you very much for responding so quickly and for all of your work on 
Common Crawl.  I don’t want to speak for all of us, but given the feedback I’ve 
gotten so far from some of the dev communities, I think we would very much 
appreciate the chance to be tested on a monthly basis as part of the regular 
Common Crawl process.

   I think we’ll still want to run more often in our own sandbox(es) on the 
slice of CommonCrawl we have, but the monthly testing against new data, from my 
perspective at least, would be a huge win for all of us.

   In addition to parsing binaries and extracting text, Tika (via PDFBox, POI 
and many others) can also offer metadata (e.g. EXIF from images), which users 
of CommonCrawl might find of use.

  I’ll forward this to some of the relevant dev lists to invite others to 
participate in the discussion on the common-crawl list.


  Thank you, again.  I very much look forward to collaborating.

             Best,

                         Tim

From: Stephen Merity [mailto:[email protected]]
Sent: Tuesday, April 07, 2015 3:57 AM
To: [email protected]
Cc: [email protected]; [email protected]; [email protected]; 
[email protected]; [email protected]
Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?

Hi Tika team!

We'd certainly be interested in working with Apache Tika on such an 
undertaking. At the very least, we're glad that Julien has provided you with 
content to battle test Tika with!

As you've noted, the text extraction performed to produce WET files is focused 
primarily on HTML files, leaving many other file types not covered. The 
existing text extraction is quite efficient and part of the same process that 
generates the WAT file, meaning there's next to no overhead. Performing 
extraction with Tika at the scale of Common Crawl would be an interesting 
challenge. Running it as a once off wouldn't likely be too much of a challenge 
and would also give Tika the benefit of a wider variety of documents (both well 
formed and malformed) to test against. Running it on a frequent basis or as 
part of the crawl pipeline would be more challenging but something we can 
certainly discuss, especially if there's strong community desire for it!

On Fri, Apr 3, 2015 at 5:23 AM, <[email protected]> wrote:
CommonCrawl currently has the WET format that extracts plain text from web 
pages.  My guess is that this is text stripping from text-y formats.  Let me 
know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or 
supplementing the current WET by using Tika to extract content from binary 
formats too (PDF, MSWord, etc.)?

Julien Nioche kindly carved out 220 GB for us to experiment with on 
TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace VM.  
But, I'm wondering now if it would make more sense to have CommonCrawl run Tika 
as part of its regular process and make the output available in one of your 
standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community 
(including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
help prioritize bug fixes.

Cheers,

          Tim

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl
