That's likely on my side, sorry, I'll take a look....
Dominik
Am 01.06.2015 16:51 schrieb "Allison, Timothy B." <[email protected]>:
> Dominik,
> Thank you for making this available! I'm trying to build/run now, and
> I'm getting this... is this user error?
>
>
>
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:20:
> error: package org.dstadler.commons.testing does not exist
> import org.dstadler.commons.testing.MockRESTServer;
> ^
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:21:
> error: package org.dstadler.commons.testing does not exist
> import org.dstadler.commons.testing.TestHelpers;
> ^
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/ExtensionsTest.java:31:
> error: package org.dstadler.commons.testing does not exist
> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Extensions.class);
> ^
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158:
> error: cannot find symbol
> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK,
> "text/plain", "Ok")) {
> ^
> symbol: class MockRESTServer
> location: class UtilsTest
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158:
> error: cannot find symbol
> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK,
> "text/plain", "Ok")) {
> ^
> symbol: class MockRESTServer
> location: class UtilsTest
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171:
> error: cannot find symbol
> try (MockRESTServer server = new
> MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
> ^
> symbol: class MockRESTServer
> location: class UtilsTest
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171:
> error: cannot find symbol
> try (MockRESTServer server = new
> MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
> ^
> symbol: class MockRESTServer
> location: class UtilsTest
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:179:
> error: cannot find symbol
> TestHelpers.assertContains(e, "500", "localhost",
> Integer.toString(server.getPort()));
> ^
> symbol: variable TestHelpers
> location: class UtilsTest
> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:205:
> error: package org.dstadler.commons.testing does not exist
> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Utils.class);
> ^
> 9 errors
> :compileTestJava FAILED
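
Errors like these typically mean a test-scoped dependency is missing from the build. One hypothetical fix in build.gradle is shown below; the coordinates and version are guesses, not taken from the project's actual build file, and the library may simply not be published yet, in which case it would need to be built and installed locally first:

```groovy
// Hypothetical build.gradle snippet -- coordinates and version are guesses.
repositories {
    mavenCentral()
    mavenLocal()   // pick up a locally built copy if it is not published
}

dependencies {
    // would provide org.dstadler.commons.testing (MockRESTServer, TestHelpers, ...)
    testCompile 'org.dstadler.commons:commons-dost:1.0.0'
}
```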
>
> -----Original Message-----
> From: Dominik Stadler [mailto:[email protected]]
> Sent: Wednesday, April 22, 2015 4:07 PM
> To: POI Developers List
> Cc: [email protected]; [email protected]; [email protected]
> Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika
> as part of CommonCrawl?
>
> Hi,
>
> I have now published a first version of a tool to download binary data
> of certain file types from the Common Crawl URL Index. Currently it
> only supports the previous index format, so the data is from around
> 2012/2013, but this also provides tons of files for mass-testing of
> our frameworks.
>
> I used a small part of the files to run some integration testing
> locally and immediately found a few issues where specially formatted
> files broke Apache POI.
>
> The project is currently available at
> https://github.com/centic9/CommonCrawlDocumentDownload. It has options
> for downloading files directly, as well as for first retrieving a list of
> all interesting files and downloading them later. It should also be easy
> to change it to process the files on-the-fly, if you want to avoid the
> estimated >300 GB of disk space it would need, for example, to store the
> files of interest for Apache POI testing.
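
As a rough illustration of the "first retrieve a list of interesting files" idea, filtering candidate URLs by file extension might look like the following. The class name and the extension set are made up for the example and are not the project's actual code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExtensionFilter {
    // Hypothetical set of extensions of interest for Apache POI testing
    static final Set<String> INTERESTING =
            Set.of(".doc", ".xls", ".ppt", ".docx", ".xlsx", ".pptx");

    // Return true if the URL ends with one of the interesting extensions
    // (case-insensitive, since crawled URLs use mixed case)
    static boolean isInteresting(String url) {
        String lower = url.toLowerCase();
        return INTERESTING.stream().anyMatch(lower::endsWith);
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/report.doc",
                "http://example.com/index.html",
                "http://example.com/budget.XLSX");
        // Keep only the Office documents; the HTML page is filtered out
        List<String> hits = urls.stream()
                .filter(ExtensionFilter::isInteresting)
                .collect(Collectors.toList());
        System.out.println(hits);
    }
}
```

In the real tool the URL list would come from the Common Crawl URL index rather than a hard-coded list, but the filtering step is the same shape.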
>
> Naturally, running this on Amazon EC2 machines can speed up the
> downloading considerably, since network access to Amazon S3 is much
> faster from there.
>
> Please give it a try if you are interested and let me know what you think.
>
> Dominik.
>
> On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <[email protected]>
> wrote:
> > All,
> >
> > We just heard back from a very active member of Common Crawl. I don’t
> want to clog up our dev lists with this discussion (more than I have!), but
> I do want to invite all to participate in the discussion, planning and
> potential patches.
> >
> > If you’d like to participate, please join us here:
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
> >
> > I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to
> the Subject line. Please invite others who might have an interest in this
> work.
> >
> > Best,
> >
> > Tim
> >
> > From: Allison, Timothy B.
> > Sent: Tuesday, April 07, 2015 8:39 AM
> > To: 'Stephen Merity'; [email protected]
> > Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
> >
> > Stephen,
> >
> > Thank you very much for responding so quickly and for all of your work
> on Common Crawl. I don’t want to speak for all of us, but given the
> feedback I’ve gotten so far from some of the dev communities, I think we
> would very much appreciate the chance to be tested on a monthly basis as
> part of the regular Common Crawl process.
> >
> > I think we’ll still want to run more often in our own sandbox(es) on
> the slice of CommonCrawl we have, but the monthly testing against new data,
> from my perspective at least, would be a huge win for all of us.
> >
> > In addition to parsing binaries and extracting text, Tika (via
> PDFBox, POI and many others) can also offer metadata (e.g. EXIF from
> images), which users of CommonCrawl might find useful.
> >
> > I’ll forward this to some of the relevant dev lists to invite others
> to participate in the discussion on the common-crawl list.
> >
> >
> > Thank you, again. I very much look forward to collaborating.
> >
> > Best,
> >
> > Tim
> >
> > From: Stephen Merity [mailto:[email protected]]
> > Sent: Tuesday, April 07, 2015 3:57 AM
> > To: [email protected]
> > Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> > Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
> >
> > Hi Tika team!
> >
> > We'd certainly be interested in working with Apache Tika on such an
> undertaking. At the very least, we're glad that Julien has provided you
> with content to battle test Tika with!
> >
> > As you've noted, the text extraction performed to produce WET files is
> focused primarily on HTML files, leaving many other file types uncovered.
> The existing text extraction is quite efficient and is part of the same
> process that generates the WAT files, meaning there's next to no overhead.
> Performing extraction with Tika at the scale of Common Crawl would be an
> interesting challenge. Running it as a one-off likely wouldn't be too much
> of a challenge, and it would also give Tika the benefit of a wider variety
> of documents (both well-formed and malformed) to test against. Running it
> on a frequent basis, or as part of the crawl pipeline, would be more
> challenging, but that's something we can certainly discuss, especially if
> there's strong community desire for it!
> >
> > On Fri, Apr 3, 2015 at 5:23 AM, <[email protected]> wrote:
> > CommonCrawl currently has the WET format that extracts plain text from
> web pages. My guess is that this is text stripping from text-y formats.
> Let me know if I'm wrong!
> >
> > Would there be any interest in adding another format, WETT (WET-Tika), or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MS Word, etc.?
> >
> > Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace
> VM. But I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
> >
> > CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
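
As a sketch of how those stacktraces could be collected and ranked, consider a small harness like the one below. The actual Tika parse call is stubbed out as a Runnable, and all class and method names here are illustrative, not from any existing tool:

```java
import java.util.HashMap;
import java.util.Map;

public class StacktraceCollector {
    // Count how often each failure (keyed by exception type and message)
    // occurs, so the most frequent parser bugs can be prioritized.
    final Map<String, Integer> failures = new HashMap<>();

    // 'parser' stands in for the real Tika call on one crawled document
    void process(String docName, Runnable parser) {
        try {
            parser.run();
        } catch (RuntimeException e) {
            String key = e.getClass().getName() + ": " + e.getMessage();
            failures.merge(key, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        StacktraceCollector c = new StacktraceCollector();
        c.process("ok.doc", () -> { /* parses cleanly */ });
        c.process("broken1.doc",
                () -> { throw new IllegalStateException("unexpected record"); });
        c.process("broken2.doc",
                () -> { throw new IllegalStateException("unexpected record"); });
        // Two documents hit the same bug, so it sorts to the top of the list
        System.out.println(c.failures);
    }
}
```

A real run would also keep the full stacktrace and a sample document per failure key, so a reproducing file can be attached to the bug report.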
> >
> > Cheers,
> >
> > Tim
> > --
> > You received this message because you are subscribed to the Google
> Groups "Common Crawl" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected].
> > To post to this group, send email to [email protected].
> > Visit this group at http://groups.google.com/group/common-crawl.
> > For more options, visit https://groups.google.com/d/optout.
> >
> >
> >
> > --
> > Regards,
> > Stephen Merity
> > Data Scientist @ Common Crawl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>