Makes sense.

My thought was to continue (or rather, to develop and mature) that comparison
work on the Rackspace VM.

This could be an ancillary source of information.  It would come in monthly and 
wouldn't be as timely as being able to do our own runs, and it would only cover 
a single version, but I think it would still be quite valuable. 

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]] 
Sent: Friday, April 03, 2015 10:21 AM
To: [email protected]
Subject: Re: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

I'm mostly interested in differences between crawls with different 
PDFBox versions.

And I already have one change where I wonder if anything will happen: 
the text stripper code has this

wordSpacing == Float.NaN

however that is always false, and I wonder what differences will come up 
when using the correct code, which is

Float.isNaN(wordSpacing)
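
For illustration, here is a minimal, self-contained Java sketch of why the first comparison can never be true: under IEEE 754 rules, NaN compares unequal to every value, including itself. The class and method names (NaNCheckSketch, brokenCheck, correctCheck) are made up for this example and are not the actual PDFBox text stripper code; only Float.NaN and Float.isNaN come from the JDK.

```java
// Hypothetical demo class, not part of PDFBox.
public class NaNCheckSketch {

    // Mirrors the pattern quoted above: an equality test against Float.NaN.
    // NaN is never equal to anything (not even itself), so this always
    // returns false, whatever value is passed in.
    static boolean brokenCheck(float wordSpacing) {
        return wordSpacing == Float.NaN;
    }

    // The corrected form: Float.isNaN returns true exactly when the
    // argument is NaN.
    static boolean correctCheck(float wordSpacing) {
        return Float.isNaN(wordSpacing);
    }

    public static void main(String[] args) {
        float unset = Float.NaN;
        float set = 1.5f;

        System.out.println("brokenCheck(NaN)   = " + brokenCheck(unset));
        System.out.println("correctCheck(NaN)  = " + correctCheck(unset));
        System.out.println("brokenCheck(1.5)   = " + brokenCheck(set));
        System.out.println("correctCheck(1.5)  = " + correctCheck(set));
    }
}
```

So any branch guarded by the == form has been dead code until now, and switching to Float.isNaN makes it reachable for the first time, which is why I expect the output to change for some files.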

Tilman

Am 03.04.2015 um 14:35 schrieb [email protected]:
> All,
>   What do we think?
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
>
>     CommonCrawl currently has the WET format that extracts plain text
>     from web pages.  My guess is that this is text stripping from
>     text-y formats.  Let me know if I'm wrong!
>
>     Would there be any interest in adding another format: WETT
>     (WET-Tika) or supplementing the current WET by using Tika to
>     extract contents from binary formats too: PDF, MSWord, etc.?
>
>     Julien Nioche kindly carved out 220 GB for us to experiment with
>     on TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on
>     a Rackspace vm.  But, I'm wondering now if it would make more
>     sense to have CommonCrawl run Tika as part of its regular
>     process and make the output available in one of your standard
>     formats.
>
>     CommonCrawl consumers would get Tika output, and the Tika dev
>     community (including its dependencies, PDFBox, POI, etc.) could
>     get the stacktraces to help prioritize bug fixes.
>
>     Cheers,
>
>               Tim
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]


