Hi Julien,
We're just beginning to scratch the surface. There's much to learn from this
set. Apologies for my delay, and thank you!

These proportions line up pretty closely with those in your blog post
(http://digitalpebble.blogspot.com/2014/11/generating-test-corpus-for-apache-tika.html)
 

Total files: 2,135,515

Detected content types:
DETECTED_CONTENT_TYPE (by TIKA 1.8-rc2)      COUNT
image/jpeg                                 857,625
application/pdf                            320,443
text/plain; charset=ISO-8859-1             276,152
image/png                                  184,855
text/plain; charset=windows-1252           164,327
image/gif                                   51,809
text/plain; charset=UTF-8                   44,766
audio/x-wav                                 34,402
application/octet-stream                    28,586
message/rfc822                              18,231
text/html; charset=ISO-8859-1               17,528
application/xhtml+xml; charset=UTF-8        16,845
application/zip                             14,385
text/html; charset=UTF-8                     9,626
audio/mpeg                                   8,670
text/html; charset=windows-1252              7,818
application/msword                           7,782
application/x-archive                        5,970
application/x-bibtex-text-file               5,274
application/xml                              5,234
image/vnd.djvu                               5,063
application/rss+xml                          4,726
application/gzip                             4,443
application/xhtml+xml; charset=ISO-8859-1    4,228
application/epub+zip                         3,458
image/tiff                                   2,980
image/jp2                                    2,706
application/rtf                              1,622
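
For anyone who wants to turn the counts above into percentages for a
side-by-side comparison, here is a minimal sketch in Python. Only the top
five rows are copied in, and the helper name `shares` is purely illustrative
(it is not part of Tika or tika-eval).

```python
# Counts copied from the table above (top five detected types only).
TOTAL_FILES = 2_135_515
counts = {
    "image/jpeg": 857_625,
    "application/pdf": 320_443,
    "text/plain; charset=ISO-8859-1": 276_152,
    "image/png": 184_855,
    "text/plain; charset=windows-1252": 164_327,
}


def shares(counts, total):
    """Return each type's percentage share of the total file count."""
    return {mime: 100.0 * n / total for mime, n in counts.items()}


# Print one aligned line per content type.
for mime, pct in shares(counts, TOTAL_FILES).items():
    print(f"{mime:<36}{pct:6.2f}%")
```

The remaining rows can be added to `counts` in the same way if a full
breakdown is needed.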
        
 

________________________________________
From: Julien Nioche <[email protected]>
Sent: Tuesday, April 14, 2015 9:24 AM
To: [email protected]
Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2

Hi Tim

Great to hear that you managed to use the dataset from CommonCrawl. Thanks!

Julien

On 14 April 2015 at 14:15, Allison, Timothy B. <[email protected]> wrote:

> +1
>
> Thank you, Tyler!
>
> Apologies to Hong-Thai and community for not recognizing the severity of
> TIKA-1600 when I voted in favor of rc1!
>
> Details...
>
> I reran against govdocs1, and there aren't any major surprises.
>
> On our Rackspace VM, I _finally_ unzipped the Common Crawl slice that
> Julien Nioche created for us, and I ran against that as well.  That turned
> up TIKA-1605 and another exceedingly rare NPE in the PDFParser.  I don't
> think either of these is a blocker, and they're now fixed in trunk.
>
> There are slightly fewer metadata values for some jpegs.  For the one file
> that I manually reviewed, 1.8-rc was missing these values (that were
> available in 1.7):
>
> JPEG quality
> IPTC-NAA record
> Plug-in 1 Data
>
> Comparison reports are available here (much more work remains to be done
> on tika-eval):
>
> https://github.com/tballison/share/tree/master/tika_comparisons
>
> ________________________________________
> From: Tyler Palsulich <[email protected]>
> Sent: Monday, April 13, 2015 1:56 PM
> To: [email protected]; [email protected]
> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>
> Hi Folks,
>
> A candidate for the Tika 1.8 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>
> The SHA1 checksum of the archive is
>   5e22fee9079370398472e59082d171ae2d7fdd31.
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachetika-1009
>
> Please vote on releasing this package as Apache Tika 1.8. The vote is open
> for the next 72 hours and passes if a majority of at least three +1 Tika
> PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.8
> [ ] ±0 I don't object to this release, but I haven't checked it
> [ ] -1 Do not release this package because...
>
> Thanks,
> Tyler
>



--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
