Hi Julien,

We're just beginning to scratch the surface. There's much to learn from this set. Apologies for my delay, and thank you!
These proportions line up pretty closely with your blog post (http://digitalpebble.blogspot.com/2014/11/generating-test-corpus-for-apache-tika.html).

Total files: 2,135,515

Detected content types:

DETECTED_CONTENT_TYPE (by TIKA 1.8-rc2)      COUNT
image/jpeg                                 857,625
application/pdf                            320,443
text/plain; charset=ISO-8859-1             276,152
image/png                                  184,855
text/plain; charset=windows-1252           164,327
image/gif                                   51,809
text/plain; charset=UTF-8                   44,766
audio/x-wav                                 34,402
application/octet-stream                    28,586
message/rfc822                              18,231
text/html; charset=ISO-8859-1               17,528
application/xhtml+xml; charset=UTF-8        16,845
application/zip                             14,385
text/html; charset=UTF-8                     9,626
audio/mpeg                                   8,670
text/html; charset=windows-1252              7,818
application/msword                           7,782
application/x-archive                        5,970
application/x-bibtex-text-file               5,274
application/xml                              5,234
image/vnd.djvu                               5,063
application/rss+xml                          4,726
application/gzip                             4,443
application/xhtml+xml; charset=ISO-8859-1    4,228
application/epub+zip                         3,458
image/tiff                                   2,980
image/jp2                                    2,706
application/rtf                              1,622

________________________________________
From: Julien Nioche <[email protected]>
Sent: Tuesday, April 14, 2015 9:24 AM
To: [email protected]
Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2

Hi Tim

Great to hear that you managed to use the dataset from CommonCrawl.

Thanks!

Julien

On 14 April 2015 at 14:15, Allison, Timothy B. <[email protected]> wrote:

> +1
>
> Thank you, Tyler!
>
> Apologies to Hong-Thai and community for not recognizing the severity of
> TIKA-1600 when I voted in favor of rc1!
>
> Details...
>
> I reran against govdocs1, and there aren't any major surprises.
>
> On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that
> Julien Nioche created for us, and I ran against that as well. That turned
> up TIKA-1605 and another exceedingly rare NPE in the PDFParser. I don't
> think either of these are blockers, and they're now fixed in trunk.
>
> There are slightly fewer metadata values for some jpegs. For the one file
> that I manually reviewed, 1.8-rc was missing these values (that were
> available in 1.7):
>
> JPEG quality
> IPTC-NAA record
> Plug-in 1 Data
>
> Comparison reports are available here (much more work remains to be done
> on tika-eval):
>
> https://github.com/tballison/share/tree/master/tika_comparisons
>
> ________________________________________
> From: Tyler Palsulich <[email protected]>
> Sent: Monday, April 13, 2015 1:56 PM
> To: [email protected]; [email protected]
> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>
> Hi Folks,
>
> A candidate for the Tika 1.8 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>
> The SHA1 checksum of the archive is
> 5e22fee9079370398472e59082d171ae2d7fdd31.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1009
>
> Please vote on releasing this package as Apache Tika 1.8. The vote is open
> for the next 72 hours and passes if a majority of at least three +1 Tika
> PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.8
> [ ] ±0 I don't object to this release, but I haven't checked it
> [ ] -1 Do not release this package because...
>
> Thanks,
> Tyler
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
