Just finished the run against ~2.8 million docs (4.8 million including 
attachments) from a combination of govdocs1 and Common Crawl.  I compared 1.9 
with trunk.

Most of it looks good.

Some highlights:
* Thanks to Andrew Jackson and TIKA-1678, we're now getting better metadata out 
of ~1300 of the 550k PDFs. This appears to be far more common in Common Crawl 
PDFs than in govdocs1 PDFs.
* No significant changes found in the handful of msg files...I wanted to check 
after the work on TIKA-1238.
* Thanks to Andreas Beeker and TIKA-1046/POI 54332, there are far fewer PPT 
exceptions.
* A few more files in Common Crawl are now incorrectly identified as RFC 
rather than text (TIKA-1602), but this is a tiny handful (a total of 4 
documents across both CC and govdocs1).

A regret:
This run used the digesting parser for both container and embedded files.  This 
causes some truncated (=corrupt) package files to throw an exception earlier 
than they otherwise would.  The opposite happens, too (more embedded files when 
using the digester), but this is extremely rare.  As a result, among truncated 
gz, x-xz and x-archive files, many more have fewer attachments in Tika 
1.10-SNAPSHOT than in Tika 1.9.
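
For anyone who wants to reproduce this kind of tally, here is a minimal sketch of how per-container attachment counts from two runs might be compared. The function name, file names and counts are all made up for illustration; this is not the code or the data used for the run above.

```python
from collections import Counter


def compare_attachment_counts(old_counts, new_counts):
    """Tally containers whose embedded-file count went down, up, or stayed the same.

    Both arguments map a container path to the number of embedded files
    extracted from it. Only containers present in both runs are compared.
    """
    tally = Counter()
    for path in old_counts.keys() & new_counts.keys():
        old, new = old_counts[path], new_counts[path]
        if new < old:
            tally["fewer"] += 1
        elif new > old:
            tally["more"] += 1
        else:
            tally["same"] += 1
    return tally


# Hypothetical per-container counts, NOT real results from either run.
run_1_9 = {"a.gz": 1, "b.tar": 12, "c.xz": 1, "d.zip": 4}
run_trunk = {"a.gz": 0, "b.tar": 7, "c.xz": 1, "d.zip": 5}

print(compare_attachment_counts(run_1_9, run_trunk))
```

Filtering the input to files known to be truncated would separate the digester effect from genuine parser changes.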

With Konstantin's and Bob's fix of TIKA-1524, I think we're in good shape for 
1.10...from my perspective.

             Best,

                       Tim
-----Original Message-----
From: David Meikle [mailto:[email protected]] 
Sent: Sunday, July 26, 2015 10:50 AM
To: [email protected]
Subject: Re: release Tika 1.10?


> On 23 Jul 2015, at 14:07, Allison, Timothy B. <[email protected]> wrote:
> 
>  With the fix of TIKA-1690, I think it makes sense to roll a new release 
> (1.10) in the next week or so.  I'd like to get TIKA-1667 (upgrade poi) in 
> before the release.  Are there any other blockers on 1.10?

+1 from me too.  As discussed on private, I will roll the release on Tuesday 
night (UK Time) to give people time to shout for other candidates.

Cheers,
Dave
