Just finished the run against ~2.8 million docs (4.8 million including
attachments) from a combination of govdocs1 and Common Crawl. I compared 1.9
with trunk.
Most of it looks good.
Some highlights:
* Thanks to Andrew Jackson and TIKA-1678, we're now getting better metadata out
of ~1,300 of the 550k PDFs. This appears to be far more common in Common Crawl
PDFs than in govdocs1 PDFs.
* No significant changes found in the handful of msg files. I wanted to check
after the work on TIKA-1238.
* Thanks to Andreas Beeker and TIKA-1046/POI 54332, there are far fewer PPT
exceptions.
* A few more files in Common Crawl are now incorrectly identified as RFC vs
text (TIKA-1602), but this is a tiny handful (4 documents in total across CC
and govdocs1).
A regret:
This run used the digesting parser for both container and embedded files. This
causes some truncated (=corrupt) package files to throw an exception earlier
than they otherwise would. The opposite happens, too (more embedded files when
using the digester), but this is extremely rare. As a result, many more
truncated gz, x-xz and x-archive files have fewer attachments in Tika
1.10-SNAPSHOT than in Tika 1.9.
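To illustrate the effect, here is a rough Python analogy (not Tika code, and not
how the digesting parser is actually implemented). It assumes the digest step
has to read the whole stream before any content is handed on, whereas a plain
streaming read keeps whatever arrived before the truncation point. The function
names (make_truncated_gzip, stream_only, digest_first) are made up for this
sketch.

```python
import gzip
import hashlib
import io


def make_truncated_gzip(n_lines=2000, keep_fraction=0.5):
    """Build a gzip blob in memory and cut it off part-way (a 'truncated' file)."""
    raw = b"".join(b"line %d\n" % i for i in range(n_lines))
    blob = gzip.compress(raw)
    return blob[: int(len(blob) * keep_fraction)]


def stream_only(blob, chunk=256):
    """Read in small chunks and keep what was read before the stream breaks."""
    recovered = 0
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(blob)) as gz:
            while True:
                data = gz.read(chunk)
                if not data:
                    break
                recovered += len(data)
    except EOFError:
        # Truncation is hit only after the earlier chunks were already consumed.
        pass
    return recovered


def digest_first(blob):
    """Read the whole stream up front to digest it, then hand the content on."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as gz:
        payload = gz.read()  # raises EOFError on a truncated stream
    return hashlib.md5(payload).hexdigest(), len(payload)


if __name__ == "__main__":
    blob = make_truncated_gzip()
    print("streaming read recovered bytes:", stream_only(blob))
    try:
        digest_first(blob)
    except EOFError as exc:
        print("digest-first read failed before yielding anything:", exc)
```

In this sketch the streaming reader recovers part of the content, while the
digest-first reader fails with nothing to show for it, which is the same shape
as the "fewer attachments" difference above.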
With Konstantin's and Bob's fix of TIKA-1524, I think we're in good shape for
1.10, at least from my perspective.
Best,
Tim
-----Original Message-----
From: David Meikle [mailto:[email protected]]
Sent: Sunday, July 26, 2015 10:50 AM
To: [email protected]
Subject: Re: release Tika 1.10?
> On 23 Jul 2015, at 14:07, Allison, Timothy B. <[email protected]> wrote:
>
> With the fix of TIKA-1690, I think it makes sense to roll a new release
> (1.10) in the next week or so. I'd like to get TIKA-1667 (upgrade poi) in
> before the release. Are there any other blockers on 1.10?
+1 from me too. As discussed on private, I will roll the release on Tuesday
night (UK Time) to give people time to shout for other candidates.
Cheers,
Dave