The reports comparing 2.6.0 with 2.7.0-prerc1 are here:
https://corpora.tika.apache.org/base/reports/tika-2.6.0-vs-2.7.0-prerc1-reports.tgz
Some observations:
* Far fewer "common words" in svg files, because these files are now
correctly identified as svg+xml and parsed by the xml parser. We're no
longer treating them as text and including all the tags. There are a
couple of handfuls of "now svg" files that are causing exceptions in
the xml parser. Overall, I think this diff from 2.6.0 is good.
* Our change to the charset detector brings some improvements and some
regressions. Overall, I still think we made the right call.
* Surprisingly, I don't see many diffs in the number of attachments in
rfc822 files. I thought there would be more.
I'll start the release process now. Please do take a look and let me
know if you see any issues. I'm happy to respin an rc2 if necessary.
Thank you, all!
Cheers,
Tim
On Mon, Jan 30, 2023 at 11:14 AM Tim Allison <[email protected]> wrote:
>
> All,
> After I fix TIKA-3962, I'll start the regression tests in
> preparation for a 2.7.0 release. Please let me know if there are any
> blockers or if you're working on something that you want to get into
> the next release.
> Thank you!
>
> Best,
>
> Tim
>
> On Thu, Jan 19, 2023 at 10:15 AM Tim Allison <[email protected]> wrote:
> >
> > All,
> > I'm thinking we should cut a release in the next week or so. I can
> > start the regression tests next week (possibly late in the week). I
> > think that the changes move us into the "minor" version update, so
> > 2.7.0.
> > WDYT? Are there any imminent releases of our dependencies that we
> > should wait for? Anything else we'd want to get into the next
> > release?
> > Thank you!
> >
> > Best,
> >
> > Tim