Yes. I've looked at a few others, and this is exactly what's going on.

On Wed, Feb 1, 2023 at 2:20 PM Tim Allison <[email protected]> wrote:
>
> Hi Tilman,
>
> Thank you for raising this. I noticed this, looked at a few and
> then failed to document what this diff means. Sorry!
>
> You're right that this has to do with TIKA-3962. The issue is that
> we are now correctly handling attachments within emls as attachments
> rather than inlining the contents. So this means that the main email
> will have less content, but the content still should show up in an
> attachment. The challenge from an eval perspective is that there is
> no attachment in 2.6.0 to which to map the new attachment in
> 2.7.0-prerc1.
>
> I attached the json output for 2.6.0 and 2.7.0 on
> https://issues.apache.org/jira/browse/TIKA-3962. It looks, btw, like
> we fixed TIKA-2680 while we were at it. :D
>
> I'm going to look at a few more files. If I find any problems, I'll
> cancel the vote.
>
> Thank you, again.
>
> Best,
>
> Tim
>
> On Tue, Jan 31, 2023 at 10:46 PM Tilman Hausherr <[email protected]>
> wrote:
> >
> > There is a block of "message/rfc822" files where TOP_10_MORE_IN_A has
> > meaningful words, but TOP_10_MORE_IN_B is empty:
> >
> > bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip-1.mbox
> > bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip
> > bug_trackers/MOZILLA/153480-240296/MOZILLA-207156-4.zip-1.mbox
> > commoncrawl3/V7/V73N7J3RSMYSQ7N5SEWKUOUCTSTJQCZM
> > commoncrawl3/XD/XD7LX2GJWA7GZTCPKC3XYPJ5WYHWMCW2
> > bug_trackers/TIKA/TIKA-2680-1.eml
> > commoncrawl3/FH/FHAPPENOGJUVCIBTEFHDVYKXJAYEE77O
> > govdocs1/446/446030.tmp
> > govdocs1/330/330112.tmp
> > govdocs1/994/994741.tmp
> > bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> > bug_trackers/MOZILLA/1240554-1312466/MOZILLA-1261295-0.zip
> > bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> > commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
> > commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
> >
> >
> > I tested the one from
> > https://issues.apache.org/jira/browse/TIKA-2680
> > and I did get text results, so I'm wondering what the problem is, or if
> > there is any problem at all. Or is this related to the changes in
> > TIKA-3962?
> >
> > Tilman
> >
> > On 31.01.2023 18:40, Tim Allison wrote:
> > > The reports comparing 2.6.0 with 2.7.0-prerc1 are here:
> > > https://corpora.tika.apache.org/base/reports/tika-2.6.0-vs-2.7.0-prerc1-reports.tgz
> > >
> > > Some observations:
> > > * Many fewer "common words" in svg files because they are now
> > > correctly identified as svg+xml files and getting parsed by the xml
> > > parser. We're no longer treating these as text and including all the
> > > tags. There are a couple of handfuls of "now svg" files that are
> > > causing exceptions in the xml parser. Overall, I think this diff from
> > > 2.6.0 is good.
> > > * Our change in the charset detector has some improvements and some
> > > regressions. Overall, I still think we made the right call.
> > > * Surprisingly, I don't see many diffs in the number of attachments in
> > > rfc822 files. I thought there would be more.
> > >
> > > I'll start the release process now. Please do take a look and let me
> > > know if you see any issues. I'm happy to respin an rc2 if necessary.
> > >
> > > Thank you, all!
> > >
> > > Cheers,
> > >
> > > Tim
> > >
> > > On Mon, Jan 30, 2023 at 11:14 AM Tim Allison <[email protected]> wrote:
> > >> All,
> > >> After I fix TIKA-3962, I'll start the regression tests in
> > >> preparation for a 2.7.0 release. Please let me know if there are any
> > >> blockers or if you're working on something that you want to get into
> > >> the next release.
> > >> Thank you!
> > >>
> > >> Best,
> > >>
> > >> Tim
> > >>
> > >> On Thu, Jan 19, 2023 at 10:15 AM Tim Allison <[email protected]> wrote:
> > >>> All,
> > >>> I'm thinking we should cut a release in the next week or so. I can
> > >>> start the regression tests next week (possibly late in the week). I
> > >>> think that the changes move us into the "minor" version update, so
> > >>> 2.7.0.
> > >>> WDYT? Are there any imminent releases of our dependencies that we
> > >>> should wait for? Anything else we'd want to get into the next
> > >>> release?
> > >>> Thank you!
> > >>>
> > >>> Best,
> > >>>
> > >>> Tim
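
The TIKA-3962 explanation quoted above (content moving out of the main email and into an attachment) could be spot-checked on a single file with something like the sketch below. It is only an illustration, not the method used for the eval reports. It assumes each version's JSON output is a list of metadata objects with the container document first, and that extracted text sits under an "X-TIKA:content" key. The two JSON strings are synthetic stand-ins, not the real files attached to the JIRA issue, and summarize() is a hypothetical helper.

```python
import json
import re

# Assumption: extracted text is stored under this key in each metadata object.
CONTENT_KEY = "X-TIKA:content"


def token_count(text):
    """Count word-like tokens; a missing or empty content field counts as 0."""
    return len(re.findall(r"\w+", text or ""))


def summarize(metadata_list):
    """Return (container_tokens, embedded_tokens, embedded_docs).

    Assumes the first object describes the container (the main email) and
    every later object describes an embedded document (an attachment).
    """
    if not metadata_list:
        return (0, 0, 0)
    container = token_count(metadata_list[0].get(CONTENT_KEY))
    embedded = sum(token_count(m.get(CONTENT_KEY)) for m in metadata_list[1:])
    return (container, embedded, len(metadata_list) - 1)


# Synthetic stand-ins: the "old" output inlines the attachment text in the
# main email, the "new" output reports it as a separate embedded document.
old_json = '[{"X-TIKA:content": "main body text attached report text"}]'
new_json = (
    '[{"X-TIKA:content": "main body text"},'
    ' {"X-TIKA:content": "attached report text"}]'
)

old = summarize(json.loads(old_json))
new = summarize(json.loads(new_json))
print("old:", old)
print("new:", new)
```

If the change behaves as described, the container token count drops in the newer output while the container-plus-embedded total stays roughly the same. That would be consistent with TOP_10_MORE_IN_B being empty for the main email even though no text was lost.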
