Hi Tilman,

  Thank you for raising this.  I noticed this block, looked at a few of
the files, and then failed to document what the diff means.  Sorry!

   You're right that this has to do with TIKA-3962.  The issue is that
we are now correctly handling attachments within emls as attachments
rather than inlining their contents.  This means that the main email
will have less content, but that content should still show up in an
attachment.  The challenge from an eval perspective is that there is
no attachment in 2.6.0 to which to map the new attachment in
2.7.0-prerc1, so the moved text shows up only as content missing from
the main email.

  I attached the json output for 2.6.0 and 2.7.0 on
https://issues.apache.org/jira/browse/TIKA-3962.  It looks, btw, like
we fixed TIKA-2680 while we were at it. :D
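
  For anyone who wants to check the two json outputs by hand, here is a
rough Python sketch of the comparison.  It assumes each output is a json
list of metadata objects (container first, then attachments) with the
extracted text under "X-TIKA:content"; the function names are made up
for illustration and this is not how tika-eval does it.

```python
from collections import Counter

# Assumed key for the extracted text in each metadata object.
CONTENT_KEY = "X-TIKA:content"


def word_counts(text):
    """Count whitespace-separated, lowercased tokens in a string."""
    return Counter(text.lower().split())


def summarize(metadata_list):
    """Return (container word counts, word counts over all documents).

    Assumes the first entry is the container (the main email) and the
    remaining entries are its attachments.
    """
    container = Counter()
    total = Counter()
    for i, md in enumerate(metadata_list):
        counts = word_counts(md.get(CONTENT_KEY) or "")
        if i == 0:
            container = counts
        total += counts
    return container, total


def moved_words(old_list, new_list):
    """Compare an old and a new extract of the same file.

    Returns the words that left the container and the words that are
    missing even when attachments are included.
    """
    old_container, old_total = summarize(old_list)
    new_container, new_total = summarize(new_list)
    # Counter subtraction keeps only positive differences.
    missing_from_container = old_container - new_container
    missing_overall = old_total - new_total
    return missing_from_container, missing_overall


if __name__ == "__main__":
    # Toy data standing in for the real 2.6.0 and 2.7.0 json outputs.
    old = [{CONTENT_KEY: "hello body attached text"}]
    new = [{CONTENT_KEY: "hello body"}, {CONTENT_KEY: "attached text"}]
    from_container, overall = moved_words(old, new)
    print("left the main email:", dict(from_container))
    print("missing overall:", dict(overall))
```

  If the second result is empty, the text only moved from the main
email into an attachment and nothing was actually lost.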

  I'm going to look at a few more files.  If I find any problems, I'll
cancel the vote.

Thank you, again.

           Best,

                     Tim

On Tue, Jan 31, 2023 at 10:46 PM Tilman Hausherr <[email protected]> wrote:
>
> There is a block of "message/rfc822" files where TOP_10_MORE_IN_A has
> meaningful words, but TOP_10_MORE_IN_B is empty:
>
> bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip-1.mbox
> bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip
> bug_trackers/MOZILLA/153480-240296/MOZILLA-207156-4.zip-1.mbox
> commoncrawl3/V7/V73N7J3RSMYSQ7N5SEWKUOUCTSTJQCZM
> commoncrawl3/XD/XD7LX2GJWA7GZTCPKC3XYPJ5WYHWMCW2
> bug_trackers/TIKA/TIKA-2680-1.eml
> commoncrawl3/FH/FHAPPENOGJUVCIBTEFHDVYKXJAYEE77O
> govdocs1/446/446030.tmp
> govdocs1/330/330112.tmp
> govdocs1/994/994741.tmp
> bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> bug_trackers/MOZILLA/1240554-1312466/MOZILLA-1261295-0.zip
> bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
> commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
>
>
> I tested the one from https://issues.apache.org/jira/browse/TIKA-2680
> and I did get text results, so I'm wondering what the problem is, or if
> there is any problem at all. Or is this related to the changes in
> TIKA-3962?
>
> Tilman
>
> On 31.01.2023 18:40, Tim Allison wrote:
> > The reports comparing 2.6.0 with 2.7.0-prerc1 are here:
> > https://corpora.tika.apache.org/base/reports/tika-2.6.0-vs-2.7.0-prerc1-reports.tgz
> >
> > Some observations:
> > * Many fewer "common words" in svg files because they are now
> > correctly identified as svg+xml files and getting parsed by the xml
> > parser.  We're no longer treating these as text and including all the
> > tags.  There are a couple of handfuls of "now svg" files that are
> > causing exceptions in the xml parser. Overall, I think this diff from
> > 2.6.0 is good.
> > * Our change in the charset detector has some improvements and some
> > regressions.  Overall, I still think we made the right call.
> > * Surprisingly, I don't see many diffs in the number of attachments in
> > rfc822 files.  I thought there would be more.
> >
> > I'll start the release process now.  Please do take a look and let me
> > know if you see any issues.  I'm happy to respin an rc2 if necessary.
> >
> > Thank you, all!
> >
> > Cheers,
> >
> >              Tim
> >
> > On Mon, Jan 30, 2023 at 11:14 AM Tim Allison<[email protected]>  wrote:
> >> All,
> >>    After I fix TIKA-3962, I'll start the regression tests in
> >> preparation for a 2.7.0 release.  Please let me know if there are any
> >> blockers or if you're working on something that you want to get into
> >> the next release.
> >>    Thank you!
> >>
> >>       Best,
> >>
> >>           Tim
> >>
> >> On Thu, Jan 19, 2023 at 10:15 AM Tim Allison<[email protected]>  wrote:
> >>> All,
> >>>    I'm thinking we should cut a release in the next week or so.  I can
> >>> start the regression tests next week (possibly late in the week).  I
> >>> think that the changes move us into the "minor" version update, so
> >>> 2.7.0.
> >>>    WDYT?  Are there any imminent releases of our dependencies that we
> >>> should wait for?  Anything else we'd want to get into the next
> >>> release?
> >>>    Thank you!
> >>>
> >>>       Best,
> >>>
> >>>               Tim
>
