Yes.  I've looked at a few others, and this is exactly what's going on.
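
For anyone who wants to check this on their own files, here is a minimal sketch, not the tika-eval implementation. It assumes the JSON output is a list of metadata objects, one for the main email and one per attachment, with the extracted text under the X-TIKA:content key. The function names and the toy data are hypothetical. The idea is to count words across the main email plus all of its attachments, so that text which moved from the body into an attachment does not show up as lost.

```python
import re
from collections import Counter

# Assumed key for extracted text in each metadata object of the JSON output.
CONTENT_KEY = "X-TIKA:content"


def word_counts(metadata_list):
    """Count words across the main document and all of its attachments."""
    counts = Counter()
    for metadata in metadata_list:
        text = metadata.get(CONTENT_KEY) or ""
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def words_lost(old_list, new_list):
    """Words that occur more often in the old output than in the new one."""
    return word_counts(old_list) - word_counts(new_list)


# Hypothetical toy data: the old output inlines the attachment text,
# the new output reports it as a separate attachment entry.
old_output = [{CONTENT_KEY: "main body text attached report text"}]
new_output = [
    {CONTENT_KEY: "main body text"},
    {CONTENT_KEY: "attached report text"},
]

lost = words_lost(old_output, new_output)
print(dict(lost))
```

If the result is empty, the content only moved into an attachment. Anything left over is text that was dropped and would be worth a closer look.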

On Wed, Feb 1, 2023 at 2:20 PM Tim Allison <[email protected]> wrote:
>
> Hi Tilman,
>
>   Thank you for raising this.  I noticed this, looked at a few and
> then failed to document what this diff means.  Sorry!
>
>    You're right that this has to do with TIKA-3962.  The issue is that
> we are now correctly handling attachments within emls as attachments
> rather than inlining the contents.  So this means that the main email
> will have less content, but the content still should show up in an
> attachment.  The challenge from an eval perspective is that there is
> no attachment in 2.6.0 to which to map the new attachment in
> 2.7.0-prerc1.
>
>   I attached the json output for 2.6.0 and 2.7.0 on
> https://issues.apache.org/jira/browse/TIKA-3962.  It looks, btw, like
> we fixed TIKA-2680 while we were at it. :D
>
>   I'm going to look at a few more files.  If I find any problems, I'll
> cancel the vote.
>
> Thank you, again.
>
>            Best,
>
>                      Tim
>
> On Tue, Jan 31, 2023 at 10:46 PM Tilman Hausherr <[email protected]> wrote:
> >
> > There is a block of "message/rfc822" files where TOP_10_MORE_IN_A has
> > meaningful words, but TOP_10_MORE_IN_B is empty:
> >
> > bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip-1.mbox
> > bug_trackers/MOZILLA/1623669-1673165/MOZILLA-1633982-0.zip
> > bug_trackers/MOZILLA/153480-240296/MOZILLA-207156-4.zip-1.mbox
> > commoncrawl3/V7/V73N7J3RSMYSQ7N5SEWKUOUCTSTJQCZM
> > commoncrawl3/XD/XD7LX2GJWA7GZTCPKC3XYPJ5WYHWMCW2
> > bug_trackers/TIKA/TIKA-2680-1.eml
> > commoncrawl3/FH/FHAPPENOGJUVCIBTEFHDVYKXJAYEE77O
> > govdocs1/446/446030.tmp
> > govdocs1/330/330112.tmp
> > govdocs1/994/994741.tmp
> > bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> > bug_trackers/MOZILLA/1240554-1312466/MOZILLA-1261295-0.zip
> > bug_trackers/MOZILLA/479198-507575/MOZILLA-505221-1.zip
> > commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
> > commoncrawl3/3N/3NI3JJHBV2QG4DERKNNWMQCL3AJHRCCW
> >
> >
> > I tested the one from https://issues.apache.org/jira/browse/TIKA-2680
> > and I did get text results, so I'm wondering what the problem is, or if
> > there is any problem at all. Or is this related to the changes in
> > TIKA-3962?
> >
> > Tilman
> >
> > On 31.01.2023 18:40, Tim Allison wrote:
> > > The reports comparing 2.6.0 with 2.7.0-prerc1 are here:
> > > https://corpora.tika.apache.org/base/reports/tika-2.6.0-vs-2.7.0-prerc1-reports.tgz
> > >
> > > Some observations:
> > > * Many fewer "common words" in svg files because they are now
> > > correctly identified as svg+xml files and getting parsed by the xml
> > > parser.  We're no longer treating these as text and including all the
> > > tags.  There are a couple of handfuls of "now svg" files that are
> > > causing exceptions in the xml parser. Overall, I think this diff from
> > > 2.6.0 is good.
> > > * Our change in the charset detector has some improvements and some
> > > regressions.  Overall, I still think we made the right call.
> > > * Surprisingly, I don't see many diffs in the number of attachments in
> > > rfc822 files.  I thought there would be more.
> > >
> > > I'll start the release process now.  Please do take a look and let me
> > > know if you see any issues.  I'm happy to respin an rc2 if necessary.
> > >
> > > Thank you, all!
> > >
> > > Cheers,
> > >
> > >              Tim
> > >
> > > > On Mon, Jan 30, 2023 at 11:14 AM Tim Allison <[email protected]> wrote:
> > >> All,
> > >>    After I fix TIKA-3962, I'll start the regression tests in
> > >> preparation for a 2.7.0 release.  Please let me know if there are any
> > >> blockers or if you're working on something that you want to get into
> > >> the next release.
> > >>    Thank you!
> > >>
> > >>       Best,
> > >>
> > >>           Tim
> > >>
> > > >> On Thu, Jan 19, 2023 at 10:15 AM Tim Allison <[email protected]> wrote:
> > >>> All,
> > >>>    I'm thinking we should cut a release in the next week or so.  I can
> > >>> start the regression tests next week (possibly late in the week).  I
> > >>> think that the changes move us into the "minor" version update, so
> > >>> 2.7.0.
> > >>>    WDYT?  Are there any imminent releases of our dependencies that we
> > >>> should wait for?  Anything else we'd want to get into the next
> > >>> release?
> > >>>    Thank you!
> > >>>
> > >>>       Best,
> > >>>
> > >>>               Tim
> >
