Re: preliminary regression results from 2.4.0

Tim Allison Wed, 27 Apr 2022 12:07:39 -0700

Yes, I think this is an improvement: the earlier version of Tika
identified it as xhtml, and it is now correctly being parsed by the
rfc822 parser... and yes, it is broken.

There were a number of other files that are now correctly identified
as http-response, but we're getting less text because the files are
truncated and the http-response parser is throwing an exception.

On Wed, Apr 27, 2022 at 2:59 PM Tilman Hausherr <[email protected]> wrote:
>
> On 27.04.2022 at 14:00, Tim Allison wrote:
> > Once we fix the ppt issue, I'll rerun the regression tests.  Please
> > let me know if you see anything else.
>
> commoncrawl3/5Y/5YX5CR7P7FVPZIMTBBPGQU5FULLMJOXM
>
> has lost a bit of extracted text, but that "mail" is broken.
>
> Tilman
>
