Yes, I think this is an improvement: the earlier version of Tika
identified it as xhtml, and it is now correctly being parsed by the
rfc822 parser... and yes, the file is broken.

There were a number of other files that are now correctly identified
as http-response, but we're getting less text because the files are
truncated and the http-response parser is throwing an exception.

On Wed, Apr 27, 2022 at 2:59 PM Tilman Hausherr <[email protected]> wrote:
>
> On 27.04.2022 at 14:00, Tim Allison wrote:
> > Once we fix the ppt issue, I'll rerun the regression tests.  Please
> > let me know if you see anything else.
>
> commoncrawl3/5Y/5YX5CR7P7FVPZIMTBBPGQU5FULLMJOXM
>
> has lost a bit of extracted text, but that "mail" is broken.
>
> Tilman
>
