Yes, I think this is an improvement: the earlier version of Tika identified it as xhtml, and it is now correctly being parsed by the rfc822 parser... and yes, it is broken.
A number of other files are now correctly identified as http-response, but we're getting less text from them because the files are truncated and the http-response parser is throwing an exception.

On Wed, Apr 27, 2022 at 2:59 PM Tilman Hausherr <[email protected]> wrote:
>
> On 27.04.2022 at 14:00, Tim Allison wrote:
> > Once we fix the ppt issue, I'll rerun the regression tests. Please
> > let me know if you see anything else.
>
> commoncrawl3/5Y/5YX5CR7P7FVPZIMTBBPGQU5FULLMJOXM
>
> has lost a bit of extracted text, but that "mail" is broken.
>
> Tilman
>
