Ok, might have spoken too soon. Looks like the list of media types was truncated by IntelliJ and I had to click at the bottom to see all of them. Still trying to understand some of the weird things I'm seeing. It looks like I'm getting a combination of OCR being done on the page as well as text extraction, but since the characters are unmapped, the extracted text looks like garbage. Not sure why I'm seeing this.
Peter Kronenberg | Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

From: Peter Kronenberg
Sent: Friday, December 3, 2021 9:57 AM
To: <[email protected]> <[email protected]>; [email protected]
Subject: Missing image/ocr-png type

I'm having trouble parsing a PDF file. It's got a lot of weird things about it, but unfortunately, I can't share it. But I'm hoping it will be simple.

Here's a debugging screenshot from AbstractPDF2XHTML::doOCROnCurrentPage(). The ocrImageMediaType is image/ocr-png. But that doesn't appear in the list of supportedTypes. I see where Tika builds the ocr types in TesseractOCRParser and ocr-png is in there, but I don't see how that list gets transferred to the getSupportedTypes() call that is being made on AutoDetectParser.

[cid:[email protected]]
