Should OCR have been triggered on that page? If so, and if you're not using the OCR-only strategy, you'll get both the junk text from regular extraction and the OCR output. The OCR text is wrapped in <div/> markers so you can tell which part came from OCR.
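If it helps with debugging, here is a rough sketch of how you could pull just the OCR portion out of the XHTML output to compare it against the junk text. It assumes the OCR text is wrapped in `<div class="ocr">` (please check your actual output, the class name is my assumption here), and `OcrDivSplitter` / `ocrText` are hypothetical names, not part of Tika. It only uses the JDK's built-in XML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for debugging; not part of Tika.
class OcrDivSplitter {

    // Returns the text of every <div> whose class attribute is "ocr".
    // ASSUMPTION: OCR output is marked as <div class="ocr">; adjust the
    // attribute value if your XHTML uses a different marker.
    static List<String> ocrText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // getElementsByTagName walks all descendants in document order.
        NodeList divs = doc.getElementsByTagName("div");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("ocr".equals(div.getAttribute("class"))) {
                result.add(div.getTextContent().trim());
            }
        }
        return result;
    }
}
```

You would pass in the XHTML string you get from the parse (for example, whatever your content handler's `toString()` gives you). Anything not returned by `ocrText` is the regular extracted text, which in your case would be the unmapped garbage.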
On Fri, Dec 3, 2021 at 2:06 PM Peter Kronenberg <[email protected]> wrote:
> Ok, might have spoken too soon. Looks like the list of media types was
> truncated by IntelliJ and I had to click at the bottom to see all of them.
>
> Still trying to understand some of the weird things I’m seeing. It looks
> like I’m getting a combination of OCR being done on the page as well as
> text extraction, but since the characters are unmapped, the extracted text
> looks like garbage. Not sure why I’m seeing this.
>
> *Peter Kronenberg* *|* *Senior AI Analytic ENGINEER*
> *C: 703.887.5623*
> [image: Torch AI] <http://www.torch.ai/>
> 4303 W. 119th St., Leawood, KS 66209
> WWW.TORCH.AI <http://www.torch.ai/>
>
> *From:* Peter Kronenberg
> *Sent:* Friday, December 3, 2021 9:57 AM
> *To:* <[email protected]> <[email protected]>; [email protected]
> *Subject:* Missing image/ocr-png type
>
> I’m having trouble parsing a PDF file. It’s got a lot of weird things
> about it, but unfortunately, I can’t share it. But I’m hoping it will be
> simple.
>
> Here’s a debugging screenshot from
> AbstractPDF2XHTML::doOCROnCurrentPage(). The ocrImageMediaType is
> image/ocr-png. But that doesn’t appear in the list of supportedTypes. I
> see where Tika builds the OCR types in TesseractOCRParser, and ocr-png is in
> there, but I don’t see how that list gets transferred to the
> getSupportedTypes() call that is being made on AutoDetectParser.
