Ok, might have spoken too soon.  Looks like the list of media types was
truncated by IntelliJ and I had to click at the bottom to see all of them.
Still trying to understand some of the weird things I'm seeing.  It looks like
I'm getting a combination of OCR being done on the page as well as text
extraction, but since the characters are unmapped, the extracted text comes
out as garbage.  Not sure why I'm seeing this.

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>


From: Peter Kronenberg
Sent: Friday, December 3, 2021 9:57 AM
To: <[email protected]> <[email protected]>; [email protected]
Subject: Missing image/ocr-png type

I'm having trouble parsing a PDF file.  It's got a lot of weird things about
it, and unfortunately I can't share it, but I'm hoping the answer will be
simple.

Here's a debugging screenshot from AbstractPDF2XHTML::doOCROnCurrentPage().
The ocrImageMediaType is image/ocr-png, but that doesn't appear in the list of
supportedTypes.  I see where Tika builds the OCR types in TesseractOCRParser,
and ocr-png is in there, but I don't see how that list gets transferred to the
getSupportedTypes() call that is being made on AutoDetectParser.



[Inline image: debugging screenshot]


