Should OCR have been triggered on that page?  If so, and if you're not
using the OCR-only strategy, you'll get both the junk text and the OCR text.
The OCR text will be wrapped in <div> elements so you can tell which part
came from OCR.
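
As a rough illustration of telling the two apart after parsing, here is a
minimal sketch that walks the XHTML output and separates text inside OCR divs
from everything else. It assumes the OCR output is wrapped in
<div class="ocr"> (please check the class name against your Tika version).
The sample XHTML string and the OcrDivSplitter class are made up for the
example, and only JDK XML classes are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika.
public class OcrDivSplitter {

    /** Text found inside OCR divs, and text found everywhere else. */
    static final class Split {
        final List<String> ocr = new ArrayList<>();
        final List<String> extracted = new ArrayList<>();
    }

    /** Parses well-formed XHTML and splits its text nodes by origin. */
    static Split split(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        Split result = new Split();
        walk(doc.getDocumentElement(), false, result);
        return result;
    }

    private static void walk(Node node, boolean insideOcr, Split out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                (insideOcr ? out.ocr : out.extracted).add(text);
            }
            return;
        }
        if (node instanceof Element) {
            Element element = (Element) node;
            // Assumption: OCR output is marked with class="ocr".
            if ("div".equals(element.getTagName())
                    && "ocr".equals(element.getAttribute("class"))) {
                insideOcr = true;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), insideOcr, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample: one page with unmapped "junk" text plus an OCR div.
        String xhtml = "<html><body><div class=\"page\">"
                + "<p>$%^ ~~#</p>"
                + "<div class=\"ocr\">Quarterly report</div>"
                + "</div></body></html>";
        Split split = split(xhtml);
        System.out.println("OCR text: " + split.ocr);
        System.out.println("Extracted text: " + split.extracted);
    }
}
```

If the extracted text is consistently garbage for this kind of file, keeping
only the OCR side (or switching to the OCR-only strategy) may be the simpler
option.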

On Fri, Dec 3, 2021 at 2:06 PM Peter Kronenberg <[email protected]>
wrote:

> Ok, I might have spoken too soon.  It looks like the list of media types was
> truncated by IntelliJ and I had to click at the bottom to see all of them.
>
> I'm still trying to understand some of the weird things I’m seeing.  It
> looks like I’m getting both OCR output for the page and regular text
> extraction, but since the characters are unmapped, the extracted text looks
> like garbage.  I'm not sure why I’m seeing this.
>
>
>
> *Peter Kronenberg*  *| * *Senior AI Analytic ENGINEER *
>
> *C: 703.887.5623*
>
> [image: Torch AI] <http://www.torch.ai/>
>
> 4303 W. 119th St., Leawood, KS 66209
> WWW.TORCH.AI <http://www.torch.ai/>
>
> *From:* Peter Kronenberg
> *Sent:* Friday, December 3, 2021 9:57 AM
> *To:* <[email protected]> <[email protected]>; [email protected]
> *Subject:* Missing image/ocr-png type
>
>
>
> I’m having trouble parsing a PDF file.  It has a lot of weird things
> about it, but unfortunately, I can’t share it.  I’m hoping the answer
> will be simple.
>
>
>
> Here’s a debugging screenshot from
> AbstractPDF2XHTML::doOCROnCurrentPage().  The ocrImageMediaType is
> image/ocr-png, but that doesn’t appear in the list of supportedTypes.  I
> see where Tika builds the OCR types in TesseractOCRParser, and ocr-png is in
> there, but I don’t see how that list gets transferred to the
> getSupportedTypes() call that is being made on AutoDetectParser.
>
