[
https://issues.apache.org/jira/browse/TIKA-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-4202.
-------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
> Add page count of OCR'd pages in metadata for PDF files
> -------------------------------------------------------
>
> Key: TIKA-4202
> URL: https://issues.apache.org/jira/browse/TIKA-4202
> Project: Tika
> Issue Type: New Feature
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 3.0.0
>
>
> It would be useful to record in the metadata the number of pages in a PDF
> that triggered OCR. PDFs are treated differently from other files: for PDFs,
> the default is to render the page and then run OCR "inline", whereas for other
> file formats we run OCR on embedded images, which are treated as embedded
> files. For regular files we can count how often tesseract appears as the
> parser for embedded images, but we can't do that with PDFs ... yet.
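
As a rough illustration of the counting approach described above for regular files, here is a minimal sketch in plain Java. It does not call Tika itself. It assumes each parsed document (container or embedded file) is represented as a map from metadata key to values, that the parser chain is listed under the X-TIKA:Parsed-By key, and that OCR'd images list org.apache.tika.parser.ocr.TesseractOCRParser there. The class name OcrParserCount, the method countOcrEntries, and the sample data are hypothetical and exist only for this example.

```java
import java.util.List;
import java.util.Map;

final class OcrParserCount {
    // Metadata key under which Tika lists the parsers that handled a document.
    static final String PARSED_BY = "X-TIKA:Parsed-By";
    // Class name of Tika's Tesseract-based OCR parser.
    static final String TESSERACT = "org.apache.tika.parser.ocr.TesseractOCRParser";

    // Counts the entries (container or embedded files) whose parser list
    // includes the Tesseract parser. The map-based input is an assumption
    // made for this sketch, not Tika's own Metadata type.
    static long countOcrEntries(List<Map<String, List<String>>> metadataList) {
        return metadataList.stream()
                .filter(m -> m.getOrDefault(PARSED_BY, List.of()).contains(TESSERACT))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical data: a container document with two embedded images
        // that went through OCR and one embedded file that did not.
        List<Map<String, List<String>>> docs = List.of(
                Map.of(PARSED_BY, List.of("org.apache.tika.parser.DefaultParser")),
                Map.of(PARSED_BY, List.of("org.apache.tika.parser.DefaultParser", TESSERACT)),
                Map.of(PARSED_BY, List.of("org.apache.tika.parser.DefaultParser", TESSERACT)),
                Map.of(PARSED_BY, List.of("org.apache.tika.parser.DefaultParser")));
        System.out.println(countOcrEntries(docs)); // prints 2
    }
}
```

For a PDF with inline OCR, the rendered pages are not embedded files, so a count like this has no per-page entries to look at. That is the gap this issue addresses.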
--
This message was sent by Atlassian Jira
(v8.20.10#820010)