[
https://issues.apache.org/jira/browse/TIKA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158226#comment-17158226
]
ASF GitHub Bot commented on TIKA-3134:
--------------------------------------
tothd91 opened a new pull request #327:
URL: https://github.com/apache/tika/pull/327
> totalCharsPerPage and unmappedUnicodeCharsPerPage configuration
> ---------------------------------------------------------------
>
> Key: TIKA-3134
> URL: https://issues.apache.org/jira/browse/TIKA-3134
> Project: Tika
> Issue Type: Improvement
> Reporter: Dávid Tóth
> Priority: Major
>
> During PDF parsing, the decision to run OCR on a page is made in the
> endPage(PDPage page) method of the AbstractPDF2XHTML class, based on the
> page's totalCharsPerPage and unmappedUnicodeCharsPerPage counts. If either
> of these is less than 10 (a hardcoded number), the page is handled by OCR.
> Our improvement eliminates these hardcoded numbers; the thresholds are now
> configurable in the PDFParserConfig class.
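
To illustrate the change described above, here is a minimal, self-contained sketch of a threshold-based OCR decision with configurable limits. It does not use the Tika API: OcrThresholds, OcrDecisionSketch, shouldOcr, and the getter/setter names are hypothetical stand-ins, and the actual names added to PDFParserConfig by the pull request may differ. The comparison follows the rule as worded in the issue text (OCR when either count is below its threshold), and the default of 10 is the formerly hardcoded value.

```java
// Hypothetical stand-in for the new PDFParserConfig settings; these names
// are illustrative only and are not taken from the pull request.
final class OcrThresholds {
    // Both defaults keep the previously hardcoded value of 10.
    private int totalCharsPerPage = 10;
    private int unmappedUnicodeCharsPerPage = 10;

    int getTotalCharsPerPage() {
        return totalCharsPerPage;
    }

    void setTotalCharsPerPage(int value) {
        this.totalCharsPerPage = value;
    }

    int getUnmappedUnicodeCharsPerPage() {
        return unmappedUnicodeCharsPerPage;
    }

    void setUnmappedUnicodeCharsPerPage(int value) {
        this.unmappedUnicodeCharsPerPage = value;
    }
}

// Hypothetical helper standing in for the check made at the end of each page.
final class OcrDecisionSketch {
    // Returns true when either per-page count falls below its configured
    // threshold, which is the rule as the issue description states it.
    static boolean shouldOcr(int totalCharsPerPage,
                             int unmappedUnicodeCharsPerPage,
                             OcrThresholds thresholds) {
        return totalCharsPerPage < thresholds.getTotalCharsPerPage()
                || unmappedUnicodeCharsPerPage < thresholds.getUnmappedUnicodeCharsPerPage();
    }

    public static void main(String[] args) {
        OcrThresholds thresholds = new OcrThresholds();
        // A page with 5 extracted characters, checked against the default limits.
        System.out.println(shouldOcr(5, 200, thresholds));

        // Lowering the limit changes the outcome for the same page.
        thresholds.setTotalCharsPerPage(3);
        System.out.println(shouldOcr(5, 200, thresholds));
    }
}
```

Keeping the defaults equal to the old constant means existing callers see the same behavior unless they explicitly set a different threshold.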