[
https://issues.apache.org/jira/browse/TIKA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158571#comment-17158571
]
ASF GitHub Bot commented on TIKA-3134:
--------------------------------------
tballison commented on pull request #327:
URL: https://github.com/apache/tika/pull/327#issuecomment-658905767
@tothd91 thank you for opening this! It looks like there are quite a few
changes that are white-space only. Would it be possible to update so that the
diff includes only logic differences? Thank you!
> totalCharsPerPage and unmappedUnicodeCharsPerPage configuration
> ---------------------------------------------------------------
>
> Key: TIKA-3134
> URL: https://issues.apache.org/jira/browse/TIKA-3134
> Project: Tika
> Issue Type: Improvement
> Reporter: Dávid Tóth
> Priority: Major
>
> During PDF parsing, the decision to run OCR on a page is made in the
> endPage(PDPage page) method of the AbstractPDF2XHTML class, based on the
> values of totalCharsPerPage and unmappedUnicodeCharsPerPage. If either of
> these is less than 10 (a hardcoded value), the page is handled by OCR. Our
> improvement removes these hardcoded numbers and makes the thresholds
> configurable in the PDFParserConfig class.
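
For illustration only, the idea in the description above might be sketched as
follows. This is not the code from the pull request: the class, field, and
method names are hypothetical and are not Tika's API, and the direction of the
comparison simply follows the wording of the description, so the real parser
may differ. The sketch only shows the hardcoded 10 being replaced by values
read from a config object.

```java
// Hypothetical sketch; these names are illustrative and are NOT Tika's API.
final class OcrThresholdSketch {

    /** Holds the two per-page thresholds that were previously hardcoded as 10. */
    static final class PageOcrConfig {
        // Defaults keep the old behaviour (assumption: both defaults are 10).
        private int totalCharsPerPageThreshold = 10;
        private int unmappedUnicodeCharsPerPageThreshold = 10;

        int getTotalCharsPerPageThreshold() {
            return totalCharsPerPageThreshold;
        }

        void setTotalCharsPerPageThreshold(int threshold) {
            if (threshold < 0) {
                throw new IllegalArgumentException("threshold must be >= 0");
            }
            this.totalCharsPerPageThreshold = threshold;
        }

        int getUnmappedUnicodeCharsPerPageThreshold() {
            return unmappedUnicodeCharsPerPageThreshold;
        }

        void setUnmappedUnicodeCharsPerPageThreshold(int threshold) {
            if (threshold < 0) {
                throw new IllegalArgumentException("threshold must be >= 0");
            }
            this.unmappedUnicodeCharsPerPageThreshold = threshold;
        }
    }

    /**
     * Returns true when the page should be sent to OCR. Following the issue
     * description, a page qualifies when either counter is below its
     * configured threshold (assumption; the real check may differ).
     */
    static boolean shouldOcr(PageOcrConfig config,
                             int totalCharsPerPage,
                             int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < config.getTotalCharsPerPageThreshold()
                || unmappedUnicodeCharsPerPage
                        < config.getUnmappedUnicodeCharsPerPageThreshold();
    }
}
```

With the defaults, a page with 3 extracted characters would be sent to OCR;
setting both thresholds to 0 would disable the check in this sketch.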