[ https://issues.apache.org/jira/browse/TIKA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158686#comment-17158686 ]
Hudson commented on TIKA-3134:
------------------------------
SUCCESS: Integrated in Jenkins build tika-branch-1x #346 (See
[https://builds.apache.org/job/tika-branch-1x/346/])
TIKA-3134 -- fix bug and add unit tests (tallison:
[https://github.com/apache/tika/commit/e57c832a56b7917ff6da01af129c909aaa2ccf69])
* (edit) tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* (edit) tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java
> totalCharsPerPage and unmappedUnicodeCharsPerPage configuration
> ---------------------------------------------------------------
>
> Key: TIKA-3134
> URL: https://issues.apache.org/jira/browse/TIKA-3134
> Project: Tika
> Issue Type: Improvement
> Reporter: Dávid Tóth
> Priority: Major
>
> During PDF parsing, the decision to run OCR on a page is made in the
> endPage(PDPage page) method of the AbstractPDF2XHTML class, based on the
> values of totalCharsPerPage and unmappedUnicodeCharsPerPage. If either
> value is less than 10 (a hardcoded number), the page is handled by OCR.
> In our improvement we removed these hardcoded numbers; the thresholds are
> now configurable in the PDFParserConfig class.
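
For illustration only, the per-page decision described in the issue can be sketched as below. This is a minimal sketch that follows the issue text literally; the class OcrThresholds and its method names are hypothetical and are not the actual PDFParserConfig or AbstractPDF2XHTML API, and the real comparison in Tika may differ from the wording of the description.

```java
/**
 * Hypothetical holder for the two per-page thresholds.
 * This is NOT Tika's PDFParserConfig; names are assumed for illustration.
 */
final class OcrThresholds {
    private final int totalCharsThreshold;
    private final int unmappedUnicodeCharsThreshold;

    OcrThresholds(int totalCharsThreshold, int unmappedUnicodeCharsThreshold) {
        this.totalCharsThreshold = totalCharsThreshold;
        this.unmappedUnicodeCharsThreshold = unmappedUnicodeCharsThreshold;
    }

    /** Defaults mirror the previously hardcoded value of 10. */
    static OcrThresholds defaults() {
        return new OcrThresholds(10, 10);
    }

    /**
     * Follows the issue description literally: the page goes to OCR
     * when either counter is below its configured threshold.
     */
    boolean shouldOcr(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < totalCharsThreshold
                || unmappedUnicodeCharsPerPage < unmappedUnicodeCharsThreshold;
    }
}
```

With the defaults, a page with 3 extracted characters would be sent to OCR; constructing OcrThresholds with other values changes that cutoff without touching the decision code.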