[jira] [Commented] (TIKA-3134) totalCharsPerPage and unmappedUnicodeCharsPerPage configuration

Hudson (Jira) Wed, 15 Jul 2020 13:15:29 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158686#comment-17158686
 ]


Hudson commented on TIKA-3134:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #346 (See 
[https://builds.apache.org/job/tika-branch-1x/346/])
TIKA-3134 -- fix bug and add unit tests (tallison: 
[https://github.com/apache/tika/commit/e57c832a56b7917ff6da01af129c909aaa2ccf69])
* (edit) tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* (edit) tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java


> totalCharsPerPage and unmappedUnicodeCharsPerPage configuration
> ---------------------------------------------------------------
>
>                 Key: TIKA-3134
>                 URL: https://issues.apache.org/jira/browse/TIKA-3134
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Dávid Tóth
>            Priority: Major
>
> During PDF parsing, the decision to run OCR on a page is made in the 
> endPage(PDPage page) method of the AbstractPDF2XHTML class, based on the 
> page's totalCharsPerPage and unmappedUnicodeCharsPerPage counts. If either 
> count is less than 10 (a hardcoded number), the page is handled by OCR. 
> Our improvement removes these hardcoded numbers and makes them 
> configurable in the PDFParserConfig class.
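
To illustrate the change described in the issue, here is a minimal sketch of a threshold check driven by configuration instead of a literal 10. The class and member names (OcrThresholdSketch, Thresholds, shouldOcr) are hypothetical and are not Tika's API. The comparison follows the wording of the issue text, so the actual conditions in AbstractPDF2XHTML may differ.

```java
// Hypothetical sketch only: these names are assumptions, not Tika classes.
final class OcrThresholdSketch {

    /** Holds the two per-page thresholds that replace the hardcoded 10. */
    static final class Thresholds {
        final int totalCharsPerPage;
        final int unmappedUnicodeCharsPerPage;

        Thresholds(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
            this.totalCharsPerPage = totalCharsPerPage;
            this.unmappedUnicodeCharsPerPage = unmappedUnicodeCharsPerPage;
        }
    }

    /**
     * Returns true when the page should be sent to OCR. As worded in the
     * issue, that is when either count is below its configured threshold.
     */
    static boolean shouldOcr(Thresholds t, int totalChars, int unmappedChars) {
        return totalChars < t.totalCharsPerPage
                || unmappedChars < t.unmappedUnicodeCharsPerPage;
    }

    public static void main(String[] args) {
        // Same behavior as the old hardcoded value of 10.
        Thresholds defaults = new Thresholds(10, 10);
        System.out.println(shouldOcr(defaults, 3, 40));   // true
        System.out.println(shouldOcr(defaults, 500, 40)); // false

        // Thresholds of 0 mean neither count can fall below them.
        Thresholds never = new Thresholds(0, 0);
        System.out.println(shouldOcr(never, 3, 0));       // false
    }
}
```

Keeping both limits in one small value object means a caller can tune or disable the check without touching the parsing code.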



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
