Dávid Tóth created TIKA-3134:
--------------------------------
Summary: totalCharsPerPage and unmappedUnicodeCharsPerPage configuration
Key: TIKA-3134
URL: https://issues.apache.org/jira/browse/TIKA-3134
Project: Tika
Issue Type: Improvement
Reporter: Dávid Tóth
During PDF parsing, the decision to run OCR on a page is made in the
endPage(PDPage page) method of the AbstractPDF2XHTML class, based on the values
of totalCharsPerPage and unmappedUnicodeCharsPerPage. If either value is less
than 10 (a hardcoded number), the page is handled by OCR. This improvement
removes the hardcoded numbers and makes both thresholds configurable in the
PDFParserConfig class.
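
A rough sketch of the idea, not the actual patch: the class, field, and method
names below (OcrThresholdConfig, OcrDecision, shouldOcr, and the two threshold
setters) are hypothetical stand-ins and are not claimed to match the real
PDFParserConfig API. The comparison direction follows the description in this
issue and may differ from the committed code.

```java
// Hypothetical stand-in for the threshold settings this issue proposes to
// expose through PDFParserConfig. Names are illustrative assumptions.
class OcrThresholdConfig {
    // Both defaults keep the previously hardcoded value of 10.
    private int totalCharsPerPageThreshold = 10;
    private int unmappedUnicodeCharsPerPageThreshold = 10;

    int getTotalCharsPerPageThreshold() {
        return totalCharsPerPageThreshold;
    }

    void setTotalCharsPerPageThreshold(int threshold) {
        this.totalCharsPerPageThreshold = threshold;
    }

    int getUnmappedUnicodeCharsPerPageThreshold() {
        return unmappedUnicodeCharsPerPageThreshold;
    }

    void setUnmappedUnicodeCharsPerPageThreshold(int threshold) {
        this.unmappedUnicodeCharsPerPageThreshold = threshold;
    }
}

// Hypothetical helper mirroring the per-page check described for endPage().
class OcrDecision {
    // Returns true when either counter falls below its configured threshold,
    // as the issue text describes the rule.
    static boolean shouldOcr(int totalCharsPerPage,
                             int unmappedUnicodeCharsPerPage,
                             OcrThresholdConfig config) {
        return totalCharsPerPage < config.getTotalCharsPerPageThreshold()
                || unmappedUnicodeCharsPerPage
                        < config.getUnmappedUnicodeCharsPerPageThreshold();
    }
}
```

With the defaults, a page with 5 extracted characters would be sent to OCR, as
before. Raising the total-characters threshold to 100 would also send a page
with 50 characters to OCR, which the hardcoded value did not allow.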
--
This message was sent by Atlassian Jira
(v8.3.4#803005)