[ https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2624.
-------------------------------
Fix Version/s: 1.23
Resolution: Fixed
Thank you, [~ewanmellor-2] and [~epugh]!
> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> ---------------------------------------------------------------------
>
> Key: TIKA-2624
> URL: https://issues.apache.org/jira/browse/TIKA-2624
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Reporter: Ewan Mellor
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.23
>
>
> Tika has two properties in {{PDFParser.properties}} that control what happens
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract
> for OCR. These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default
> 2.0).
> {{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which only writes it into
> the image metadata (i.e. it doesn't control scaling at all; it's just an
> advertised metadata field).
> {{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which
> uses it to specify the scale for rendering. This value is such that 1.0 ==
> 72dpi, and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then
> advertising 300dpi in the image metadata. This makes no sense to me, and is
> surely going to confuse Tesseract.
> Instead of doing this, we should remove {{ocrImageScale}}, and use the same
> DPI value in both places.
> We should keep the existing default DPI value, since Tesseract is trained at
> 300dpi by default, so this will mean that all stages between PDFRenderer and
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the
> PDF rendering and Tesseract will be roughly 4.3x larger in pixel count
> ((300/144)^2, going from 144dpi to 300dpi). This will have a memory and
> temporary disk space impact, but I think that it's still best to have the
> whole pipeline using 300dpi. People who have memory constraints will need to
> reduce {{ocrDPI}} and make the corresponding changes on the Tesseract side.
>
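For illustration, the arithmetic in the description above can be sketched in plain Java. This is only a sketch: the class and method names ({{OcrDpiSketch}}, {{scaleToDpi}}, {{dpiToScale}}, {{pixelCount}}) are hypothetical and are not part of Tika or PDFBox, and the US Letter page size (612 x 792 points) is an assumed example. The only facts taken from the issue are that a render scale of 1.0 corresponds to 72dpi and that the defaults were {{ocrImageScale}} 2.0 and {{ocrDPI}} 300.

```java
public class OcrDpiSketch {
    // PDF user space has 72 units per inch, so a render scale of 1.0 == 72 DPI.
    static final double POINTS_PER_INCH = 72.0;

    // Effective rendering DPI for a given render scale (hypothetical helper).
    static double scaleToDpi(double scale) {
        return scale * POINTS_PER_INCH;
    }

    // Render scale needed to actually render at the given DPI (hypothetical helper).
    static double dpiToScale(double dpi) {
        return dpi / POINTS_PER_INCH;
    }

    // Pixel count of a page of the given size (in points) rendered at the given DPI.
    static long pixelCount(double widthPt, double heightPt, double dpi) {
        long w = Math.round(widthPt / POINTS_PER_INCH * dpi);
        long h = Math.round(heightPt / POINTS_PER_INCH * dpi);
        return w * h;
    }

    public static void main(String[] args) {
        double oldScale = 2.0;   // previous ocrImageScale default
        double claimedDpi = 300; // ocrDPI default written to the image metadata

        double renderedDpi = scaleToDpi(oldScale);
        System.out.printf("rendered at %.0f dpi, advertised as %.0f dpi%n",
                renderedDpi, claimedDpi);
        System.out.printf("scale needed for %.0f dpi: %.4f%n",
                claimedDpi, dpiToScale(claimedDpi));

        // Assumed example page: US Letter, 612 x 792 points.
        long before = pixelCount(612, 792, renderedDpi);
        long after = pixelCount(612, 792, claimedDpi);
        System.out.printf("pixels before: %d, after: %d, ratio: %.2f%n",
                before, after, (double) after / before);
    }
}
```

The pixel ratio depends only on the two DPI values, not on the page size, which is why the growth in temporary image size applies uniformly across documents.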