[
https://issues.apache.org/jira/browse/PDFBOX-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ewan Mellor closed PDFBOX-4178.
---
Resolution: Invalid
This was supposed to be reported against Tika. Ignore this ticket. Apologies.
> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> -
>
> Key: PDFBOX-4178
> URL: https://issues.apache.org/jira/browse/PDFBOX-4178
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.9
> Reporter: Ewan Mellor
> Priority: Major
>
> Tika has two properties in `PDFParser.properties` that control what happens
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract
> for OCR. These are `ocrDPI` (default 300) and `ocrImageScale` (default 2.0).
> `ocrDPI` is passed to `ImageIOUtil.writeImage`, which only records it in the
> image metadata (i.e. it doesn't control scaling at all, it's just an
> advertised metadata field).
> `ocrImageScale` is passed to PDFBox's `PDFRenderer.renderImage`, which uses
> it to specify the scale for rendering. This value is such that 1.0 == 72dpi,
> and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then
> advertising 300dpi in the image metadata. This makes no sense to me, and is
> surely going to confuse Tesseract.
> Instead of doing this, we should remove `ocrImageScale`, and use the same DPI
> value for rendering as we advertise in the image metadata.
> We should keep the existing default DPI value, since Tesseract is trained at
> 300dpi by default, so this will mean that all stages between PDFRenderer and
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the
> PDF rendering and Tesseract will contain roughly 4.3x as many pixels (144dpi
> to 300dpi in each dimension, i.e. (300/144)^2). This will have a memory and
> temporary disk space impact, but I think that it's still
> best to have the whole pipeline using 300dpi. People who have memory
> constraints will need to reduce ocrDPI and make the corresponding changes on
> the Tesseract side.
>
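The arithmetic in the quoted report can be checked with a small sketch. It relies only on the report's statement that a render scale of 1.0 corresponds to 72dpi. The class and method names are hypothetical, the US Letter page size (612 x 792 points) is an illustrative assumption, and nothing here calls Tika or PDFBox.

```java
// Hypothetical helper for checking the DPI arithmetic in the report.
// It does not use Tika or PDFBox; it only encodes "scale 1.0 == 72dpi".
final class OcrDpiSketch {
    // The report states that a render scale of 1.0 corresponds to 72dpi.
    static final float DPI_AT_SCALE_ONE = 72f;

    // Effective rendering DPI for a given render scale.
    static float scaleToDpi(float scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    // Render scale needed to obtain a given DPI.
    static float dpiToScale(float dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    // Pixel length of a page side given in points (1/72 inch) at a given DPI.
    static int pixels(float points, float dpi) {
        return Math.round(points * dpi / DPI_AT_SCALE_ONE);
    }

    // Factor by which the pixel count grows when the DPI changes.
    static double pixelCountRatio(float fromDpi, float toDpi) {
        double linear = toDpi / fromDpi;
        return linear * linear;
    }

    public static void main(String[] args) {
        float renderedDpi = scaleToDpi(2.0f); // default ocrImageScale from the report
        float advertisedDpi = 300f;           // default ocrDPI from the report

        // Assumed example page: US Letter, 612 x 792 points.
        int wNow = pixels(612f, renderedDpi);   // 1224
        int hNow = pixels(792f, renderedDpi);   // 1584
        int wNew = pixels(612f, advertisedDpi); // 2550
        int hNew = pixels(792f, advertisedDpi); // 3300

        System.out.println("rendered dpi:   " + renderedDpi);
        System.out.println("advertised dpi: " + advertisedDpi);
        System.out.println("current image:  " + wNow + " x " + hNow);
        System.out.println("at 300dpi:      " + wNew + " x " + hNew);
        System.out.println("pixel ratio:    " + pixelCountRatio(renderedDpi, advertisedDpi));
    }
}
```

For the assumed page size, the rendered image grows from 1224 x 1584 to 2550 x 3300 pixels. That is a little more than four times as many pixels, consistent with the size impact the report anticipates.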