[ https://issues.apache.org/jira/browse/PDFBOX-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ewan Mellor closed PDFBOX-4178.
-------------------------------
    Resolution: Invalid

This was supposed to be reported against Tika. Ignore this ticket. Apologies.

> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-4178
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4178
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.9
>            Reporter: Ewan Mellor
>            Priority: Major
>
> Tika has two properties in `PDFParser.properties` that control what happens
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract
> for OCR. These are `ocrDPI` (default 300) and `ocrImageScale` (default 2.0).
>
> `ocrDPI` is passed to `ImageIOUtil.writeImage`, which uses it as the metadata
> in the image (i.e. it doesn't control scaling at all, it's just an advertised
> metadata field).
>
> `ocrImageScale` is passed to PDFBox's `PDFRenderer.renderImage`, which uses
> it to specify the scale for rendering. This value is such that 1.0 == 72dpi,
> and therefore Tika's default is to request 144dpi for rendering.
>
> This means that Tika is asking PDFBox to render at 144dpi, and then
> advertising 300dpi in the image metadata. This makes no sense to me, and is
> surely going to confuse Tesseract.
>
> Instead of doing this, we should remove `ocrImageScale`, and use the same DPI
> value for rendering as we advertise in the image metadata.
>
> We should keep the existing default DPI value, since Tesseract is trained at
> 300dpi by default, so this will mean that all stages between PDFRenderer and
> Tesseract are defaulting to 300dpi.
>
> This change will have the side-effect that the temporary images between the
> PDF rendering and Tesseract will be roughly 4x larger in pixel count (144dpi
> to 300dpi). This will have a memory and temporary disk space impact, but I
> think that it's still best to have the whole pipeline using 300dpi. People
> who have memory constraints will need to reduce `ocrDPI` and make the
> corresponding changes on the Tesseract side.
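
The arithmetic behind the ticket can be sketched in a few lines of plain Java. This is only an illustration of the numbers quoted above, not Tika or PDFBox code: the class `OcrDpiSketch` and its method names are hypothetical, and the only facts taken from the ticket are that a render scale of 1.0 corresponds to 72dpi and that the defaults are `ocrImageScale` = 2.0 and `ocrDPI` = 300.

```java
// Illustrative sketch only: OcrDpiSketch and its methods are hypothetical
// helpers, not part of Tika or PDFBox.
final class OcrDpiSketch {

    // Per the ticket, a PDFRenderer scale of 1.0 corresponds to 72 dpi.
    static final float DPI_AT_SCALE_ONE = 72f;

    // Effective rendering DPI for a given render scale.
    static float scaleToDpi(float scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    // Render scale needed to reach a target DPI.
    static float dpiToScale(float dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    // Growth in pixel count when moving from one DPI to another.
    // Both width and height scale linearly, so the ratio is squared.
    static double pixelCountRatio(double fromDpi, double toDpi) {
        double linear = toDpi / fromDpi;
        return linear * linear;
    }

    public static void main(String[] args) {
        float ocrImageScale = 2.0f; // default quoted in the ticket
        float ocrDpi = 300f;        // default quoted in the ticket

        float renderedDpi = scaleToDpi(ocrImageScale);
        System.out.println("Rendered at (dpi):   " + renderedDpi);
        System.out.println("Advertised (dpi):    " + ocrDpi);
        System.out.println("Scale for 300 dpi:   " + dpiToScale(ocrDpi));
        System.out.println("Pixel count growth:  " + pixelCountRatio(renderedDpi, ocrDpi));
    }
}
```

With the defaults, the rendered resolution (144dpi) and the advertised resolution (300dpi) disagree, and rendering at 300dpi instead multiplies the pixel count by a little over four. PDFBox's `PDFRenderer` also provides `renderImageWithDPI`, which takes the DPI directly and so avoids the scale conversion.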