[ https://issues.apache.org/jira/browse/PDFBOX-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewan Mellor closed PDFBOX-4178.
-------------------------------
    Resolution: Invalid

This was supposed to be reported against Tika. Ignore this ticket.  Apologies.

 

> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-4178
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4178
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.9
>            Reporter: Ewan Mellor
>            Priority: Major
>
> Tika has two properties in `PDFParser.properties` that control what happens 
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract 
> for OCR.  These are `ocrDPI` (default 300) and `ocrImageScale` (default 2.0).
> `ocrDPI` is passed to `ImageIOUtil.writeImage`, which only records it as 
> resolution metadata in the image (i.e. it doesn't control scaling at all; 
> it's just an advertised metadata field).
> `ocrImageScale` is passed to PDFBox's `PDFRenderer.renderImage`, which uses 
> it to specify the scale for rendering.  This value is such that 1.0 == 72dpi, 
> and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then 
> advertising 300dpi in the image metadata.  This makes no sense to me, and is 
> surely going to confuse Tesseract.
> Instead of doing this, we should remove `ocrImageScale`, and use the same DPI 
> value for rendering as we advertise in the image metadata.
> We should keep the existing default DPI value, since Tesseract is trained at 
> 300dpi by default, so this will mean that all stages between PDFRenderer and 
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the 
> PDF rendering and Tesseract will contain roughly 4.3x as many pixels (144dpi 
> to 300dpi is about 2.08x per side).  This will have a memory and temporary 
> disk space impact, but I think that it's still best to have the whole 
> pipeline using 300dpi.  People who have memory constraints will need to 
> reduce `ocrDPI` and make the corresponding changes on the Tesseract side.
>  
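
The arithmetic in the quoted description can be checked with a small sketch. It assumes only what the report states, namely that a render scale of 1.0 corresponds to 72dpi. The class and method names (`OcrDpiSketch`, `scaleToDpi`, `dpiToScale`, `pixelCountRatio`, `pixelsAlongEdge`) are hypothetical and are not part of Tika or PDFBox, and the US Letter page size is only an illustrative assumption.

```java
public class OcrDpiSketch {
    // Assumption taken from the report: a render scale of 1.0 equals 72dpi.
    public static final float DPI_AT_SCALE_ONE = 72f;

    // DPI that a given render scale effectively produces.
    public static float scaleToDpi(float scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    // Render scale needed to actually reach a target DPI.
    public static float dpiToScale(float dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    // Growth in total pixel count when moving from one DPI to another.
    // Both width and height scale linearly, so the ratio is squared.
    public static double pixelCountRatio(float fromDpi, float toDpi) {
        double linear = (double) toDpi / fromDpi;
        return linear * linear;
    }

    // Pixel length of a page edge of the given physical size.
    public static int pixelsAlongEdge(float inches, float dpi) {
        return Math.round(inches * dpi);
    }

    public static void main(String[] args) {
        float defaultScale = 2.0f;   // ocrImageScale default per the report
        float advertisedDpi = 300f;  // ocrDPI default per the report

        float renderedDpi = scaleToDpi(defaultScale);
        System.out.println("Rendered DPI:   " + renderedDpi);
        System.out.println("Advertised DPI: " + advertisedDpi);
        System.out.println("Scale needed for advertised DPI: "
                + dpiToScale(advertisedDpi));
        System.out.println("Pixel count ratio: "
                + pixelCountRatio(renderedDpi, advertisedDpi));

        // Illustrative page size (US Letter, 8.5in x 11in).
        System.out.println("Letter page at rendered DPI:   "
                + pixelsAlongEdge(8.5f, renderedDpi) + " x "
                + pixelsAlongEdge(11f, renderedDpi));
        System.out.println("Letter page at advertised DPI: "
                + pixelsAlongEdge(8.5f, advertisedDpi) + " x "
                + pixelsAlongEdge(11f, advertisedDpi));
    }
}
```

Under these assumptions, a Letter-sized page grows from 1224 x 1584 pixels at the default scale to 2550 x 3300 pixels at 300dpi, which is where the "about 4x" increase in temporary image size comes from.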



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
