[ https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423005#comment-16423005 ]

ASF GitHub Bot commented on TIKA-2624:
--------------------------------------

ewanmellor opened a new pull request #232: Fix for TIKA-2624 contributed by 
ewanmellor.
URL: https://github.com/apache/tika/pull/232
 
 
   Change AbstractPDF2XHTML.doOCROnCurrentPage to use the same DPI value
   (PDFParserConfig.ocrDPI) for both the PDF rendering and the image metadata.
   
   Previously, the PDF was rendered using ocrImageScale (default 2.0 ==
   144dpi), and ocrDPI (default 300) was then written into the image
   metadata.  Keeping these two values independent makes no sense, and is
   surely going to confuse Tesseract when the image metadata does not match
   the pixel data.
   
   This change means that ocrDPI drives both values, and ocrImageScale is
   removed.  This also switches from PDFRenderer.renderImage to
   PDFRenderer.renderImageWithDPI, but that method is just a thin wrapper
   (it divides the DPI by 72 to get the scale), so the switch only makes it
   clearer what's going on.
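
The relationship between the two settings can be sketched with plain arithmetic. This is a minimal illustration only, not Tika or PDFBox code: the class and method names below are hypothetical, and the one assumption carried over from the description is that a render scale of 1.0 corresponds to 72dpi.

```java
// Hypothetical helper, not part of Tika or PDFBox.
class DpiScale {
    // Assumption from the description above: scale 1.0 == 72 dpi.
    static final float POINTS_PER_INCH = 72f;

    // Effective DPI produced by rendering at the given scale.
    public static float scaleToDpi(float scale) {
        return scale * POINTS_PER_INCH;
    }

    // Scale needed to render at the given DPI.
    public static float dpiToScale(float dpi) {
        return dpi / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        float oldScale = 2.0f;   // old ocrImageScale default
        int ocrDpi = 300;        // ocrDPI default

        // Before the change: rendered at one DPI, advertised at another.
        System.out.println("rendered dpi   = " + scaleToDpi(oldScale));
        System.out.println("advertised dpi = " + ocrDpi);

        // After the change: the scale is derived from ocrDPI, so both agree.
        System.out.println("scale for ocrDPI = " + dpiToScale(ocrDpi));
    }
}
```

With the old defaults the first two values differ (144 versus 300), which is the mismatch this change removes.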
   
   This change will have the side-effect that the temporary images between the
   PDF rendering and Tesseract will be roughly 4x larger in pixel count (144dpi
   to 300dpi is a factor of (300/144)^2, about 4.3).  This will increase memory
   and temporary disk space usage, but it will ensure that the whole pipeline
   uses 300dpi by default.  People who have memory constraints will need to
   reduce ocrDPI and make the corresponding changes on the Tesseract side.
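
To get a feel for the size impact, the following rough estimate compares pixel counts at the two resolutions. It is a sketch under stated assumptions: the page size (US Letter, 8.5in x 11in) and the class and method names are hypothetical choices for illustration, and the actual image dimensions produced by the renderer may differ by rounding.

```java
// Hypothetical estimator, not part of Tika or PDFBox.
class OcrImageSize {
    // Approximate pixel count of a page rendered at the given DPI.
    public static long pixels(double widthInches, double heightInches, int dpi) {
        long w = Math.round(widthInches * dpi);
        long h = Math.round(heightInches * dpi);
        return w * h;
    }

    // Ratio of pixel counts between two DPI values for the same page.
    public static double growth(double widthInches, double heightInches,
                                int oldDpi, int newDpi) {
        return (double) pixels(widthInches, heightInches, newDpi)
                / pixels(widthInches, heightInches, oldDpi);
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter.
        double w = 8.5, h = 11.0;
        System.out.println("pixels at 144 dpi: " + pixels(w, h, 144));
        System.out.println("pixels at 300 dpi: " + pixels(w, h, 300));
        System.out.println("growth factor:     " + growth(w, h, 144, 300));
    }
}
```

The uncompressed in-memory size scales with the pixel count times the bytes per pixel of the chosen image type, so lowering ocrDPI reduces it quadratically.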

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2624
>                 URL: https://issues.apache.org/jira/browse/TIKA-2624
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Assignee: Tim Allison
>            Priority: Major
>
> Tika has two properties in {{PDFParser.properties}} that control what happens 
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract 
> for OCR.  These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default 
> 2.0).
> {{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which writes it into the 
> image metadata (i.e. it doesn't control scaling at all; it's just an 
> advertised metadata field).
> {{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which 
> uses it to specify the scale for rendering.  This value is such that 1.0 == 
> 72dpi, and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then 
> advertising 300dpi in the image metadata.  This makes no sense to me, and is 
> surely going to confuse Tesseract.
> Instead of doing this, we should remove {{ocrImageScale}}, and use the same 
> DPI value in both places.
> We should keep the existing default DPI value, since Tesseract is trained at 
> 300dpi by default, so this will mean that all stages between PDFRenderer and 
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the 
> PDF rendering and Tesseract will be roughly 4x larger in pixel count (144dpi 
> to 300dpi is a factor of (300/144)^2, about 4.3).  This will increase memory 
> and temporary disk space usage, but I think that it's still best to have the 
> whole pipeline using 300dpi.  People who have memory constraints will need 
> to reduce ocrDPI and make the corresponding changes on the Tesseract side.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
