[jira] [Closed] (PDFBOX-4178) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewan Mellor closed PDFBOX-4178.
---
Resolution: Invalid

This was supposed to be reported against Tika. Ignore this ticket.  Apologies.

 

> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> -
>
>              Key: PDFBOX-4178
>              URL: https://issues.apache.org/jira/browse/PDFBOX-4178
>          Project: PDFBox
>       Issue Type: Bug
>       Components: Parsing
> Affects Versions: 2.0.9
>         Reporter: Ewan Mellor
>         Priority: Major
>
> Tika has two properties in `PDFParser.properties` that control what happens 
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract 
> for OCR.  These are `ocrDPI` (default 300) and `ocrImageScale` (default 2.0).
> `ocrDPI` is passed to `ImageIOUtil.writeImage`, which uses it as the metadata 
> in the image (i.e. it doesn't control scaling at all, it's just an advertised 
> metadata field).
> `ocrImageScale` is passed to PDFBox's `PDFRenderer.renderImage`, which uses 
> it to specify the scale for rendering.  This value is such that 1.0 == 72dpi, 
> and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then 
> advertising 300dpi in the image metadata.  This makes no sense to me, and is 
> surely going to confuse Tesseract.
> Instead of doing this, we should remove `ocrImageScale`, and use the same DPI 
> value for rendering as we advertise in the image metadata.
> We should keep the existing default DPI value, since Tesseract is trained at 
> 300dpi by default, so this will mean that all stages between PDFRenderer and 
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the 
> PDF rendering and Tesseract will be roughly 4.3x larger in pixel count 
> (144dpi to 300dpi, i.e. (300/144)^2).  This will have a memory and temporary 
> disk space impact, but I think that it's still best to have the whole 
> pipeline using 300dpi.  People who have memory constraints will need to 
> reduce `ocrDPI` and make the corresponding changes on the Tesseract side.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Created] (PDFBOX-4178) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2018-04-02 Thread Ewan Mellor (JIRA)
Ewan Mellor created PDFBOX-4178:
---

 Summary: Rendering PDFs for OCR with Tesseract uses different DPI than claimed
 Key: PDFBOX-4178
 URL: https://issues.apache.org/jira/browse/PDFBOX-4178
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.9
Reporter: Ewan Mellor


Tika has two properties in `PDFParser.properties` that control what happens in 
AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract for 
OCR.  These are `ocrDPI` (default 300) and `ocrImageScale` (default 2.0).

`ocrDPI` is passed to `ImageIOUtil.writeImage`, which uses it as the metadata 
in the image (i.e. it doesn't control scaling at all, it's just an advertised 
metadata field).

`ocrImageScale` is passed to PDFBox's `PDFRenderer.renderImage`, which uses it 
to specify the scale for rendering.  This value is such that 1.0 == 72dpi, and 
therefore Tika's default is to request 144dpi for rendering.
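
To make the scale-to-DPI relationship above concrete, here is a minimal sketch of the conversion. It assumes only what the description states (a scale of 1.0 corresponds to 72dpi). The class `OcrScale` and its methods are hypothetical names for illustration and are not part of Tika or PDFBox.

```java
// Hypothetical helper for illustration only; not a Tika or PDFBox API.
final class OcrScale {
    // Assumption taken from the description: scale 1.0 == 72dpi.
    static final double DPI_AT_SCALE_ONE = 72.0;

    // Effective rendering DPI for a given render scale.
    static double scaleToDpi(double scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    // Render scale needed to reach a given DPI.
    static double dpiToScale(double dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    public static void main(String[] args) {
        // Default ocrImageScale of 2.0 gives 144.0
        System.out.println(scaleToDpi(2.0));
        // Scale that would be needed to actually render at the default ocrDPI of 300
        System.out.println(dpiToScale(300.0));
    }
}
```

Under this assumption, matching the advertised 300dpi would need a scale a little above 4.1 rather than the default 2.0.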

This means that Tika is asking PDFBox to render at 144dpi, and then advertising 
300dpi in the image metadata.  This makes no sense to me, and is surely going 
to confuse Tesseract.

Instead of doing this, we should remove `ocrImageScale`, and use the same DPI 
value for rendering as we advertise in the image metadata.

We should keep the existing default DPI value, since Tesseract is trained at 
300dpi by default, so this will mean that all stages between PDFRenderer and 
Tesseract are defaulting to 300dpi.

This change will have the side-effect that the temporary images between the PDF 
rendering and Tesseract will be roughly 4.3x larger in pixel count (144dpi to 
300dpi, i.e. (300/144)^2).  This will have a memory and temporary disk space 
impact, but I think that it's still best to have the whole pipeline using 
300dpi.  People who have memory constraints will need to reduce `ocrDPI` and 
make the corresponding changes on the Tesseract side.
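
The size increase described above can be checked with a short calculation. This is a rough sketch: the US Letter page size (8.5 x 11 inches) is an assumption chosen for illustration, `RenderCost` is a hypothetical class, and the result is a pixel count, not the size of a compressed image file on disk.

```java
// Hypothetical helper for illustration only; not a Tika or PDFBox API.
final class RenderCost {
    // Number of pixels in a page rendered at the given DPI.
    static long pixelCount(double widthInches, double heightInches, double dpi) {
        long width = Math.round(widthInches * dpi);
        long height = Math.round(heightInches * dpi);
        return width * height;
    }

    // Pixel count scales with the square of the linear DPI ratio.
    static double areaRatio(double fromDpi, double toDpi) {
        double linear = toDpi / fromDpi;
        return linear * linear;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 8.5 x 11 inches.
        long at144 = pixelCount(8.5, 11.0, 144.0); // 1224 x 1584 pixels
        long at300 = pixelCount(8.5, 11.0, 300.0); // 2550 x 3300 pixels
        System.out.println(at144);                 // 1938816
        System.out.println(at300);                 // 8415000
        System.out.println(areaRatio(144.0, 300.0));
    }
}
```

For this assumed page size the ratio works out to a little over 4.3, in line with the increase quoted above.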

 


