[ https://issues.apache.org/jira/browse/TIKA-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-4511:
------------------------------
    Description: 
tesseract at least as recently as 5.4.1 with leptonica-1.82.0 cannot process 
compressed bmp. See: [https://github.com/tesseract-ocr/tesseract/issues/2558]

For OCR to work on these images, we'd have to use imagemagick or something 
similar to convert these to uncompressed bmp. As a first step, we'd need to 
detect compressed bmp vs uncompressed.

This ticket focuses solely on detection.

It looks like if the byte at offset 30 (the first byte of the compression 
field in the DIB header) is non-zero, then we have a compressed bmp: 
[https://en.wikipedia.org/wiki/BMP_file_format]
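
A minimal sketch of that check in Java, not the actual Tika implementation. The class and method names are hypothetical. It assumes the file starts with the "BM" signature and that the DIB header is a BITMAPINFOHEADER or later (header size of at least 40 bytes), so that a 4-byte little-endian compression field sits at offset 30. It reads the whole field rather than a single byte:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of Tika.
public class BmpCompressionSniffer {

    // Offset of the 4-byte DIB header size field (right after the 14-byte file header).
    private static final int DIB_HEADER_SIZE_OFFSET = 14;

    // Offset of the 4-byte compression field, assuming BITMAPINFOHEADER or later.
    private static final int COMPRESSION_OFFSET = 30;

    /**
     * Returns true if the leading bytes look like a BMP whose compression
     * field is non-zero. Returns false for non-BMP input, for input that is
     * too short, and for the older 12-byte BITMAPCOREHEADER, which has no
     * compression field.
     */
    public static boolean isCompressedBmp(byte[] header) {
        if (header == null || header.length < COMPRESSION_OFFSET + 4) {
            return false;
        }
        if (header[0] != 'B' || header[1] != 'M') {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int dibHeaderSize = buf.getInt(DIB_HEADER_SIZE_OFFSET);
        if (dibHeaderSize < 40) {
            return false;
        }
        // 0 is BI_RGB (uncompressed); the ticket treats any other value as compressed.
        return buf.getInt(COMPRESSION_OFFSET) != 0;
    }
}
```

One caveat with the plain non-zero rule: a value of 3 (BI_BITFIELDS) indicates RGB bit-field masks rather than compressed pixel data, so a stricter check might want to treat that value separately.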

 

I propose: {{image/bmp;format=compressed}}

  was:
tesseract at least as recently as 5.4.1 with leptonica-1.82.0 cannot process 
compressed bmp. See: [https://github.com/tesseract-ocr/tesseract/issues/2558]

For OCR to work on these images, we'd have to use imagemagick or something 
similar to convert these to uncompressed bmp. As a first step, we'd need to 
detect compressed bmp vs uncompressed.

This ticket focuses solely on detection.

It looks like if the byte at offset 30 (the first byte of the compression 
field in the DIB header) is non-zero, then we have a compressed bmp: 
[https://en.wikipedia.org/wiki/BMP_file_format]

 

I propose: {{image/bmp; compressed=true}}


> Detect compressed bmp
> ---------------------
>
>                 Key: TIKA-4511
>                 URL: https://issues.apache.org/jira/browse/TIKA-4511
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>
> tesseract at least as recently as 5.4.1 with leptonica-1.82.0 cannot process 
> compressed bmp. See: [https://github.com/tesseract-ocr/tesseract/issues/2558]
> For OCR to work on these images, we'd have to use imagemagick or something 
> similar to convert these to uncompressed bmp. As a first step, we'd need to 
> detect compressed bmp vs uncompressed.
> This ticket focuses solely on detection.
> It looks like if the byte at offset 30 (the first byte of the compression 
> field in the DIB header) is non-zero, then we have a compressed bmp: 
> [https://en.wikipedia.org/wiki/BMP_file_format]
>  
> I propose: {{image/bmp;format=compressed}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
