Tim Allison created TIKA-4511:
---------------------------------
Summary: Detect compressed bmp
Key: TIKA-4511
URL: https://issues.apache.org/jira/browse/TIKA-4511
Project: Tika
Issue Type: Task
Reporter: Tim Allison
Tesseract, at least as recently as 5.4.1 with leptonica-1.82.0, cannot process
compressed BMP files. See: [https://github.com/tesseract-ocr/tesseract/issues/2558]
For OCR to work on these images, we'd have to use ImageMagick or something
similar to convert them to uncompressed BMP. As a first step, we need to
distinguish compressed BMP from uncompressed BMP.
This ticket focuses solely on detection.
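It looks like if the byte at offset 30 is non-zero, then we have a compressed bmp.
That offset is the start of the 4-byte little-endian compression field in a
BITMAPINFOHEADER (14-byte file header plus 16 bytes into the DIB header), and
a value of 0 means uncompressed: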
[https://en.wikipedia.org/wiki/BMP_file_format]
I propose: {{image/bmp; compressed=true}}
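
A rough sketch of what the check could look like, not an actual patch. The class and method names ({{BmpCompressionSketch}}, {{isCompressedBmp}}, {{mediaType}}) are hypothetical and are not existing Tika APIs. It assumes the caller passes at least the first 34 bytes of the stream, and it goes slightly beyond the single-byte test by reading the whole 4-byte field and by skipping files whose DIB header is smaller than 40 bytes (the older core header has no compression field). Note that it treats any non-zero value as compressed, as proposed above, which also includes values such as 3 (bitfields) that are not compression in the strict sense.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BmpCompressionSketch {

    // 14-byte file header + 16 bytes into the DIB header.
    private static final int COMPRESSION_OFFSET = 30;
    // Size of BITMAPINFOHEADER; smaller DIB headers carry no compression field.
    private static final int MIN_INFO_HEADER_SIZE = 40;

    /**
     * Returns true if the prefix looks like a BMP whose compression field is non-zero.
     * Assumption: prefix holds at least the first 34 bytes of the file.
     */
    public static boolean isCompressedBmp(byte[] prefix) {
        if (prefix == null || prefix.length < COMPRESSION_OFFSET + 4) {
            return false;
        }
        // Magic bytes "BM" at the start of the file header.
        if (prefix[0] != 'B' || prefix[1] != 'M') {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(prefix).order(ByteOrder.LITTLE_ENDIAN);
        // DIB header size is the first field after the 14-byte file header.
        int dibHeaderSize = buf.getInt(14);
        if (dibHeaderSize < MIN_INFO_HEADER_SIZE) {
            return false;
        }
        // 0 means uncompressed; anything else is reported as compressed here.
        return buf.getInt(COMPRESSION_OFFSET) != 0;
    }

    /**
     * Maps the check to the media type string proposed in this ticket.
     * Assumption: the caller has already identified the stream as image/bmp.
     */
    public static String mediaType(byte[] prefix) {
        return isCompressedBmp(prefix) ? "image/bmp; compressed=true" : "image/bmp";
    }
}
```

In Tika itself this would more likely be expressed through the existing detection machinery than as a standalone class; the sketch only illustrates the byte-level test.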