[
https://issues.apache.org/jira/browse/TIKA-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825717#comment-17825717
]
Tim Allison commented on TIKA-4209:
-----------------------------------
I just tested the three files with Tesseract and confirmed that, at the very
least, it does perform OCR on each image within the multi-image TIFFs. This is
good.
We still need to do more on the Tika side, but this is a bit of a relief.
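
For anyone who wants to check how many images a given TIFF actually contains,
here is a minimal sketch that uses only the JDK's javax.imageio API. It is not
what Tika currently does. The class name TiffImageCounter is made up for
illustration, and it assumes Java 9 or later, where a TIFF reader plugin ships
with the JDK. The count is whatever ImageReader.getNumImages(true) reports, so
images stored in SubIFDs may not be included.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical helper, not part of Tika.
public class TiffImageCounter {

    /**
     * Returns the number of images reported by the first ImageIO reader
     * that claims the file, or -1 if no reader is available for it.
     */
    public static int countImages(File file) throws IOException {
        try (ImageInputStream iis = ImageIO.createImageInputStream(file)) {
            if (iis == null) {
                return -1;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return -1;
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis);
                // true = allow the reader to scan the whole stream to count images
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Print "path<TAB>count" for each file given on the command line.
        for (String arg : args) {
            System.out.println(arg + "\t" + countImages(new File(arg)));
        }
    }
}
```

Running it over the sample files should show whether the JDK reader sees more
than one image per file, which is a useful baseline before changing anything
in the parser.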
> Improve handling of multi-image tiffs
> -------------------------------------
>
> Key: TIKA-4209
> URL: https://issues.apache.org/jira/browse/TIKA-4209
> Project: Tika
> Issue Type: New Feature
> Reporter: Tim Allison
> Priority: Major
>
> [~johanvanderknijff] recently published a great post on multi-image TIFFs:
> [https://www.bitsgalore.org/2024/03/11/multi-image-tiffs-subfiles-and-image-file-directories]
> I hadn't worked on TIFF in a while. I tried out a few sample multi-image
> tiffs and found that we are not processing anything beyond the first
> page/image in a TIFF. Even worse, we're not populating our
> "imagereader:NumImages" metadata value for TIFFs.
> It looks like Drew Noakes' metadata-extractor is not yet handling these well:
> [https://github.com/drewnoakes/metadata-extractor/issues/648]
>
> There's an example file on that issue:
> [https://github.com/drewnoakes/metadata-extractor/files/14052854/color-pages-jpg.zip]
> And [~johanvanderknijff] also pointed to TIFFs available here:
> [https://www.leadtools.com/support/forum/posts/t10960-]