[jira] [Commented] (TIKA-4209) Improve handling of multi-image tiffs

Tim Allison (Jira) Tue, 12 Mar 2024 08:31:09 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825717#comment-17825717
 ]


Tim Allison commented on TIKA-4209:
-----------------------------------

I just tested the three files with Tesseract and confirmed, at least, that 
Tesseract does perform OCR on each image within the multi-image TIFFs. This is 
good.

We still need to do more on the Tika side, but this is a bit of a relief.
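
For anyone who wants to check the image count on a sample file independently 
of Tika, here is a minimal sketch using the JDK's javax.imageio API. It 
assumes Java 9 or later, where a TIFF reader plugin ships with the JDK. The 
class name TiffImageCounter and the method countImages are hypothetical names 
for illustration; this is not Tika's code. It reports whatever the ImageIO 
reader counts as images, and it does not verify whether images stored in 
sub-IFDs are included.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

// Hypothetical helper for illustration; not part of Tika.
public class TiffImageCounter {

    // Returns the number of images the first matching ImageIO reader
    // reports for the stream, or -1 if no reader recognizes the data.
    public static int countImages(InputStream stream) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(stream)) {
            if (in == null) {
                return -1;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return -1;
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // allowSearch=true lets the reader scan the whole stream
                // to count the images.
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a path to an image file.
        for (String path : args) {
            try (InputStream is = new FileInputStream(path)) {
                System.out.println(path + ": " + countImages(is));
            }
        }
    }
}
```

Run it with one or more file paths as arguments. A result greater than 1 for 
a multi-image TIFF would suggest the count is available at the ImageIO level.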

> Improve handling of multi-image tiffs
> -------------------------------------
>
>                 Key: TIKA-4209
>                 URL: https://issues.apache.org/jira/browse/TIKA-4209
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>
> [~johanvanderknijff] recently published a great post on multi-image TIFFs: 
> [https://www.bitsgalore.org/2024/03/11/multi-image-tiffs-subfiles-and-image-file-directories]
> I hadn't worked on TIFF in a while. I tried out a few sample multi-image 
> tiffs and found that we are not processing anything beyond the first 
> page/image in a TIFF. Even worse, we're not populating our 
> "imagereader:NumImages" metadata value for TIFFs.
> It looks like Drew Noakes' metadata-extractor is not yet handling these well: 
> [https://github.com/drewnoakes/metadata-extractor/issues/648]
>  
> There's an example file on that issue: 
> [https://github.com/drewnoakes/metadata-extractor/files/14052854/color-pages-jpg.zip]
> And [~johanvanderknijff] also pointed to TIFFs available here: 
> [https://www.leadtools.com/support/forum/posts/t10960-]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
