[jira] [Commented] (TIKA-4432) Issue with EMF Parser Merging Header, Page Number, and First Content Word During Extraction

Tim Allison (Jira) Wed, 04 Jun 2025 05:38:26 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956099#comment-17956099
 ]


Tim Allison commented on TIKA-4432:
-----------------------------------

The challenge with this EMF is that the reference point for every text chunk 
is 0,0, and the bounds are all 0,0,0,0. 

This image is rendered correctly in at least LibreOffice's Draw, and, I'd 
guess, the Microsoft tools as well.

There are bounds that we could use in the "boundsIgnored" field within the 
emftext, but the spec says to ignore that info. There's also location info in 
the modifyWorldTransform records that come before the exttextoutw records.
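
As a rough illustration of why the transform records carry position info: an 
EMF XFORM maps (x, y) to (x*eM11 + y*eM21 + eDx, x*eM12 + y*eM22 + eDy), so a 
0,0 reference lands on the translation (eDx, eDy). The sketch below only 
applies a single transform to a point. The class and method names are 
hypothetical and not Tika's API, and it assumes the transform has already been 
combined with the current world transform according to the record's mode.

```java
// Hypothetical helper, not part of Tika: applies one EMF XFORM to a point.
final class XformSketch {

    // Returns {x', y'} using the XFORM field names eM11, eM12, eM21, eM22, eDx, eDy.
    static double[] apply(double eM11, double eM12, double eM21, double eM22,
                          double eDx, double eDy, double x, double y) {
        double outX = x * eM11 + y * eM21 + eDx;
        double outY = x * eM12 + y * eM22 + eDy;
        return new double[] {outX, outY};
    }
}
```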

Maybe fall back to the boundsIgnored if the reference and bounds are all zeroes?
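
A minimal sketch of that fallback, assuming each text record exposes its 
reference point, its bounds, and the boundsIgnored rectangle as plain 
left/top/right/bottom integers. The class, the Box type, and the method name 
are hypothetical and are not the parser's actual code.

```java
// Hypothetical sketch of the proposed fallback, not Tika's implementation.
final class EmfBoundsFallback {

    // Simple left/top/right/bottom rectangle.
    static final class Box {
        final int left, top, right, bottom;

        Box(int left, int top, int right, int bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        boolean isAllZero() {
            return left == 0 && top == 0 && right == 0 && bottom == 0;
        }
    }

    // Uses boundsIgnored only when the reference and bounds give no position
    // at all and boundsIgnored actually carries something.
    static Box effectiveBounds(int refX, int refY, Box bounds, Box boundsIgnored) {
        boolean noPosition = refX == 0 && refY == 0 && bounds.isAllZero();
        if (noPosition && !boundsIgnored.isAllZero()) {
            return boundsIgnored;
        }
        return bounds;
    }
}
```

Keeping the normal path untouched means files with usable bounds would behave 
as before, and only records like the ones in this file would take the 
fallback.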

> Issue with EMF Parser Merging Header, Page Number, and First Content Word 
> During Extraction
> -------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4432
>                 URL: https://issues.apache.org/jira/browse/TIKA-4432
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Aashish Tudu
>            Priority: Major
>         Attachments: 102 emffile.EMF, 102 emffile.EMF_extracted_with_fix.jpg, 
> 102 emffile.EMF_extracted_with_fix.txt, 102 
> emffile.EMF_extracted_with_issue.jpg, 102 emffile.EMF_extracted_with_issue.txt
>
>
> It appears that the EMF parser is incorrectly combining multiple elements 
> during text extraction. Specifically, information from the document header, 
> the page number, and the first word of the main content are being merged into 
> a single word.
> Also attaching the extracted content in txt file and original emf file.
> !102 emffile.EMF_extracted_with_issue.jpg|width=794,height=190!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
