[jira] [Commented] (TIKA-4432) Issue with EMF Parser Merging Header, Page Number, and First Content Word During Extraction

ASF GitHub Bot (Jira) Wed, 04 Jun 2025 04:36:05 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956084#comment-17956084
 ]


ASF GitHub Bot commented on TIKA-4432:
--------------------------------------

aashishtudu opened a new pull request, #2241:
URL: https://github.com/apache/tika/pull/2241

   The EMF parser incorrectly combines multiple elements during text 
extraction. Specifically, text from the document header, the page number, and 
the first word of the main content is merged into a single word.
   
   The issue stems from how text fragments (EmfExtTextOutW records) are 
appended to a buffer. When a fragment does not end with a trailing space, the 
next fragment is concatenated directly to it with no separator. The logic in 
this PR ensures that if a text fragment does not end with a space, an explicit 
space is added between it and the next fragment.
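
   The rule described above can be sketched roughly as follows. This is a 
minimal illustration, not the code from the PR: the FragmentJoiner class and 
its join method are hypothetical names, and the fragments are passed in as 
plain strings rather than read from real EmfExtTextOutW records.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or the PR.
class FragmentJoiner {

    // Appends each fragment to a buffer, inserting a single space whenever
    // the text already in the buffer does not end with a space.
    static String join(List<String> fragments) {
        StringBuilder buffer = new StringBuilder();
        for (String fragment : fragments) {
            if (fragment == null || fragment.isEmpty()) {
                continue; // nothing to append
            }
            if (buffer.length() > 0 && buffer.charAt(buffer.length() - 1) != ' ') {
                buffer.append(' '); // explicit separator between fragments
            }
            buffer.append(fragment);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Without the separator these would come out as "Header1First".
        System.out.println(join(List.of("Header", "1", "First")));
    }
}
```

   A fragment that already ends with a space is left alone, so no double 
space is introduced in that case.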




> Issue with EMF Parser Merging Header, Page Number, and First Content Word 
> During Extraction
> -------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4432
>                 URL: https://issues.apache.org/jira/browse/TIKA-4432
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Aashish Tudu
>            Priority: Major
>         Attachments: 102 emffile.EMF,
>                      102 emffile.EMF_extracted_with_issue.jpg,
>                      102 emffile.EMF_extracted_with_issue.txt
>
>
> The EMF parser incorrectly combines multiple elements during text 
> extraction. Specifically, text from the document header, the page number, 
> and the first word of the main content is merged into a single word.
> The extracted content (as a .txt file) and the original EMF file are 
> attached.
> !102 emffile.EMF_extracted_with_issue.jpg|width=794,height=190!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
