Aashish Tudu created TIKA-4432:
----------------------------------

             Summary: Issue with EMF Parser Merging Header, Page Number, and 
First Content Word During Extraction
                 Key: TIKA-4432
                 URL: https://issues.apache.org/jira/browse/TIKA-4432
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Aashish Tudu
         Attachments: 102 emffile.EMF, 102 emffile.EMF_extracted_with_issue.txt

The EMF parser appears to combine separate elements incorrectly during text 
extraction. Specifically, text from the document header, the page number, and 
the first word of the main content is merged into a single word in the 
extracted output.

I have attached the original EMF file and a text file containing the extracted 
content that shows the issue.
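
To make the reported symptom concrete, here is a minimal sketch in plain Java. It does not use Tika and is not the EMF parser's actual code. It assumes, purely for illustration, that the header, page number, and body text arrive as separate fragments. The class name, method names, and sample strings are all hypothetical.

```java
import java.util.List;

// Hypothetical illustration only: this is NOT Tika's EMF parser code.
public class EmfTextJoinSketch {

    // Appends fragments back to back with no delimiter, which produces
    // the kind of merged word described in this report.
    static String joinRaw(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : fragments) {
            sb.append(fragment);
        }
        return sb.toString();
    }

    // Keeps each fragment on its own line so header, page number,
    // and body text stay distinguishable.
    static String joinSeparated(List<String> fragments) {
        return String.join("\n", fragments);
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for header, page number, first word.
        List<String> fragments = List.of("Quarterly Report", "3", "Introduction");
        System.out.println(joinRaw(fragments));
        System.out.println(joinSeparated(fragments));
    }
}
```

The first call prints the three fragments run together as one token, which matches the shape of the merged output in the attached text file. The second shows the separated output one would expect instead. Whether the real cause is a missing delimiter between text records is an assumption that needs checking against the parser source.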

!102 emffile.EMF_extracted_with_issue.jpg|width=794,height=190!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
