Aashish Tudu created TIKA-4432:
----------------------------------

             Summary: Issue with EMF Parser Merging Header, Page Number, and First Content Word During Extraction
                 Key: TIKA-4432
                 URL: https://issues.apache.org/jira/browse/TIKA-4432
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Aashish Tudu
         Attachments: 102 emffile.EMF, 102 emffile.EMF_extracted_with_issue.txt

It appears that the EMF parser incorrectly combines multiple elements during text extraction. Specifically, text from the document header, the page number, and the first word of the main content is merged into a single word.

I am attaching the extracted content as a text file, along with the original EMF file.

!102 emffile.EMF_extracted_with_issue.jpg|width=794,height=190!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)