[ https://issues.apache.org/jira/browse/TIKA-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980493#comment-17980493 ]

Hudson commented on TIKA-4432:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #2079 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/2079/])
TIKA-4432 (#2255) (tallison: [https://github.com/apache/tika/commit/c6bf92dce2f250b560634cf95bdb2fa6697f5b13])
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testEMF_zero_coords.emf
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/EMFParserTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/EMFParser.java


> Issue with EMF Parser Merging Header, Page Number, and First Content Word
> During Extraction
> -------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4432
>                 URL: https://issues.apache.org/jira/browse/TIKA-4432
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Aashish Tudu
>            Priority: Major
>             Fix For: 4.0.0, 3.2.1
>
>         Attachments: 102 emffile.EMF, 102 emffile.EMF_extracted_with_fix.jpg,
> 102 emffile.EMF_extracted_with_fix.txt, 102
> emffile.EMF_extracted_with_issue.jpg, 102 emffile.EMF_extracted_with_issue.txt
>
>
> It appears that the EMF parser is incorrectly combining multiple elements
> during text extraction. Specifically, information from the document header,
> the page number, and the first word of the main content are being merged into
> a single word.
> Also attaching the extracted content in txt file and original emf file.
> !102 emffile.EMF_extracted_with_issue.jpg|width=794,height=190!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)