aashishtudu opened a new pull request, #2241: URL: https://github.com/apache/tika/pull/2241
The EMF parser incorrectly merges separate elements during text extraction. Specifically, the document header, the page number, and the first word of the main content end up joined into a single word. The issue stems from how text fragments (EmfExtTextOutW records) are appended to a buffer: when a fragment does not end with a trailing space, the next fragment is concatenated directly after it with no separator. The logic in this PR ensures that if a text fragment does not end with a space, an explicit space is added between it and the next fragment.
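The separator logic described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from the PR or from Tika's EMF parser: the class name `FragmentJoiner` and the methods `appendFragment` and `join` are hypothetical, and the sketch treats any trailing whitespace character (not only a literal space) as an existing separator, which is an assumption.

```java
import java.util.List;

// Hypothetical helper for illustration only; not Tika's actual parser code.
final class FragmentJoiner {
    private FragmentJoiner() {}

    // Appends one text fragment to the buffer. If the buffer already holds
    // text that does not end in whitespace, a single space is inserted first
    // so that adjacent fragments are not fused into one word.
    static void appendFragment(StringBuilder buffer, String fragment) {
        if (fragment == null || fragment.isEmpty()) {
            return; // nothing to append
        }
        int len = buffer.length();
        if (len > 0 && !Character.isWhitespace(buffer.charAt(len - 1))) {
            buffer.append(' ');
        }
        buffer.append(fragment);
    }

    // Joins a sequence of fragments (standing in for the text of successive
    // EmfExtTextOutW records) using the rule above.
    static String join(List<String> fragments) {
        StringBuilder buffer = new StringBuilder();
        for (String fragment : fragments) {
            appendFragment(buffer, fragment);
        }
        return buffer.toString();
    }
}
```

With this rule, the fragments "Header", "1", and "Introduction" would come out as "Header 1 Introduction" rather than "Header1Introduction", while a fragment that already ends in a space, such as "Header ", does not receive a second one. Note that the sketch checks only the end of the buffer, so a following fragment that itself begins with a space would still produce two spaces in a row.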