https://issues.apache.org/bugzilla/show_bug.cgi?id=47742
Summary: The text extracted by WordExtractor is broken
Product: POI
Version: 3.5-dev
Platform: PC
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
AssignedTo: [email protected]
ReportedBy: [email protected]
--- Comment #0 from [email protected] 2009-08-26 07:42:12 PDT ---
Created an attachment (id=24169)
this JUnit3 test reproduces the bug, i.e. this test fails
We used the WordExtractor class to extract text from the attached Word
document.
Unfortunately, the extracted text differs from the text seen in the Word
document.
More precisely, some paragraphs appear twice and some text appears in the
wrong position.
We tried to narrow the error down to a specific part of the document, but we
could not identify the part that causes it. It looks as if the length of the
text or certain Unicode characters trigger the error, but this is only a guess.
We attach a JUnit test case that reproduces the bug.
ExtractTextFromWordDocumentTest.java - the JUnit 3 test case
test.doc - the MS Word document from which we cannot extract the text properly
test-EXTRACTED-BY-POI-WordExtractor.txt - the text extracted by POI
test-SAVED-BY-MS-WORD.txt - the text as it is recognized by MS Word
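
To help locate the duplicated paragraphs, the two text files above could be compared paragraph by paragraph. The following is only a rough sketch using the Java standard library, not part of the attached test case. The class name ParagraphDiff and its methods are hypothetical, and it assumes one paragraph per line. The "extracted" string would typically come from WordExtractor.getText() and the "expected" string from the file saved by MS Word.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of POI): reports paragraphs that occur
// more often in the extracted text than in the reference text.
public class ParagraphDiff {

    /** Splits text into trimmed, non-empty lines (assumes one paragraph per line). */
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n|\\r")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Counts how often each paragraph occurs, preserving first-seen order. */
    static Map<String, Integer> counts(List<String> paragraphs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : paragraphs) {
            counts.merge(p, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns paragraphs that occur more often in extracted than in expected. */
    static List<String> surplusParagraphs(String expected, String extracted) {
        Map<String, Integer> expectedCounts = counts(paragraphs(expected));
        List<String> surplus = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts(paragraphs(extracted)).entrySet()) {
            if (e.getValue() > expectedCounts.getOrDefault(e.getKey(), 0)) {
                surplus.add(e.getKey());
            }
        }
        return surplus;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from test.doc.
        String expected = "Alpha\nBeta\nGamma\n";
        String extracted = "Alpha\nBeta\nAlpha\nGamma\n";
        System.out.println(surplusParagraphs(expected, extracted)); // prints [Alpha]
    }
}
```

This only detects surplus (duplicated or spurious) paragraphs. Text that appears in the wrong position would need an order-aware comparison, which this sketch does not attempt.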