DO NOT REPLY [Bug 47742] New: The text extracted by WordExtractor is broken

bugzilla Wed, 26 Aug 2009 07:42:39 -0700

https://issues.apache.org/bugzilla/show_bug.cgi?id=47742


           Summary: The text extracted by WordExtractor is broken
           Product: POI
           Version: 3.5-dev
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
        AssignedTo: [email protected]
        ReportedBy: [email protected]


--- Comment #0 from [email protected] 2009-08-26 07:42:12 PDT ---
Created an attachment (id=24169)
this JUnit3 test reproduces the bug, i.e. this test fails

We used the WordExtractor class to extract text from the attached Word
document.

Unfortunately, the extracted text differs from the text seen in the Word
document.

More precisely, some paragraphs appear twice and some text appears in the
wrong position.

We tried to trace the error to a specific part of the document but could not
identify the part that causes it. The length of the text or certain Unicode
characters may be responsible, but this is only a guess.


We attach a JUnit test case that reproduces the bug.

  ExtractTextFromWordDocumentTest.java - the JUnit3 test case
  test.doc - the MS Word document from which we cannot extract the text properly
  test-EXTRACTED-BY-POI-WordExtractor.txt - the text extracted by POI
  test-SAVED-BY-MS-WORD.txt - the text as it is recognized by MS Word
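
A minimal sketch of how the two text files listed above could be compared to
locate the repeated paragraphs and the first point of divergence. It is not
part of the attached test case: the class name ExtractionDiff and its methods
are hypothetical, it uses only the Java standard library (no POI calls), and
it assumes both texts hold one paragraph per line and have already been read
into strings with the correct encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of POI or of the attached JUnit test.
final class ExtractionDiff {

    // Split text into trimmed, non-empty paragraphs.
    // Assumption: one paragraph per line; \r\n, \r and \n are all accepted.
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r\\n|\\r|\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Count how often each paragraph occurs, keeping first-seen order.
    static Map<String, Integer> counts(List<String> paragraphs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : paragraphs) {
            counts.merge(p, 1, Integer::sum);
        }
        return counts;
    }

    // Paragraphs that exist in the reference text but occur more often
    // in the extracted text, i.e. candidates for "appears twice".
    static List<String> duplicatedParagraphs(String extracted, String reference) {
        Map<String, Integer> ref = counts(paragraphs(reference));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts(paragraphs(extracted)).entrySet()) {
            Integer expected = ref.get(e.getKey());
            if (expected != null && e.getValue() > expected) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Index of the first paragraph at which the two texts differ,
    // or -1 if both contain the same paragraphs in the same order.
    static int firstMismatch(String extracted, String reference) {
        List<String> a = paragraphs(extracted);
        List<String> b = paragraphs(reference);
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : shared;
    }
}
```

Called with the contents of test-EXTRACTED-BY-POI-WordExtractor.txt as the
first argument and test-SAVED-BY-MS-WORD.txt as the second,
duplicatedParagraphs would list the paragraphs that POI emits more often than
Word does, and firstMismatch would give the paragraph index where the two
texts first diverge.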

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

