https://bz.apache.org/bugzilla/show_bug.cgi?id=60471

            Bug ID: 60471
           Summary: Not loading AlternateContent in XWPF
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: dev@poi.apache.org
          Reporter: talli...@mitre.org
  Target Milestone: ---

Created attachment 34522
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34522&action=edit
triggering file based on testWORD_2006ml.docx in Tika

XWPFDocument's onDocumentLoad() looks for paragraphs, tables and SDTs only at
the main level of the body.  As we saw with Bug 54849 (SDTs), there can be other
intervening structures between the body and text-containing elements.

I recently noticed that AlternateContent elements can also appear at the body
level, and we should probably add those to our document model.

To create this test file, I added a title page via Word's default "add a title
page" function.

In the SAX parser that I added to Tika, I chose to extract text from the
Fallback section on the theory that it would have the more easily parsed
content.  If we're modeling read/write in our DOM/XWPFDocument, we'll probably
want to point to both Fallback and Choice?
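
To make the Fallback-only idea concrete, here is a minimal sketch that uses
only the JDK's DOM parser, not POI's XWPF model and not the Tika SAX parser
mentioned above.  The class name AlternateContentSketch and the method
fallbackText are hypothetical, invented for illustration.  It assumes the
AlternateContent elements of interest are direct children of w:body and that
the text sits in w:t elements somewhere under mc:Fallback; it ignores mc:Choice
entirely.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of POI or Tika.
final class AlternateContentSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /** One entry per body-level mc:Fallback: the concatenated w:t text under it. */
    static List<String> fallbackText(String documentXml) {
        Document dom;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            dom = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }
        List<String> result = new ArrayList<>();
        NodeList bodies = dom.getElementsByTagNameNS(W_NS, "body");
        if (bodies.getLength() == 0) {
            return result;
        }
        // Only direct children of the body are inspected (assumption, see lead-in).
        for (Element ac : children(bodies.item(0), MC_NS, "AlternateContent")) {
            for (Element fallback : children(ac, MC_NS, "Fallback")) {
                StringBuilder text = new StringBuilder();
                NodeList ts = fallback.getElementsByTagNameNS(W_NS, "t");
                for (int i = 0; i < ts.getLength(); i++) {
                    text.append(ts.item(i).getTextContent());
                }
                result.add(text.toString());
            }
        }
        return result;
    }

    /** Direct child elements of parent with the given namespace and local name. */
    private static List<Element> children(Node parent, String ns, String localName) {
        List<Element> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && ns.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                out.add((Element) n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\" xmlns:mc=\"" + MC_NS + "\">"
                + "<w:body><mc:AlternateContent>"
                + "<mc:Choice Requires=\"wps\"><w:p><w:r><w:t>choice</w:t></w:r></w:p></mc:Choice>"
                + "<mc:Fallback><w:p><w:r><w:t>fallback</w:t></w:r></w:p></mc:Fallback>"
                + "</mc:AlternateContent></w:body></w:document>";
        System.out.println(fallbackText(xml));
    }
}
```

A read/write model would presumably need to keep the Choice branch as well;
the sketch only shows why reading Fallback alone avoids counting the same text
twice.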

Unit test:

    public void testAlternateContent() throws IOException {
        XWPFDocument doc = XWPFTestDataSamples.openSampleDocument("testAlternateContent.docx");
        XWPFWordExtractor extractor = new XWPFWordExtractor(doc);

        String txt = extractor.getText();
        assertContainsSpecificCount("engaging abstract", txt, 1);
        assertContainsSpecificCount("MyDocumentTitle", txt, 1);
        assertContainsSpecificCount("MyDocumentSubtitle", txt, 1);
    }

    private void assertContainsSpecificCount(String needle, String haystack, int expectedCount) {
        int index = haystack.indexOf(needle);
        int found = 0;
        while (index > -1) {
            found++;
            index = haystack.indexOf(needle, index+1);
        }
        assertEquals(expectedCount, found);
    }
