[ https://issues.apache.org/jira/browse/TIKA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2179:
------------------------------
Fix Version/s: 1.15, 2.0
> WordMLParser fails to parse a word xml file
> -------------------------------------------
>
> Key: TIKA-2179
> URL: https://issues.apache.org/jira/browse/TIKA-2179
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14
> Environment: OSX, java 8
> Reporter: Sean Story
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
> Attachments: File5.xml
>
>
> h3. Problem
> I have a sample Word XML file (attached as File5.xml) that can be parsed
> neither by OOXMLParser (which throws an exception that was {{Caused by:
> org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied
> data appears to be a raw XML file. Formats such as Office 2003 XML are not
> supported}}) nor by OfficeParser (which throws an exception like
> {{org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data
> appears to be a raw XML file. Formats such as Office 2003 XML are not
> supported}}).
> I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the
> source, built it, and updated my Tika version to 1.14. However, when parsing
> with WordMLParser, the extracted text content is the empty string {{""}},
> but I expected something more like:
> {noformat}
> It means that the guy that you are trading with was reported for a scam
> attempt. As the others mentioned, some of these BOFA could be false.
> What's important is the current trade that you are doing.
> If everything seems to be in order then there is nothing wrong with going
> through with the trade.
> Auti, Sneha (QAPM)
> {noformat}
> h3. Replication
> You can replicate this with the Spock test below:
> {noformat}
> def "display error with WordMLParser"() {
>     setup:
>     // Modify for your path.
>     File input = new File("/Users/sstory/Downloads/File5.xml")
>     Parser parser = new WordMLParser()
>     //Parser parser = new OOXMLParser()
>     //Parser parser = new OfficeParser()
>     org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1)
>     Metadata metadata = new Metadata()
>     ParseContext context = new ParseContext()
>
>     when:
>     parser.parse(input.newInputStream(), textHandler, metadata, context)
>     String result = textHandler.toString()
>
>     then:
>     !result.isEmpty()
>     result.contains("the guy that you are trading with")
>     result.contains("BOFA")
> }
> {noformat}
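The two POI exceptions quoted in the report only say that the input is raw XML; they do not say which XML dialect it is. As a diagnostic sketch (not part of Tika, and not the fix applied for this issue), the JDK-only Java class below reads the root element of a file such as File5.xml and reports whether it looks like a Word 2003 XML document (root `w:wordDocument`) or a Flat OPC package (root `pkg:package`). The class name `WordXmlRootCheck` and its methods are hypothetical, and nothing in this message establishes which of these forms File5.xml actually has.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical diagnostic helper; not a Tika class.
public class WordXmlRootCheck {

    // Namespace of the Word 2003 XML root element, w:wordDocument.
    static final String WORDML_2003_NS =
            "http://schemas.microsoft.com/office/word/2003/wordml";
    // Namespace of the Flat OPC root element, pkg:package.
    static final String FLAT_OPC_NS =
            "http://schemas.microsoft.com/office/2006/xmlPackage";

    // Returns the qualified name of the first element in the stream,
    // skipping the XML declaration, processing instructions, and comments.
    // Returns null if the stream contains no element.
    static QName rootElement(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Avoid resolving DTDs or external entities in untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    // Maps a root element name to a short human-readable label.
    static String describe(QName root) {
        if (root == null) {
            return "no root element";
        }
        if (WORDML_2003_NS.equals(root.getNamespaceURI())
                && "wordDocument".equals(root.getLocalPart())) {
            return "Word 2003 XML (WordML)";
        }
        if (FLAT_OPC_NS.equals(root.getNamespaceURI())
                && "package".equals(root.getLocalPart())) {
            return "Flat OPC package";
        }
        return "other XML: " + root;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java WordXmlRootCheck <file.xml>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(describe(rootElement(in)));
        }
    }
}
```

Running it as `java WordXmlRootCheck File5.xml` prints one label for the attached file. Knowing the root element narrows down whether the empty output comes from a format mismatch or from the parser's handling of a file it is meant to support.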