[ https://issues.apache.org/jira/browse/TIKA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Story updated TIKA-2179:
-----------------------------
    Description: 
h3. Problem
I have a sample Word XML file (attached as File5.xml) that can be parsed by 
neither OOXMLParser (which throws an exception {{Caused by: 
org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied 
data appears to be a raw XML file. Formats such as Office 2003 XML are not 
supported}}) nor OfficeParser (which throws 
{{org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data 
appears to be a raw XML file. Formats such as Office 2003 XML are not 
supported}}).

I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the 
source, built it, and updated my Tika version to 1.14. However, when parsing 
with WordMLParser, the extracted text is the empty string {{""}}, whereas I 
expected something like:
{noformat}
It means that the guy that you are trading with was reported for a scam 
attempt. As the others mentioned, some of these BOFA could be false.
What's important is the current trade that you are doing.
If everything seems to be in order then there is nothing wrong with going 
through with the trade.
Auti, Sneha (QAPM)
{noformat}

h3. Replication
You can reproduce this with the Spock test below:
{noformat}
    def "display error with WordMLParser"(){
        setup:
        File input = new File("/Users/sstory/Downloads/File5.xml") // modify for your path
        Parser parser = new WordMLParser()
        //Parser parser = new OOXMLParser()
        //Parser parser = new OfficeParser()
        org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1)
        Metadata metadata = new Metadata()
        ParseContext context = new ParseContext()
        
        when:
        parser.parse(input.newInputStream(), textHandler, metadata, context)
        String result = textHandler.toString()

        then:
        !result.isEmpty()
        result.contains("the guy that you are trading with")
        result.contains("BOFA")
    }
{noformat}
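
Before running the parser, it may help to confirm which XML dialect the attachment actually uses. The sketch below reads only the root element with the JDK's StAX API and prints its qualified name. It rests on an assumption that this report does not confirm: that WordMLParser targets Word 2003 WordprocessingML, whose root element is {{wordDocument}} in the namespace {{http://schemas.microsoft.com/office/word/2003/wordml}}. The class name {{RootElementCheck}} and the method {{rootElement}} are illustrative names, not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RootElementCheck {

    /** Returns the qualified name (namespace URI + local name) of the root element. */
    static QName rootElement(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities when inspecting untrusted files.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            // Skip processing instructions and comments until the first element.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getName();
                }
            }
            throw new XMLStreamException("no root element found");
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Usage: java RootElementCheck /path/to/File5.xml
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                System.out.println(rootElement(in));
            }
        } else {
            // Hand-written sample in the assumed Word 2003 namespace, for demonstration only.
            String sample = "<?xml version=\"1.0\"?>"
                    + "<?mso-application progid=\"Word.Document\"?>"
                    + "<w:wordDocument xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\"/>";
            InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
            System.out.println(rootElement(in));
        }
    }
}
```

If the printed namespace for File5.xml differs from the one assumed above, the empty output may come from a format mismatch rather than from a defect in the text extraction itself.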

  was:
h3. Problem
I have a sample Word XML file (attached as File5.xml) that can be parsed by 
neither OOXMLParser (which throws an exception {{Caused by: 
org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied 
data appears to be a raw XML file. Formats such as Office 2003 XML are not 
supported}}) nor OfficeParser (which throws 
{{org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data 
appears to be a raw XML file. Formats such as Office 2003 XML are not 
supported}}).

I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the 
source, built it, and updated my Tika version to 1.14. However, when parsing 
with WordMLParser, the extracted text is the empty string {{""}}, whereas I 
expected something like:
{noformat}
It means that the guy that you are trading with was reported for a scam 
attempt. As the others mentioned, some of these BOFA could be false.
What's important is the current trade that you are doing.
If everything seems to be in order then there is nothing wrong with going 
through with the trade.
Auti, Sneha (QAPM)
{noformat}


> WordMLParser fails to parse a word xml file
> -------------------------------------------
>
>                 Key: TIKA-2179
>                 URL: https://issues.apache.org/jira/browse/TIKA-2179
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>         Environment: OSX, java 8
>            Reporter: Sean Story
>            Priority: Minor
>         Attachments: File5.xml
>
>
> h3. Problem
> I have a sample Word XML file (attached as File5.xml) that can be parsed by 
> neither OOXMLParser (which throws an exception {{Caused by: 
> org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied 
> data appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}) nor OfficeParser (which throws 
> {{org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data 
> appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}).
> I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the 
> source, built it, and updated my Tika version to 1.14. However, when parsing 
> with WordMLParser, the extracted text is the empty string {{""}}, whereas I 
> expected something like:
> {noformat}
> It means that the guy that you are trading with was reported for a scam 
> attempt. As the others mentioned, some of these BOFA could be false.
> What's important is the current trade that you are doing.
> If everything seems to be in order then there is nothing wrong with going 
> through with the trade.
> Auti, Sneha (QAPM)
> {noformat}
> h3. Replication
> You can reproduce this with the Spock test below:
> {noformat}
>     def "display error with WordMLParser"(){
>         setup:
>         File input = new File("/Users/sstory/Downloads/File5.xml") // modify for your path
>         Parser parser = new WordMLParser()
>         //Parser parser = new OOXMLParser()
>         //Parser parser = new OfficeParser()
>         org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1)
>         Metadata metadata = new Metadata()
>         ParseContext context = new ParseContext()
>         
>         when:
>         parser.parse(input.newInputStream(), textHandler, metadata, context)
>         String result = textHandler.toString()
>         then:
>         !result.isEmpty()
>         result.contains("the guy that you are trading with")
>         result.contains("BOFA")
>     }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
