[ 
https://issues.apache.org/jira/browse/TIKA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667984#comment-15667984
 ] 

Sean Story commented on TIKA-2179:
----------------------------------

Using XMLParser provides a reasonable workaround, but the output ends up 
looking like: 
{noformat}
                              It means that the guy that you are trading with 
was reported for a scam attempt. As the others mentioned, some of these BO     
FA      could be false.           What's important is the current trade that 
you are doing.           If everything seems to be in order then there is 
nothing wrong with going through with the trade.                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                          Auti, Sneha (QAPM) Auti, Sneha (QAPM) 2 
2016-09-14T06:16:00Z 2016-09-14T06:23:00Z                                       
                                                                                
                                                                                
       Normal.dotm 7 1 44 257 Microsoft Office Word 0 2 1 false Morgan Stanley 
false 300 false false 14.0000
{noformat}

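which is sub-optimal: extra whitespace is scattered throughout the content 
(even inside words, e.g. "BO     FA"), and the document properties (author, 
timestamps, application name) are appended to the body text.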
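
As a stopgap, the whitespace runs in that output could be collapsed after extraction. The following is only a minimal sketch using plain JDK string handling, not anything Tika provides; the class and method names (WhitespaceCleanup, collapse) are hypothetical. It does not rejoin words that were split mid-token (the "BO     FA" case becomes "BO FA"), it discards paragraph breaks, and it does not strip the trailing document properties.

```java
class WhitespaceCleanup {

    /**
     * Collapses every run of whitespace (spaces, tabs, newlines) into a
     * single space and trims the ends. Hypothetical helper, not a Tika API.
     */
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Sample shaped like the XMLParser output quoted above.
        String raw = "        some of these BO     \nFA      could be false.      ";
        System.out.println(collapse(raw));
    }
}
```

This only tidies the symptom; a parser that understands the Word XML structure would still be needed to get clean text.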

> WordMLParser fails to parse a word xml file
> -------------------------------------------
>
>                 Key: TIKA-2179
>                 URL: https://issues.apache.org/jira/browse/TIKA-2179
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>         Environment: OSX, java 8
>            Reporter: Sean Story
>            Priority: Minor
>         Attachments: File5.xml
>
>
> h3. Problem
> I have a sample Word XML file (attached as File5.xml) that can be parsed by 
> neither OOXMLParser (which throws an exception {{Caused by: 
> org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied 
> data appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}) nor OfficeParser (which throws an exception like 
> {{org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data 
> appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}).
> I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the 
> source, built it, and updated my Tika version to 1.14. However, when parsing 
> with WordMLParser, the extracted text content is the empty string {{""}}, 
> whereas I expected something more like:
> {noformat}
> It means that the guy that you are trading with was reported for a scam 
> attempt. As the others mentioned, some of these BOFA could be false.
> What's important is the current trade that you are doing.
> If everything seems to be in order then there is nothing wrong with going 
> through with the trade.
> Auti, Sneha (QAPM)
> {noformat}
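
To illustrate the kind of output expected above, here is a rough sketch that uses only the JDK's SAX parser (not Tika or POI) to pull text out of Word 2003 XML. It assumes the file uses the WordprocessingML 2003 namespace, with text in w:t elements and paragraphs in w:p elements; whether File5.xml is structured exactly this way is an assumption, and the class name WordMlTextSketch is hypothetical. Text from adjacent runs is concatenated without padding, and a newline is emitted at the end of each paragraph.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WordMlTextSketch {

    // Namespace of Word 2003 XML (WordprocessingML 2003).
    static final String W_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    /** Returns the text of all w:t elements, one line per w:p paragraph. */
    static String extract(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                private boolean inText; // true while inside a w:t element

                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    if (W_NS.equals(uri) && "t".equals(local)) inText = true;
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if (!W_NS.equals(uri)) return;
                    if ("t".equals(local)) inText = false;
                    else if ("p".equals(local)) out.append('\n'); // paragraph break
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // Keep only run text; indentation between elements is skipped.
                    if (inText) out.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse Word XML", e);
        }
        return out.toString();
    }
}
```

Because only characters inside w:t are kept, the pretty-printing whitespace between elements and the document-properties section (which lives outside the w: namespace) would not leak into the result under these assumptions.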



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
