[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219386#comment-14219386
 ] 

Milan Zivkovic commented on TIKA-1473:
--------------------------------------

I am not really sure how to clean the sensitive data. If you can help me, I 
would gladly do that. 
If I re-save the document in Word (even without changing anything), I cannot 
reproduce the problem with the newly created document. 

I also thought I could unzip the file, leave the structure as it is, and then 
zip it again. The result is the same: after just unzipping and re-zipping, I 
cannot reproduce the problem with the new document. Maybe I am doing something 
wrong here?
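
One way to clean the text without a full unzip/zip round trip might be to copy the archive entry by entry and overwrite only the text runs. The following is a minimal sketch, not a tested fix: the function name `scrub_docx` and the regular expression are assumptions made for illustration, and it only touches `<w:t>` text in XML parts under `word/`. It keeps the entries in their original physical order, but Python's `zipfile` still writes new headers, so the scrubbed copy may or may not still trigger the exception.

```python
import re
import zipfile

# Matches a WordprocessingML text run: <w:t ...>text</w:t>.
# It does not match other tags that merely start with "w:t", such as <w:tbl>.
TEXT_RUN = re.compile(rb"(<w:t(?:\s[^>]*)?>)([^<]*)(</w:t>)")


def _mask(match):
    # Keep the tags, replace every byte of the text with "x".
    return match.group(1) + b"x" * len(match.group(2)) + match.group(3)


def scrub_docx(src, dst):
    """Copy src to dst, masking document text and keeping entry order."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        # Sort by header offset so entries are written in physical order.
        entries = sorted(zin.infolist(), key=lambda i: i.header_offset)
        for info in entries:
            data = zin.read(info)
            name = info.filename
            if name.startswith("word/") and name.endswith(".xml"):
                data = TEXT_RUN.sub(_mask, data)
            # Reusing the ZipInfo keeps the name, timestamp and
            # compression method of the original entry.
            zout.writestr(info, data)
```

Calling `scrub_docx("original.docx", "scrubbed.docx")` (both file names are placeholders) writes the masked copy. Document properties, comments stored outside `<w:t>`, and embedded files are not cleaned by this sketch and would need to be checked by hand.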

If I run the Linux file command on the original file, I get "Microsoft Word 
2007+". If I unzip and zip it again, I get "Microsoft OOXML" as the output of 
the file command.
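
The different file output suggests that the two archives differ at the container level even though their contents are the same. One possible explanation, which is only an assumption here, is that the entries are stored in a different order after re-zipping. A small sketch to compare the two files follows; the function names `physical_entries` and `diff_entries` are made up for this example.

```python
import zipfile


def physical_entries(path):
    """Return (name, compress_type, flag_bits) per entry, in file order."""
    with zipfile.ZipFile(path) as zf:
        # header_offset is the position of the entry's local header,
        # so sorting by it gives the order the entries appear on disk.
        infos = sorted(zf.infolist(), key=lambda i: i.header_offset)
        return [(i.filename, i.compress_type, i.flag_bits) for i in infos]


def diff_entries(original, rezipped):
    """Return positions where the two archives' entries differ."""
    a = physical_entries(original)
    b = physical_entries(rezipped)
    diffs = []
    for pos in range(max(len(a), len(b))):
        left = a[pos] if pos < len(a) else None
        right = b[pos] if pos < len(b) else None
        if left != right:
            diffs.append((pos, left, right))
    return diffs
```

Running `diff_entries("original.docx", "rezipped.docx")` (placeholder file names) returns an empty list when names, order, compression method and flag bits all match. Any reported difference is a candidate for what the original file has that the re-zipped copy lost.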


> Apache Tika is not working for .docx documents 
> -----------------------------------------------
>
>                 Key: TIKA-1473
>                 URL: https://issues.apache.org/jira/browse/TIKA-1473
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5, 1.6
>            Reporter: Franco Catto
>            Priority: Blocker
>
> I am using Apache Tika 1.6 to read different document files. 
> It reads PDF and old-format .doc files, but when I try to read a .docx file, 
> it gives me the following exception:
> org.apache.tika.exception.TikaException: Failed to close temporary resources 
> at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) 
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) 
> ...
> The resource cannot be closed because it is still being used by the Java 
> process, certainly by the OOXML parser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
