[ 
https://issues.apache.org/jira/browse/TIKA-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Ruszkowski updated TIKA-3700:
------------------------------------
    Attachment:     (was: Testworddocx.docx)

> DefaultZipContainerDetector fails to recognize .docx file
> ---------------------------------------------------------
>
>                 Key: TIKA-3700
>                 URL: https://issues.apache.org/jira/browse/TIKA-3700
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.3.0
>         Environment: Ubuntu + mvn 3.6.3 + java 8
>            Reporter: Michał Ruszkowski
>            Priority: Major
>         Attachments: Testworddocx.docx
>
>
> Hello,
> Recently my team upgraded from Tika 1.x to 2.3 due to a vulnerability, and I 
> noticed a problem with content-based file type detection.
>  * we have a simple test that calls the method 
> {code:java}
> tika.getDetector().detect(tikaInputStream, metadata);{code}
>  * the file that we create the input stream from is placed inside 
> _/test/resources_ and it is a *.docx* file
>  * the detector method DefaultZipContainerDetector.detect() returns 
> application/x-tika-ooxml when we run mvn install
>  * the same test worked with Tika 1.x
>  * we have the dependencies _*tika-core*_ and 
> _*tika-parsers-standard-package*_ in pom.xml
> The strangest part is that the same test runs successfully through 
> IntelliJ's 'Run Test...' button.
>  * I tried setting UTF-8 encoding in Maven's pom.xml as well as passing the 
> parameter -Dfile.encoding=UTF-8 during install, with no success.
>  * I compared the content of the files in both cases (successful test and 
> failed one) and they look almost the same, however in one case the 
> whitespace seems to be larger. I don't know if it can make a difference, but 
> here is example content of the file that is properly detected: 
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&� q(=X�� ��!.���,�_�WF�L8W()���u{code}
>  
> and here is the same line of content from the file that fails (notice the 
> additional whitespace before 'q(='):
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&�  q(=X�� ��!.���,�_�WF�L8W()���u {code}
>  * I just checked and it works fine with Tika 2.2.1
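
The comparison described above was done by eye on binary content rendered as text, which makes a one-byte difference hard to pin down. A byte-level comparison of the two copies may be more reliable. The following is a minimal sketch using only the Java standard library (Java 8 compatible). The class name BinaryDiff and both file paths are hypothetical and not part of the report; which two copies to compare (for example the file under test/resources and whatever copy the Maven run actually reads) is an assumption about the project layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: reports where two files first differ.
public class BinaryDiff {

    /** Returns the offset of the first differing byte, or -1 if both arrays are identical. */
    static int firstMismatch(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // Same prefix: identical if the lengths match, otherwise they differ
        // at the end of the shorter array.
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java BinaryDiff <fileA> <fileB>");
            return;
        }
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        int offset = firstMismatch(a, b);
        if (offset < 0) {
            System.out.println("identical (" + a.length + " bytes)");
        } else {
            System.out.println("sizes: " + a.length + " vs " + b.length + " bytes");
            System.out.println("first difference at byte offset " + offset);
        }
    }
}
```

If the two copies differ in size or content, the file was modified somewhere between the source tree and the test run, and the detector would then be reading a different ZIP container than the one that works in the IDE. If they are identical, the difference more likely lies in the classpath or detector configuration of the two runs.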



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
