[
https://issues.apache.org/jira/browse/TIKA-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michał Ruszkowski updated TIKA-3700:
------------------------------------
Attachment: Testworddocx.docx
> DefaultZipContainerDetector fails to recognize .docx file
> ---------------------------------------------------------
>
> Key: TIKA-3700
> URL: https://issues.apache.org/jira/browse/TIKA-3700
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.3.0
> Environment: Ubuntu + mvn 3.6.3 + java 8
> Reporter: Michał Ruszkowski
> Priority: Major
> Attachments: Testworddocx.docx
>
>
> Hello,
> My team recently upgraded from Tika 1.x to 2.3 because of a vulnerability, and I
> noticed a problem with content-based file type detection.
> * we have a simple test that calls the method
> {code:java}
> tika.getDetector().detect(tikaInputStream, metadata);{code}
> * the file that we create the input stream from is located in
> _/test/resources_ and is a *.docx* file
> * the detector method DefaultZipContainerDetector.detect() returns
> application/x-tika-ooxml instead of the specific .docx type when we run mvn install
> * the same test passed with Tika 1.x
> * our pom.xml declares the dependencies _*tika-core*_ and
> _*tika-parsers-standard-package*_
> The strangest part is that the same test passes when run through
> IntelliJ's 'Run Test...' button.
> * I tried setting UTF-8 encoding in Maven's pom.xml as well as passing
> -Dfile.encoding=UTF-8 to mvn install, with no success.
> * I compared the content of the files in both cases (successful test and failed
> one) and they look almost the same; however, in one case the whitespace seems to
> be wider. I don't know if it makes a difference, but here is example
> content from the file that is properly detected:
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&� q(=X�� ��!.���,�_�WF�L8W()���u{code}
>
> and here is the same line of content from the file that fails (notice the
> additional whitespace before 'q(='):
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&� q(=X�� ��!.���,�_�WF�L8W()���u {code}
> * I just checked and it works fine with Tika 2.2.1
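
Comparing binary content by eye in a text view is unreliable, so a byte-level check of the two copies may help narrow this down. The sketch below is a hypothetical helper, not part of Tika, and it uses only the JDK (Java 8 compatible). It reports the offset of the first differing byte between two files and lists the ZIP entries readable from each. It assumes the two copies are the original under src/test/resources and whatever Maven places under target/test-classes; those paths are assumptions about a standard Maven layout, not something stated in the report. An OOXML package such as a .docx normally contains a `[Content_Types].xml` entry, so if that entry is missing or unreadable in the build copy, the file was probably altered before the detector saw it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical diagnostic helper (not a Tika class): compares two copies of a test resource.
class ResourceCopyCheck {

    /** Returns the offset of the first differing byte, or -1 if the arrays are identical. */
    static long firstDifference(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // One array is a prefix of the other: they differ where the shorter one ends.
        return a.length == b.length ? -1 : common;
    }

    /**
     * Lists the entry names readable from the bytes as a ZIP stream.
     * Returns an empty list if the data does not start with a ZIP local file header;
     * may throw an IOException if the data is damaged part-way through.
     */
    static List<String> zipEntryNames(byte[] data) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Formats the entry list, or the error message if the ZIP stream cannot be read. */
    static String describeZip(byte[] data) {
        try {
            return zipEntryNames(data).toString();
        } catch (IOException e) {
            return "unreadable ZIP: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ResourceCopyCheck <source-file> <build-output-file>");
            return;
        }
        byte[] source = Files.readAllBytes(Paths.get(args[0]));
        byte[] built = Files.readAllBytes(Paths.get(args[1]));
        System.out.println("first differing offset: " + firstDifference(source, built));
        System.out.println("entries in source copy: " + describeZip(source));
        System.out.println("entries in build copy:  " + describeZip(built));
    }
}
```

It could be run with the two file paths as arguments after a failing mvn install. An offset of -1 would mean both copies are byte-identical, in which case the difference between the Maven and IntelliJ runs would have to come from somewhere other than the file itself.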