[ 
https://issues.apache.org/jira/browse/TIKA-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508874#comment-17508874
 ] 

Michał Ruszkowski commented on TIKA-3700:
-----------------------------------------

I'm also not able to reproduce it on your branch.
I spent a few hours on this, copying everything over from our project part by 
part, and this is what I found:
- we have one internal company dependency, let's call it D, that we import in 
pom.xml. If I add it to the pom's dependencies, the failing test appears
- adding all the dependencies/properties from D's pom.xml to our test project 
doesn't cause the test to fail

I'm stuck at this point, but I think we can use 2.2.1 for now, unless you want 
to find an explanation for this :). 
I have no more ideas. I was thinking about some class name duplication between 
packages, but the import statements use fully qualified names, so I really 
don't know. 
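
One way to check the class duplication idea is to ask the class loader for every location that provides a given class. The snippet below is only a rough sketch: the class name `ClasspathDuplicates` and its helper methods are made up for illustration, and it would have to run on the same classpath as the failing test (for example from a throwaway test method). More than one reported location would mean that two artifacts ship the same class.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: lists where a class is loaded from.
public class ClasspathDuplicates {

    /** Converts "a.b.C" to the resource path "a/b/C.class". */
    public static String resourceName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns every location visible to the loader that provides the class. */
    public static List<URL> locationsOf(String className, ClassLoader loader)
            throws IOException {
        return Collections.list(loader.getResources(resourceName(className)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ClasspathDuplicates <fully.qualified.ClassName>");
            return;
        }
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<URL> locations = locationsOf(args[0], loader);
        System.out.println(args[0] + " found in " + locations.size() + " location(s):");
        for (URL location : locations) {
            System.out.println("  " + location);
        }
    }
}
```

The argument would be the fully qualified name of the detector class exactly as it appears in the import statement. Running `mvn dependency:tree` with and without D should show the same information at the artifact level.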

Greetings and have a nice weekend :D

> DefaultZipContainerDetector fails to recognize .docx file
> ---------------------------------------------------------
>
>                 Key: TIKA-3700
>                 URL: https://issues.apache.org/jira/browse/TIKA-3700
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.3.0
>         Environment: Ubuntu + mvn 3.6.3 + java 8
>            Reporter: Michał Ruszkowski
>            Priority: Major
>         Attachments: Testworddocx.docx, Testworddocx2.docx
>
>
> Hello,
> Recently my team upgraded from Tika 1.x to 2.3 due to a vulnerability, and I 
> noticed a problem with content-based file type detection.
>  * we have a simple test that calls the method 
> {code:java}
> tika.getDetector().detect(tikaInputStream, metadata);{code}
>  * the file that we create the inputStream from is placed inside 
> _/test/resources_ and it is a *.docx* file
>  * the detector method DefaultZipContainerDetector.detect() returns 
> application/x-tika-ooxml when we run mvn install
>  * this test worked with Tika 1.x
>  * we have the dependencies _*tika-core*_ and 
> _*tika-parsers-standard-package*_ in pom.xml
> The strangest part is that the same test runs successfully through 
> IntelliJ's 'Run Test...' button.
>  * I tried using UTF-8 encoding in Maven's pom.xml as well as passing the 
> parameter -Dfile.encoding=UTF-8 during install, with no success.
>  * I compared the content of the files in both cases (successful test and 
> failed one) and they look almost the same, however in one case the 
> whitespace seems to be wider. I don't know if it can make a difference, but 
> here is example content of the file that is properly detected: 
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&� q(=X�� ��!.���,�_�WF�L8W()���u{code}
>  
> and here is the same line of content from the file that fails (notice the 
> additional whitespace before 'q(=')
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&�  q(=X�� ��!.���,�_�WF�L8W()���u {code}
>  * I just checked and it works fine with Tika 2.2.1
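
A .docx file is a ZIP archive, so comparing its content by eye in a text view is unreliable. A byte-wise comparison of the two files gives a definite answer about whether they differ and where. The following is a minimal sketch using only the JDK (Java 8 compatible); the class name `FirstByteDifference` is hypothetical, and the two paths are whatever copies of the file are read in the passing and the failing run.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: reports where two files start to differ.
public class FirstByteDifference {

    /** Returns the offset of the first differing byte, or -1 if both arrays are identical. */
    public static long firstDifference(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // Same prefix: identical if the lengths match, otherwise one file is longer.
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: FirstByteDifference <fileA> <fileB>");
            return;
        }
        byte[] first = Files.readAllBytes(Paths.get(args[0]));
        byte[] second = Files.readAllBytes(Paths.get(args[1]));
        long offset = firstDifference(first, second);
        if (offset < 0) {
            System.out.println("Files are byte-for-byte identical (" + first.length + " bytes)");
        } else {
            System.out.println("Files differ at offset " + offset
                    + " (sizes: " + first.length + " vs " + second.length + ")");
        }
    }
}
```

In a standard Maven layout, one candidate pair would be the file under src/test/resources and its copy under target/test-classes, assuming the test reads the copied resource.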



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
