[
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481539#comment-17481539
]
Ajesh commented on TIKA-3656:
-----------------------------
Let me clear the air by adding a few more logs from the application to give a
better idea.
*Scenario - 1*
Sample.docx (Content-type - DOCX, extension - DOCX)
{code:java}
Document name original - sample.docx
Content type -
application/vnd.openxmlformats-officedocument.wordprocessingml.document
Content type detected, ready to convert to pdf -
application/vnd.openxmlformats-officedocument.wordprocessingml.document {code}
*Scenario - 2*
Sample.pdf (Content-type - DOCX, extension - PDF)
{code:java}
Document name original - sample.pdf
Content type - application/zip
10:04:58.139 [http-nio-8080-exec-6] ERROR
com.hiringsteps.ats.applicant.facade.impl.ApplicantFacade - Error :
org.apache.xmlbeans.impl.piccolo.io.FileFormatException: Unsupported file type
- [ application/zip ] {code}
Here we expected the content type to be
{code:java}
application/vnd.openxmlformats-officedocument.wordprocessingml.document {code}
This means that if someone wrongly renames the file extension, we should still
be able to detect the right type by reading the file content.
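To illustrate what "detecting by reading the file content" could look like, here is a minimal sketch using only the JDK. It is not how Tika implements detection. The class {{OoxmlSniffer}} and its methods are hypothetical names made up for this example. It assumes a DOCX is a ZIP container holding {{[Content_Types].xml}} and the conventional main part {{word/document.xml}}. Strictly speaking the main part is located through the package relationships, so treat the name check as a simplification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper (not part of Tika): sniffs a ZIP container for DOCX markers. */
final class OoxmlSniffer {
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String ZIP = "application/zip";
    static final String UNKNOWN = "application/octet-stream";

    /** Classifies by content only; the file name and extension are never consulted. */
    static String detect(InputStream in) throws IOException {
        boolean sawEntry = false;
        boolean sawContentTypes = false;
        boolean sawWordDocument = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                sawEntry = true;
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) sawContentTypes = true;
                // Assumption: the main document part uses its conventional name.
                if (name.equals("word/document.xml")) sawWordDocument = true;
            }
        }
        if (sawContentTypes && sawWordDocument) return DOCX;
        return sawEntry ? ZIP : UNKNOWN;
    }

    /** Builds a small in-memory ZIP with the given entry names (demo data only). */
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docxLike = zipOf("[Content_Types].xml", "word/document.xml");
        byte[] plainZip = zipOf("notes.txt");
        // prints application/vnd.openxmlformats-officedocument.wordprocessingml.document
        System.out.println(detect(new ByteArrayInputStream(docxLike)));
        // prints application/zip
        System.out.println(detect(new ByteArrayInputStream(plainZip)));
    }
}
```

The point of the sketch is only that the answer depends on the bytes, so renaming sample.docx to sample.pdf would not change the result.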
> Tika returns wrong content type for docx types.
> -----------------------------------------------
>
> Key: TIKA-3656
> URL: https://issues.apache.org/jira/browse/TIKA-3656
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.2.0
> Environment: Windows 10, Java 1.8
> Reporter: Ajesh
> Priority: Major
>
> Steps to reproduce
> # Select a DOCX file say example.docx
> # Rename the DOCX file to PDF say example.pdf
> # Use Tika to detect the content type of the example.pdf file
> # Returns application/zip instead of
> application/vnd.openxmlformats-officedocument.wordprocessingml.document