[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

Ajesh (Jira) Mon, 24 Jan 2022 21:02:04 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481539#comment-17481539
 ]


Ajesh commented on TIKA-3656:
-----------------------------

Let me clear the air by adding a few more logs from the application to give a
better idea.

*Scenario - 1*

Sample.docx (Content-type - DOCX, extension - DOCX)
{code:java}
Document name original - sample.docx
Content type - application/vnd.openxmlformats-officedocument.wordprocessingml.document
Content type detected, ready to convert to pdf - application/vnd.openxmlformats-officedocument.wordprocessingml.document
{code}
*Scenario - 2*

Sample.pdf (Content-type - DOCX, extension - PDF)
{code:java}
Document name original - sample.pdf
Content type - application/zip
10:04:58.139 [http-nio-8080-exec-6] ERROR com.hiringsteps.ats.applicant.facade.impl.ApplicantFacade - Error :
org.apache.xmlbeans.impl.piccolo.io.FileFormatException: Unsupported file type - [ application/zip ]
{code}
Here we expect the content type to be
{code:java}
application/vnd.openxmlformats-officedocument.wordprocessingml.document {code}
In other words, if someone renames the file with the wrong extension, Tika
should still be able to detect the correct type by reading the file content.
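
To illustrate what "reading the file content" involves here: a DOCX file is a
ZIP container, so a check that stops at the leading bytes can only report
application/zip. Telling DOCX apart means opening the container and looking at
its [Content_Types].xml entry. The sketch below uses only the JDK and is not
how Tika implements detection; the class name DocxSniffer and the method sniff
are hypothetical, and the substring check is a deliberate simplification of
proper XML parsing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; not part of Tika.
final class DocxSniffer {
    static final String OCTET = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    // Content type a DOCX package declares for its main document part.
    private static final String DOCX_MAIN_PART = DOCX + ".main+xml";

    // Decides the type from the file content; the file name is never consulted.
    static String sniff(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            ZipEntry types = zip.getEntry("[Content_Types].xml");
            if (types == null) {
                return ZIP; // a plain ZIP archive, not an OOXML package
            }
            try (InputStream in = zip.getInputStream(types)) {
                String xml = new String(readAll(in), StandardCharsets.UTF_8);
                // Simplification: substring match instead of parsing the XML.
                return xml.contains(DOCX_MAIN_PART) ? DOCX : ZIP;
            }
        } catch (ZipException e) {
            return OCTET; // not a readable ZIP container
        }
    }

    // Java 8 compatible replacement for InputStream.readAllBytes().
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxSniffer sample.pdf
        System.out.println(sniff(Paths.get(args[0])));
    }
}
```

Because the extension is ignored, a sample.docx renamed to sample.pdf goes down
the same path as the original file. Other OOXML types (XLSX, PPTX) would fall
through to application/zip in this sketch.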

> Tika returns wrong content type for docx types.
> -----------------------------------------------
>
>                 Key: TIKA-3656
>                 URL: https://issues.apache.org/jira/browse/TIKA-3656
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>         Environment: Windows 10, Java 1.8
>            Reporter: Ajesh
>            Priority: Major
>
> Steps to reproduce
>  # Select a DOCX file, say example.docx
>  # Rename the DOCX file to PDF, say example.pdf
>  # Use Tika to detect the content type of the example.pdf file
>  # Returns application/zip instead of
> application/vnd.openxmlformats-officedocument.wordprocessingml.document



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
