[ 
https://issues.apache.org/jira/browse/TIKA-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735875#comment-14735875
 ] 

mungeol heo commented on TIKA-1728:
-----------------------------------

Here is what I know:
The first 30 bytes of a v3 file contain "HWP Document File V3.00 \x1a\1\2\3\4\",
which is used to recognize v3 files.
A v5 file stores "HWP Document File" in the first 32 bytes of its file header.
I wonder whether "HWP Document File" alone can serve as a unique signature for v5 files.
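
To make the uniqueness question concrete, here is a minimal sketch that compares only the byte strings quoted above. It is not Tika's detector and it does not parse the v5 container. The names V3_PREFIX, V5_TEXT and classify_header are hypothetical, and the v3 prefix stops before the control bytes because the quoted signature is cut off in this message.

```python
# Text portion of the v3 signature quoted above. The trailing control bytes
# are omitted on purpose because the quoted string is incomplete.
V3_PREFIX = b"HWP Document File V3.00 "

# Text the comment says is stored in the first 32 bytes of the v5 file header.
V5_TEXT = b"HWP Document File"


def classify_header(header: bytes) -> str:
    """Hypothetical helper: guess the HWP version from raw header bytes."""
    # The v3 check must come first, because V5_TEXT is a prefix of V3_PREFIX.
    if header.startswith(V3_PREFIX):
        return "v3"
    if header.startswith(V5_TEXT):
        return "v5-candidate"
    return "unknown"
```

Because V5_TEXT is itself a prefix of V3_PREFIX, a rule that matches only "HWP Document File" would also accept v3 headers. Under these assumptions the shorter string is not unique to v5, so a v5 rule would need an additional condition or would have to be evaluated after the v3 rule.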

> Detection is not working properly for detecting HWP 5.0 file
> ------------------------------------------------------------
>
>                 Key: TIKA-1728
>                 URL: https://issues.apache.org/jira/browse/TIKA-1728
>             Project: Tika
>          Issue Type: Bug
>         Environment: OS: windows 7 and centos 6
> Java: 1.7
> Tika jar: tika-app-1.10.jar
> File: HWP 5.0
>            Reporter: mungeol heo
>         Attachments: HWP-document-file-formats-3.0-Korean.pdf, 
> HWP-document-file-formats-5.0-Korean.pdf, error-message.png, test_3.0.hwp, 
> test_5.0.hwp
>
>
> HWP files come in two formats: HWP 3.0 and HWP 5.0.
> 'tika-app-1.10.jar' detects HWP 3.0 files correctly,
> but not HWP 5.0 files.
> The commands used and the results returned are shown below.
> > java -jar tika-app-1.10.jar --detect test_3.0.hwp
> > application/x-hwp
> > java -jar tika-app-1.10.jar --detect test_5.0.hwp
> > application/x-tika-msoffice



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
