[ 
https://issues.apache.org/jira/browse/TIKA-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740625#comment-14740625
 ] 

Tim Allison commented on TIKA-1731:
-----------------------------------

Based on only a very cursory look at the examples and specs you sent, I'd say:
# HWP 3.0 = HWP 3.0 (it appears to be its own binary format. It might be 
derived from something else; I just don't know.)
# HWP 5.0 ~ (is kind of like) .doc: it uses the same general underlying file 
structure as .doc (OLE), but it does some dramatically different things.
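
For anyone who wants to check the OLE claim on their own sample files, here is a rough sketch. It only tests whether a file starts with the 8-byte OLE2 (compound file) signature; it says nothing about whether the contents are HWP 5.0. The class and method names (Ole2Check, headerIsOle2, fileIsOle2) are made up for illustration and are not part of Tika or java-hwp.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not a Tika class: checks for the OLE2 container signature.
class Ole2Check {
    // First 8 bytes of every OLE2 / Compound File Binary document (.doc, .xls, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the given bytes begin with the OLE2 signature.
    static boolean headerIsOle2(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads only the first 8 bytes of the file and checks them (needs Java 11+).
    static boolean fileIsOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return headerIsOle2(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2Check <file>");
            return;
        }
        System.out.println(args[0] + " starts with OLE2 signature: "
            + fileIsOle2(Paths.get(args[0])));
    }
}
```

If this prints false for an HWP 3.0 sample and true for an HWP 5.0 sample, that would be consistent with the two guesses above, but it is only a first sanity check.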

If you can figure out how to generate the equivalent of a .docx from hwp, it'd 
be useful to see if Tika can handle that.

To test equivalence with .docx, change the file suffix to .zip and unzip it.  
If it unzips and you see a bunch of XML files, we're on the right track...
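
The same check can be scripted instead of renaming and unzipping by hand. The sketch below is only an illustration using the JDK's java.util.zip (no Tika involved); ZipXmlCheck and xmlEntries are hypothetical names. It opens the file as a ZIP regardless of its suffix and lists the entries whose names end in .xml; an empty result means the file either is not a ZIP or contains no XML parts.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// Hypothetical helper, not a Tika class: "does it unzip, and is there XML inside?"
class ZipXmlCheck {
    // Returns the names of all .xml entries, or an empty list if the file
    // cannot be opened as a ZIP archive. The file suffix does not matter.
    static List<String> xmlEntries(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            return zip.stream()
                      .map(ZipEntry::getName)
                      .filter(name -> name.toLowerCase(Locale.ROOT).endsWith(".xml"))
                      .collect(Collectors.toList());
        } catch (ZipException e) {
            // Not a ZIP container at all (for example, an OLE2 or raw binary file).
            return List.of();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipXmlCheck <file>");
            return;
        }
        List<String> names = xmlEntries(Paths.get(args[0]));
        if (names.isEmpty()) {
            System.out.println("No XML entries found (or not a ZIP file).");
        } else {
            names.forEach(System.out::println);
        }
    }
}
```

A list of XML entry names would suggest a .docx-like package; whether Tika can actually parse it is a separate question.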


> Try to integrate java-hwp into Tika
> -----------------------------------
>
>                 Key: TIKA-1731
>                 URL: https://issues.apache.org/jira/browse/TIKA-1731
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Minor
>
> Now that we have detection working for hwp files, it would be great to add a 
> parser.
> [java-hwp|https://github.com/ddoleye/java-hwp] looks like a promising 
> candidate.  We'd need to ask ddoleye about a potential change in license and 
> then interest in maintenance + pushing to maven.
> Any other candidates?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
