[
https://issues.apache.org/jira/browse/TIKA-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891055#comment-16891055
]
Tim Allison edited comment on TIKA-2916 at 7/23/19 2:12 PM:
------------------------------------------------------------
I found that we had roughly 8,000 hwp files in our updated commoncrawl3
regression corpus. I've tgz'd these and put them here:
http://162.242.228.174/share/hwp_cc3.tgz
These are not all v5, but I thought I'd bundle all of the hwp files together.
I dumped the poifs entries from those files and attached them to this issue.
It looks like there are a bunch of image files under /BinData. I've only begun
to look through the attached .zip file, though, so there may be other items of
interest.
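For anyone who wants a quick tally of what sits under /BinData, here is a
minimal Python sketch. It assumes the dump has been reduced to one POIFS entry
path per line (e.g. /BinData/BIN0001.jpg). That layout, the sample paths, and
the function name tally_bindata_extensions are assumptions for illustration,
not the actual format of the attached dump.

```python
from collections import Counter
import posixpath


def tally_bindata_extensions(entry_paths):
    """Count file extensions of POIFS entries that sit under /BinData.

    Assumes each item is a single entry path such as "/BinData/BIN0001.jpg".
    This is a hypothetical layout, not necessarily that of the attached dump.
    """
    counts = Counter()
    for raw in entry_paths:
        path = raw.strip()
        if not path.startswith("/BinData/"):
            continue  # not a child of the BinData storage
        name = posixpath.basename(path)
        if not name:
            continue  # the storage itself, not a stream inside it
        # Normalise case so ".PNG" and ".png" are counted together.
        ext = posixpath.splitext(name)[1].lower() or "(none)"
        counts[ext] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample paths, for illustration only.
    sample = [
        "/FileHeader",
        "/BinData/BIN0001.jpg",
        "/BinData/BIN0002.PNG",
        "/BinData/BIN0003.jpg",
        "/BodyText/Section0",
    ]
    print(tally_bindata_extensions(sample).most_common())
    # prints [('.jpg', 2), ('.png', 1)]
```

Extensions that turn out not to be images would be the first candidates for
"other items of interest".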
> Extract attachments from HWP v5
> -------------------------------
>
> Key: TIKA-2916
> URL: https://issues.apache.org/jira/browse/TIKA-2916
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: poifs_dump_hwp_table.zip
>
>
> It looks like it is possible to attach images (at least) to hwp files. I don't
> have the application, so I can't tell if these are inline images or "regular"
> attachments. We should process these as we do for other file formats.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)