[
https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915557#action_12915557
]
Geoff Jarrad commented on TIKA-506:
-----------------------------------
Good work on extracting more of .doc and .docx documents!
Interestingly, the OOXMLParser now implements almost all of the (independent)
hacks I recently made in my own version of the parser.
Some HTML normalisations for .docx documents that I used, and that would be
nice to have, were:
<p class="heading_1">...</p> --> <h1>...</h1>
<p class="heading_2">...</p> --> <h2>...</h2>
<p class="hTML_Preformatted">...</p> --> <pre>...</pre>
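A rough sketch of that mapping, in case it helps the discussion. The class and
method names are hypothetical (this is not Tika API), and extending the heading
rule to levels 3 to 6 is my assumption; only heading_1, heading_2 and
hTML_Preformatted come from the list above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: picks the element name to emit
// for a paragraph, given the class attribute the parser put on its <p>.
final class StyleClassNormaliser {
    // Assumption: heading levels 1-6 all follow the "heading_N" pattern.
    private static final Pattern HEADING = Pattern.compile("heading_([1-6])");

    /** Returns "h1".."h6", "pre", or "p" when no rule matches. */
    static String elementFor(String styleClass) {
        if (styleClass == null) {
            return "p";
        }
        Matcher m = HEADING.matcher(styleClass);
        if (m.matches()) {
            return "h" + m.group(1);
        }
        if (styleClass.equals("hTML_Preformatted")) {
            return "pre";
        }
        return "p";
    }
}
```

A content handler could call elementFor() on the class attribute of each p
element and emit the returned name instead, leaving unmatched styles as they
are.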
Also, in some documents I have encountered, a text snippet is hidden from the
parser because Word has wrapped it in entity w:smartTag elements. This text
does not get extracted by the OOXMLParser, but I'm not sure whether the best
fix lies in Tika, POI or Microsoft's OOXML schemas. My own hack was to reparse
the DOM for paragraphs, looking for w:r elements at any depth (including
within w:smartTag), but there must be a better way.
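For illustration, the any-depth lookup can be sketched with the JDK's DOM API
alone. This is not my actual hack and not Tika or POI code; the class and
method names are made up, and it assumes the paragraph XML has already been
parsed with a namespace-aware parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects run text from a w:p element even when the
// runs are nested inside wrappers such as w:smartTag.
final class RunTextCollector {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates the w:t children of every w:r under the paragraph, at any depth. */
    static String paragraphText(Element paragraph) {
        StringBuilder sb = new StringBuilder();
        // getElementsByTagNameNS returns all descendants in document order,
        // so runs inside w:smartTag are found as well.
        NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            // Only direct w:t children, so no text is counted twice.
            for (Node c = runs.item(i).getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE
                        && W_NS.equals(c.getNamespaceURI())
                        && "t".equals(c.getLocalName())) {
                    sb.append(c.getTextContent());
                }
            }
        }
        return sb.toString();
    }

    /** Parses an XML string with namespace awareness switched on. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Passing a w:p element that contains one plain run and one run wrapped in
w:smartTag to paragraphText() returns the text of both runs in document order.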
Finally, for my uses of Tika, it would be nice if font color styles could be
extracted from .doc documents, but I'm not sure whether POI makes these
available.
> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
> Key: TIKA-506
> URL: https://issues.apache.org/jira/browse/TIKA-506
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.7
> Reporter: Nick Burch
> Assignee: Nick Burch
> Fix For: 0.8
>
> Attachments: tika-word11.patch, tika-word12.patch, tika-word6.patch,
> tika-word9.patch
>
>
> There are several parts of the word documents (.doc and .docx) that we don't
> currently extract, but which would be nice to have.
> These include:
> * Hyperlinks
> * Images (img tag referencing the name of the embedded image)
> * Headings (when the default heading styles are used)
> * Style information (when a style other than Default or a body is used on a
> paragraph, mark up the p tag with it)
> I'm proposing to add support for these in the near future.