[ 
https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916101#action_12916101
 ] 

Nick Burch commented on TIKA-506:
---------------------------------

I'm not sure we want to end up with the same HTML as OpenOffice, as personally 
I think it contains too many tags that aren't appropriate for the HTML. (For 
example, my view is that in the HTML, the choice of font is one for the 
people writing the CSS, not something that should be copied blindly from Word.) 
I quite like the clean, semantically meaningful HTML we've now got!

In terms of the colours, I suspect they're available somewhere in the bowels of 
the character and paragraph properties. However, there's no current high-level 
way to get at them AFAIK. You'd probably need to grab a copy of the Word specs, 
figure out which fields hold them, find those in POI, decode them further, and 
finally write a nice user-facing access method for them.

> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
>                 Key: TIKA-506
>                 URL: https://issues.apache.org/jira/browse/TIKA-506
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>             Fix For: 0.8
>
>         Attachments: sample.doc, tika-word11.patch, tika-word12.patch, 
> tika-word6.patch, tika-word9.patch
>
>
> There are several parts of the word documents (.doc and .docx) that we don't 
> currently extract, but which would be nice to have.
> These include:
> * Hyperlinks
> * Images (img tag referencing the name of the embedded image)
> * Headings (when the default heading styles are used)
> * Style information (when a style other than Default or a body is used on a 
> paragraph, mark up the p tag with it)
> I'm proposing to add support for these in the near future
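
As a rough illustration of the heading and style items in the list above, here is a minimal sketch of mapping a paragraph's style name to an opening XHTML tag. Everything in it is an assumption made for illustration: the class `StyleTagSketch`, its method names, the "Heading N" naming pattern, and the rule for skipping default and body styles are all hypothetical, and this is not the code from the attached patches or Tika's actual implementation.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika or POI: turns a Word paragraph
// style name into an opening XHTML tag.
class StyleTagSketch {

    // Assumed convention: the default heading styles are named "Heading 1".."Heading 6".
    static String tagFor(String styleName) {
        if (styleName != null) {
            String s = styleName.trim().toLowerCase(Locale.ROOT);
            if (s.matches("heading [1-6]")) {
                return "h" + s.charAt(s.length() - 1);
            }
        }
        return "p";
    }

    // Returns a class attribute value, or null when the style carries no
    // extra meaning (default/body styles, or headings already expressed by the tag).
    static String classFor(String styleName) {
        if (styleName == null) {
            return null;
        }
        String s = styleName.trim();
        String lower = s.toLowerCase(Locale.ROOT);
        if (s.isEmpty() || lower.equals("default") || lower.contains("body")) {
            return null;
        }
        if (!tagFor(s).equals("p")) {
            return null;
        }
        // Keep only characters that are safe inside an attribute value.
        return s.replaceAll("[^A-Za-z0-9_-]+", "_");
    }

    static String openTag(String styleName) {
        String tag = tagFor(styleName);
        String cls = classFor(styleName);
        return cls == null ? "<" + tag + ">" : "<" + tag + " class=\"" + cls + "\">";
    }

    public static void main(String[] args) {
        System.out.println(openTag("Heading 2"));     // <h2>
        System.out.println(openTag("Intense Quote")); // <p class="Intense_Quote">
        System.out.println(openTag("Default"));       // <p>
    }
}
```

Keeping the style name as a class on the p tag, rather than copying fonts or colours inline, leaves the presentation to whoever writes the CSS.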

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
