[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Nick Burch (JIRA) Fri, 01 Oct 2010 04:40:00 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916858#action_12916858
 ]


Nick Burch commented on TIKA-506:
---------------------------------

In terms of the 3 different formats, I'd suggest you open a new bug so we can 
track it there. It ought to be possible to get fairly close, but it'll 
certainly require someone to spend some time preparing documents and testing 
them...

For the colours, we should probably take the discussion to the poi dev list, as 
it'll almost certainly need some work on POI to expose the information before 
Tika could use it.

> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
>                 Key: TIKA-506
>                 URL: https://issues.apache.org/jira/browse/TIKA-506
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>             Fix For: 0.8
>
>         Attachments: sample.doc, tika-word11.patch, tika-word12.patch, 
> tika-word6.patch, tika-word9.patch
>
>
> There are several parts of the word documents (.doc and .docx) that we don't 
> currently extract, but which would be nice to have.
> These include:
> * Hyperlinks
> * Images (img tag referencing the name of the embedded image)
> * Headings (when the default heading styles are used)
> * Style information (when a style other than Default or a body is used on a 
> paragraph, mark up the p tag with it)
> I'm proposing to add support for these in the near future

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
