Improve doc and docx parsing to include more things
---------------------------------------------------

                 Key: TIKA-506
                 URL: https://issues.apache.org/jira/browse/TIKA-506
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.7
            Reporter: Nick Burch
            Assignee: Nick Burch


There are several parts of Word documents (.doc and .docx) that we don't
currently extract, but which would be nice to have.

These include:
* Hyperlinks
* Images (img tag referencing the name of the embedded image)
* Headings (when the default heading styles are used)
* Style information (when a paragraph uses a style other than Default or a
body style, mark up the p tag with it)
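
As a rough illustration of the heading and style items above, here is a minimal
sketch of how a paragraph's style name might be turned into an opening tag. It
is not the Tika or POI implementation: the class StyleTagSketch, the method
openTag, the treatment of "Default", "Normal" and "Body..." as plain styles,
and the use of a class attribute to carry the style name are all assumptions
made for the example. Word's default heading styles run from "Heading 1" to
"Heading 9", while XHTML only has h1 to h6, so deeper levels are clamped here.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Tika or POI.
final class StyleTagSketch {

    // Matches the default Word heading style names, e.g. "Heading 1".
    private static final Pattern HEADING =
            Pattern.compile("heading\\s*([1-9])", Pattern.CASE_INSENSITIVE);

    private StyleTagSketch() {
    }

    // Returns the opening tag to emit for a paragraph with the given style.
    static String openTag(String styleName) {
        if (styleName == null || styleName.trim().isEmpty()) {
            return "<p>";
        }
        String name = styleName.trim();

        Matcher m = HEADING.matcher(name);
        if (m.matches()) {
            // XHTML stops at h6, so Heading 7-9 are clamped to h6.
            int level = Math.min(Integer.parseInt(m.group(1)), 6);
            return "<h" + level + ">";
        }

        if (isPlainStyle(name)) {
            return "<p>";
        }

        // Any other style is recorded on the p tag (assumed: class attribute).
        return "<p class=\"" + toClassName(name) + "\">";
    }

    // Assumption: these style names carry no extra meaning worth marking up.
    private static boolean isPlainStyle(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.equals("default")
                || lower.equals("normal")
                || lower.startsWith("body");
    }

    // Lower-cases the name and collapses non-alphanumeric runs to "_".
    private static String toClassName(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
    }

    public static void main(String[] args) {
        String[] samples = {"Heading 2", "Default", "Body Text", "Intense Quote"};
        for (String style : samples) {
            System.out.println(style + " -> " + openTag(style));
        }
    }
}
```

Under these assumptions a "Heading 2" paragraph would come out as an h2, a
"Default" or "Body Text" paragraph as a bare p, and an "Intense Quote"
paragraph as a p carrying the style name.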

I'm proposing to add support for these in the near future.

-- 
This message is automatically generated by JIRA.