[ 
https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-506:
----------------------------

    Attachment: tika-word11.patch

New patch (v11) adds support for .doc images and non-nested .doc tables 
(nested tables are going to need some more POI work). Now the only thing that is 
supported for .docx but not .doc is nested tables: in .doc, nested tables 
come out as regular paragraphs.

As part of this, I've had to add a new boolean option to 
EmbeddedDocumentExtractor, so a parser can tell it whether HTML for the 
embedded resource has already been output. In most cases, the current 
behaviour of "no HTML has yet been output" will apply, and 
EmbeddedDocumentExtractor should output HTML as before. In a few cases, 
however, the parser will have done its own markup, so we don't want the 
extra bits.
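
To illustrate the idea, here is a minimal, self-contained sketch. It is not 
the code from the patch: the class EmbeddedMarkupSketch, the method 
handleEmbedded, the flag name outputHtml (true meaning "the extractor should 
write its own wrapper markup") and the div wrapper are all assumptions made 
up for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an embedded-resource handler; not Tika's actual interface. */
final class EmbeddedMarkupSketch {

    /** Records, in order, the markup and content that would be sent to the output. */
    private final List<String> events = new ArrayList<>();

    /**
     * @param name       name of the embedded resource, e.g. an image file name
     * @param outputHtml assumed flag: true if this handler should emit its own
     *                   wrapper markup, false if the calling parser has already
     *                   written markup for the resource
     */
    void handleEmbedded(String name, boolean outputHtml) {
        if (outputHtml) {
            // Default case: no HTML has been output yet, so add the wrapper.
            events.add("<div class=\"embedded\" id=\"" + name + "\">");
        }
        // The embedded content itself is handled in both cases.
        events.add("[content of " + name + "]");
        if (outputHtml) {
            events.add("</div>");
        }
    }

    List<String> events() {
        return events;
    }

    public static void main(String[] args) {
        EmbeddedMarkupSketch sketch = new EmbeddedMarkupSketch();
        sketch.handleEmbedded("attachment.xls", true);  // extractor adds wrapper
        sketch.handleEmbedded("image1.png", false);     // parser already wrote its own tag
        sketch.events().forEach(System.out::println);
    }
}
```

A parser that has already written its own tag for an image would pass false 
here, so only the content is handled and the extra wrapper is skipped.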

The patch needs POI 3.7 beta 3, so it can't be applied until that has been released.

> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
>                 Key: TIKA-506
>                 URL: https://issues.apache.org/jira/browse/TIKA-506
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>         Attachments: tika-word11.patch, tika-word6.patch, tika-word9.patch
>
>
> There are several parts of the word documents (.doc and .docx) that we don't 
> currently extract, but which would be nice to have.
> These include:
> * Hyperlinks
> * Images (img tag referencing the name of the embedded image)
> * Headings (when the default heading styles are used)
> * Style information (when a style other than Default or a body is used on a 
> paragraph, mark up the p tag with it)
> I'm proposing to add support for these in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
