[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Geoff Jarrad (JIRA) Tue, 28 Sep 2010 16:46:57 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915971#action_12915971
 ]


Geoff Jarrad commented on TIKA-506:
-----------------------------------

Brilliant work, Nick! Thanks. The sample.doc runs through Tika like a dream.

Now, do you think it might be feasible to extract font colours?  Or is there 
currently no support from the POI side of things?

It has become crucial in my work on document analysis to be able to determine 
the background colour of a table cell, as well as the foreground colour of text 
(seems odd, I know, but that's how the document originators are encoding some 
information). Currently I am being forced to divert .doc documents to an 
OpenOffice.org service for translation to HTML, then using Tika's HtmlParser to 
decode that into ContentHandler events. Being so close to having a sufficient 
.doc parser native to Tika (courtesy of the great work of yourself and others) 
is both exciting and frustrating!
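
To make the colour requirement concrete, here is a minimal sketch of the kind of handler that could sit at the end of that pipeline. It uses only the JDK's SAX classes, so a plain SAX parser stands in for Tika's HtmlParser. The class name ColourCollector is hypothetical, and it is an assumption that the converted HTML carries cell backgrounds as a bgcolor attribute on td and text colours as a color attribute on font.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records colour attributes seen in SAX events. */
public class ColourCollector extends DefaultHandler {
    private final List<String> colours = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: cell backgrounds arrive as <td bgcolor="...">
        // and text colours as <font color="...">.
        String name = qName.toLowerCase(Locale.ROOT);
        if (name.equals("td") && atts.getValue("bgcolor") != null) {
            colours.add("cell-background=" + atts.getValue("bgcolor"));
        } else if (name.equals("font") && atts.getValue("color") != null) {
            colours.add("text-colour=" + atts.getValue("color"));
        }
    }

    public List<String> getColours() {
        return colours;
    }

    /** Parses well-formed XHTML with the JDK SAX parser and returns the colours found. */
    public static List<String> collect(String xhtml) throws Exception {
        ColourCollector handler = new ColourCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getColours();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><table><tr>"
                + "<td bgcolor=\"#ffff00\"><font color=\"#ff0000\">Flagged</font></td>"
                + "</tr></table></body></html>";
        for (String colour : collect(xhtml)) {
            System.out.println(colour);
        }
    }
}
```

Because ColourCollector is an ordinary ContentHandler, the same object could in principle be handed to a parser that emits XHTML SAX events. Whether the colour attributes actually reach the handler depends on what that parser passes through.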

What are your thoughts? It's actually quite instructive to see what 
HTML OpenOffice.org produces from a Word document, which is why I say the 
OfficeParser is currently so close. Wouldn't it be amazing if, in the future, 
.doc, .docx and .odt versions of the same document were all parsed to the same 
HTML?

> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
>                 Key: TIKA-506
>                 URL: https://issues.apache.org/jira/browse/TIKA-506
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>             Fix For: 0.8
>
>         Attachments: sample.doc, tika-word11.patch, tika-word12.patch, 
> tika-word6.patch, tika-word9.patch
>
>
> There are several parts of the word documents (.doc and .docx) that we don't 
> currently extract, but which would be nice to have.
> These include:
> * Hyperlinks
> * Images (img tag referencing the name of the embedded image)
> * Headings (when the default heading styles are used)
> * Style information (when a style other than Default or a body is used on a 
> paragraph, mark up the p tag with it)
> I'm proposing to add support for these in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
