[ https://issues.apache.org/jira/browse/TIKA-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363026#comment-14363026 ]
Denis Kildishev commented on TIKA-1144:
---------------------------------------
Thank you for the reply. We are still using Tika for .doc documents. There are
some additional improvements to WordDocExtractor (for example, CSS classes are
used in most cases instead of "b", "u", and "i" tags). If you are interested in
those updates, I can prepare a patch against the current Tika version.
Best regards, Denis Kildishev
> Changes in styling mechanism, inner table support and list support for Word
> Extractor
> -------------------------------------------------------------------------------------
>
> Key: TIKA-1144
> URL: https://issues.apache.org/jira/browse/TIKA-1144
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Denis Kildishev
> Priority: Minor
> Attachments: word_style.patch
>
>
> The mechanisms in the current version of POI can be used to support different
> kinds of styling and list handling. At the moment, Tika supports styling of
> separate character runs, but this approach is not ideal and can lead to
> visual glitches in the form of pseudo-spaces.
> Another area is lists. Information about them can already be obtained from
> the POI representation, but this mechanism is not used in the current version
> of the Word Extractor.
> A further problem that can be solved now is inner (nested) tables. It is not
> clearly related to the two problems above, but its solution is based on the
> same mechanism as the solutions to those problems. An example of incorrect
> handling is a file with a table that contains another table in its first
> cell.
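To illustrate the styling point in the description, here is a minimal sketch of
one way to avoid per-run markup: adjacent runs that share the same formatting
are merged into a single span that carries CSS classes instead of "b", "u", and
"i" tags. This is not the code from word_style.patch and does not use the POI
or Tika APIs. The Run class, the RunMerger class, and the class names "b", "i",
and "u" are all hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class RunMerger {

    /** Hypothetical stand-in for a character run: text plus three style flags. */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;
        final boolean underline;

        Run(String text, boolean bold, boolean italic, boolean underline) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
            this.underline = underline;
        }

        /** Space-separated CSS classes (names are assumptions); empty if unstyled. */
        String cssClass() {
            List<String> classes = new ArrayList<>();
            if (bold) classes.add("b");
            if (italic) classes.add("i");
            if (underline) classes.add("u");
            return String.join(" ", classes);
        }
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one span per group of adjacent runs that share the same classes. */
    static String toHtml(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        String open = null; // classes of the current group; null before the first run
        for (Run run : runs) {
            String cls = run.cssClass();
            if (!cls.equals(open)) {
                // Formatting changed: close the previous span, open a new one if styled.
                if (open != null && !open.isEmpty()) out.append("</span>");
                if (!cls.isEmpty()) out.append("<span class=\"").append(cls).append("\">");
                open = cls;
            }
            out.append(escape(run.text));
        }
        if (open != null && !open.isEmpty()) out.append("</span>");
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = List.of(
                new Run("Hello", true, false, false),
                new Run(" ", true, false, false),
                new Run("world", true, false, false),
                new Run("!", false, false, false));
        // Prints: <span class="b">Hello world</span>!
        System.out.println(toHtml(runs));
    }
}
```

In this sketch the three bold runs end up inside one span, so the space between
the words is not wrapped in its own element. Whether that is what removes the
pseudo-spaces mentioned in the description is an assumption, not something the
issue states.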