[
https://issues.apache.org/jira/browse/TIKA-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denis Kildishev updated TIKA-1140:
----------------------------------
Description:
The current version of the Word extractor has access to various table
features, but most of them are unused. Possible improvements include support
for borders, fixed cell widths, and cell spanning. Note that some of these
features are already used in POI's HTML converter, so that code can be reused
in Tika.
The attached patch is one possible solution. Some of its code is based on the
2007 version of the doc format specification (in particular, border type and
color detection), so further changes will probably be needed to support older
formats.
The patch already includes the unit test changes required by the changes in
document structure.
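To illustrate what cell spanning and fixed widths could look like in the
extractor output, here is a minimal sketch. It is not taken from the attached
patch. The Cell class is a hypothetical stand-in for the per-cell merge flags
and width (in twips) of the kind that POI's HWPF TableCell exposes, and
collapseRow and toHtml are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TableSpanSketch {
    /** Hypothetical stand-in for one Word table cell (not a POI class). */
    public static final class Cell {
        final String text;
        final boolean merged;      // cell takes part in a horizontal merge
        final boolean firstMerged; // cell is the first one of that merge
        final int widthTwips;      // fixed width, 1 pt = 20 twips
        public Cell(String text, boolean merged, boolean firstMerged, int widthTwips) {
            this.text = text;
            this.merged = merged;
            this.firstMerged = firstMerged;
            this.widthTwips = widthTwips;
        }
    }

    /** One output cell after merged neighbours have been folded together. */
    public static final class Span {
        public final String text;
        public final int colSpan;
        public final int widthTwips;
        Span(String text, int colSpan, int widthTwips) {
            this.text = text;
            this.colSpan = colSpan;
            this.widthTwips = widthTwips;
        }
    }

    /** Folds continuation cells of a horizontal merge into the preceding cell. */
    public static List<Span> collapseRow(List<Cell> row) {
        List<Span> out = new ArrayList<>();
        for (Cell c : row) {
            boolean continuation = c.merged && !c.firstMerged;
            if (continuation && !out.isEmpty()) {
                // Widen the previous span instead of emitting a new cell.
                Span prev = out.remove(out.size() - 1);
                out.add(new Span(prev.text, prev.colSpan + 1, prev.widthTwips + c.widthTwips));
            } else {
                out.add(new Span(c.text, 1, c.widthTwips));
            }
        }
        return out;
    }

    /** Renders a collapsed row as an HTML table row with colspan and width. */
    public static String toHtml(List<Span> spans) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (Span s : spans) {
            sb.append("<td");
            if (s.colSpan > 1) {
                sb.append(" colspan=\"").append(s.colSpan).append('"');
            }
            // Convert twips to points for the CSS width.
            sb.append(String.format(Locale.ROOT, " style=\"width:%.1fpt\">", s.widthTwips / 20.0));
            sb.append(s.text).append("</td>");
        }
        return sb.append("</tr>").toString();
    }

    public static void main(String[] args) {
        List<Cell> row = List.of(
                new Cell("A", true, true, 1440),
                new Cell("", true, false, 1440),
                new Cell("B", false, false, 720));
        System.out.println(toHtml(collapseRow(row)));
    }
}
```

Vertical merges and border attributes would need similar handling. The sketch
covers only the horizontal case to keep the idea visible.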
was:
The current version of the Word extractor has access to various table
features, but most of them are unused. Possible improvements include support
for borders, fixed cell widths, and cell spanning. Note that some of these
features are already used in POI's HTML converter, so that code can be reused
in Tika.
> Better table representation, cell spanning in Word Extractor
> ------------------------------------------------------------
>
> Key: TIKA-1140
> URL: https://issues.apache.org/jira/browse/TIKA-1140
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Denis Kildishev
> Priority: Minor
> Attachments: word_table.patch
>
>
> The current version of the Word extractor has access to various table
> features, but most of them are unused. Possible improvements include support
> for borders, fixed cell widths, and cell spanning. Note that some of these
> features are already used in POI's HTML converter, so that code can be reused
> in Tika.
> The attached patch is one possible solution. Some of its code is based on the
> 2007 version of the doc format specification (in particular, border type and
> color detection), so further changes will probably be needed to support older
> formats.
> The patch already includes the unit test changes required by the changes in
> document structure.
>