[ 
https://issues.apache.org/jira/browse/TIKA-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362172#comment-14362172
 ] 

Tyler Palsulich commented on TIKA-1140:
---------------------------------------

Did anyone ever get a chance to look at this Doc Parser patch? Sorry, 
[~kildishev]!

> Better table representation, cell spanning in Word Extractor
> ------------------------------------------------------------
>
>                 Key: TIKA-1140
>                 URL: https://issues.apache.org/jira/browse/TIKA-1140
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Denis Kildishev
>            Priority: Minor
>         Attachments: word_table.patch
>
>
> The current version of Word Extractor has access to various table
> features, but most of them are not used. Possible improvements include
> support for borders, fixed cell widths and cell spanning.
> Note that some of these features are already used in the POI version of the
> HTML converter, so that code can be reused in Tika.
> The attached patch is an example of a possible solution. Some of its code is
> based on the 2007 version of the doc format specification (especially the
> border type and color detection), so further improvements may be needed to
> support older formats.
> The patch already includes the unit test changes required by the changes in
> document structure.
>       



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
