[ 
https://issues.apache.org/jira/browse/TIKA-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135122#comment-17135122
 ] 

Cristian Vat commented on TIKA-3008:
------------------------------------

Opened PR with handling for basic use-cases and sample documents for tests.

Where sub/sup can appear:
 * handled in old/new Word format: on individual character run
 * handled in new Word format: on style applied to character run
 * not handled: on ancestor style of style applied to character run

For styles in the old Word format, I have not yet found the necessary 
information in POI.

I did not attempt ancestor styles, since styles can be nested up to 9 levels 
deep. I'm not sure how that should be handled: always checking ancestor styles 
could carry a performance penalty, and a cycle between styles could even cause 
an infinite loop. Any ideas are appreciated.
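
One possible approach to both concerns, as a minimal sketch only: walk the 
parent chain with a depth cap and a visited set, and cache the result per style 
so the walk happens at most once per style id. Everything below 
(StyleVertAlignResolver, Style, VertAlign, MAX_DEPTH) is a hypothetical model 
made up for illustration, not the POI or Tika API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model for illustration; these are NOT POI or Tika classes.
final class StyleVertAlignResolver {

    enum VertAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    /** Hypothetical style: its own setting (null = inherit) and parent id (null = root). */
    static final class Style {
        final String parentId;
        final VertAlign vertAlign;

        Style(String parentId, VertAlign vertAlign) {
            this.parentId = parentId;
            this.vertAlign = vertAlign;
        }
    }

    // Assumed cap, taken from the "up to 9 levels deep" remark above.
    static final int MAX_DEPTH = 9;

    private final Map<String, Style> styles;
    // Cache so each style id is resolved at most once per document.
    private final Map<String, VertAlign> cache = new HashMap<>();

    StyleVertAlignResolver(Map<String, Style> styles) {
        this.styles = styles;
    }

    /** Returns the effective setting for a style, falling back to BASELINE. */
    VertAlign resolve(String styleId) {
        VertAlign cached = cache.get(styleId);
        if (cached != null) {
            return cached;
        }
        Set<String> seen = new HashSet<>();
        VertAlign result = VertAlign.BASELINE;
        String current = styleId;
        int depth = 0;
        // seen.add() returns false on a repeated id, which ends the walk on a cycle.
        while (current != null && depth <= MAX_DEPTH && seen.add(current)) {
            Style style = styles.get(current);
            if (style == null) {
                break; // unknown style id: keep the default
            }
            if (style.vertAlign != null) {
                result = style.vertAlign; // nearest explicit setting wins
                break;
            }
            current = style.parentId;
            depth++;
        }
        cache.put(styleId, result);
        return result;
    }
}
```

With a cache like this, the ancestor walk costs at most a few map lookups per 
distinct style rather than per character run, and a cyclic style chain simply 
resolves to the default.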

> Word Doc/Docx Formatting Extraction - Superscript/Subscript
> -----------------------------------------------------------
>
>                 Key: TIKA-3008
>                 URL: https://issues.apache.org/jira/browse/TIKA-3008
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.23
>            Reporter: Cristian Vat
>            Priority: Major
>
> Word extraction from .doc/.docx doesn't handle superscript/subscript at all.
> This changes the extracted text: character runs that differ only in sup/sub 
> are merged together, because no tags are generated between them.
> This is especially problematic in some legal documents, where getting 
> "according to Art 51" instead of "according to Art 5^1^" completely changes 
> the meaning.
>  
> The problem seems to affect parsing of both the old Word .doc and the OOXML 
> .docx formats. Sub/sup can be set on the character run itself or on the 
> document style assigned to a character run.
>  
> I'm already working on fixes and test documents, and will comment with a 
> work-in-progress branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
