Cristian Vat created TIKA-3008:
----------------------------------
Summary: Word Doc/Docx Formatting Extraction - Superscript/Subscript
Key: TIKA-3008
URL: https://issues.apache.org/jira/browse/TIKA-3008
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.23
Reporter: Cristian Vat
Word extraction from .doc/.docx doesn't handle superscript/subscript at all.
This changes the extracted text itself: character runs that differ only in
sup/sub formatting are merged together, because no tags are generated between
them.
This is especially problematic for some legal documents, where getting
"according to Art 51" instead of "according to Art 5^1^" completely changes
the meaning.
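
A minimal, self-contained sketch of the effect described above. It does not use
the Tika or POI APIs; the Run and VertAlign types and both method names are
hypothetical and exist only for illustration. The first method merges runs
while ignoring alignment, the second shows one possible way to keep the
distinction by emitting <sup>/<sub> tags (no escaping of the run text is done
here).

```java
import java.util.Arrays;
import java.util.List;

class SupSubSketch {
    /** Hypothetical vertical alignment of a character run. */
    enum VertAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    /** Hypothetical, simplified character run: text plus its alignment. */
    static final class Run {
        final String text;
        final VertAlign align;

        Run(String text, VertAlign align) {
            this.text = text;
            this.align = align;
        }
    }

    /** Behaviour as reported: alignment is ignored and runs are merged. */
    static String mergeIgnoringAlignment(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    /** One possible fix: wrap non-baseline runs in sup/sub tags. */
    static String renderWithTags(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            switch (r.align) {
                case SUPERSCRIPT:
                    sb.append("<sup>").append(r.text).append("</sup>");
                    break;
                case SUBSCRIPT:
                    sb.append("<sub>").append(r.text).append("</sub>");
                    break;
                default:
                    sb.append(r.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("according to Art 5", VertAlign.BASELINE),
                new Run("1", VertAlign.SUPERSCRIPT));
        System.out.println(mergeIgnoringAlignment(runs)); // according to Art 51
        System.out.println(renderWithTags(runs));         // according to Art 5<sup>1</sup>
    }
}
```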
The problem seems to affect parsing of both the old Word .doc format and the
OOXML .docx format. Sub/sup can be set directly on a character run or on the
document style assigned to that character run.
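
The two places where sub/sup can be set suggest a lookup along these lines.
This is only a sketch: the Style and Run types, the styles map and
effectiveAlign are hypothetical, not Tika or POI classes, and the rule that a
value set directly on the run takes precedence over the style is an assumption
of this example.

```java
import java.util.HashMap;
import java.util.Map;

class EffectiveVertAlignSketch {
    enum VertAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    /** Hypothetical character style; align is null when the style does not set it. */
    static final class Style {
        final VertAlign align;

        Style(VertAlign align) {
            this.align = align;
        }
    }

    /** Hypothetical run; directAlign and styleId are null when not set. */
    static final class Run {
        final String text;
        final VertAlign directAlign;
        final String styleId;

        Run(String text, VertAlign directAlign, String styleId) {
            this.text = text;
            this.directAlign = directAlign;
            this.styleId = styleId;
        }
    }

    /** Assumption: a value on the run wins, then the assigned style, else baseline. */
    static VertAlign effectiveAlign(Run run, Map<String, Style> styles) {
        if (run.directAlign != null) {
            return run.directAlign;
        }
        Style style = run.styleId == null ? null : styles.get(run.styleId);
        if (style != null && style.align != null) {
            return style.align;
        }
        return VertAlign.BASELINE;
    }

    public static void main(String[] args) {
        Map<String, Style> styles = new HashMap<>();
        styles.put("FootnoteRef", new Style(VertAlign.SUPERSCRIPT)); // hypothetical style id

        Run viaStyle = new Run("1", null, "FootnoteRef");
        Run direct = new Run("2", VertAlign.SUBSCRIPT, null);
        Run plain = new Run("Art 5", null, null);

        System.out.println(effectiveAlign(viaStyle, styles)); // SUPERSCRIPT
        System.out.println(effectiveAlign(direct, styles));   // SUBSCRIPT
        System.out.println(effectiveAlign(plain, styles));    // BASELINE
    }
}
```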
I'm already working on fixes and test documents, and will comment with a link
to the work-in-progress branch.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)