Tim Allison created TIKA-4015:
---------------------------------

             Summary: Extract symbols as symbols from .docx
                 Key: TIKA-4015
                 URL: https://issues.apache.org/jira/browse/TIKA-4015
             Project: Tika
          Issue Type: New Feature
            Reporter: Tim Allison
         Attachments: symbol.docx.zip

[~chetab] raised this issue on the user list and supplied an example document.

The font is Symbol and the text should be: abcedefghijklmnopqrstuvwxyz

However, the text as literally stored in the docx and extracted by Tika is: 
abcedefghijklmnopqrstuvwxyz

We may need to add processing for Unicode mappings or the equivalent in OOXML.
I haven't seen this before. :P
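
A rough sketch of what such a mapping step might look like, assuming the run's font name and its literal text are already available from the parser. SymbolFontMapper and toUnicode are hypothetical names, not existing Tika or POI APIs, and the table covers only a handful of letters for illustration. Characters without an entry pass through unchanged.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or POI.
final class SymbolFontMapper {

    // Partial table: character as stored in a Symbol-font run -> Unicode Greek letter.
    // A real implementation would need the full Symbol encoding table.
    private static final Map<Character, Character> SYMBOL_TO_UNICODE = Map.of(
            'a', '\u03B1', // alpha
            'b', '\u03B2', // beta
            'g', '\u03B3', // gamma
            'd', '\u03B4', // delta
            'p', '\u03C0'  // pi
    );

    private SymbolFontMapper() {
    }

    // Returns runText unchanged unless the run's font is Symbol; otherwise
    // replaces each character that has a table entry with its Unicode equivalent.
    static String toUnicode(String fontName, String runText) {
        if (!"Symbol".equalsIgnoreCase(fontName)) {
            return runText;
        }
        StringBuilder sb = new StringBuilder(runText.length());
        for (int i = 0; i < runText.length(); i++) {
            char c = runText.charAt(i);
            char mapped = SYMBOL_TO_UNICODE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }
}
```

With this sketch, a Symbol-font run containing "abgd" would come out as the Greek letters alpha, beta, gamma, delta, while the same text in any other font would be left as-is.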



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
