Konrad Tendera created TIKA-1094:
------------------------------------

             Summary: Bugged WordExtractor#handleSpecialCharacterRun method
                 Key: TIKA-1094
                 URL: https://issues.apache.org/jira/browse/TIKA-1094
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Konrad Tendera
            Priority: Minor


As the Javadoc says, special character runs are defined as follows:

"Can be \13..text..\15 or \13..control..\14..text..\15"

In practice the structure differs significantly, so that hyperlinks, for
example, aren't parsed properly. Checking with LibreOffice and Microsoft Office,
I found that a paragraph containing a HYPERLINK actually looks like this:

\13 (space here)HYPERLINK "address here" \1 \14 text \15

"\u0001" and "\u0014" are separate character runs.
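
To make the observed layout concrete, here is a minimal sketch of how such a
field could be split into address and display text. It is not Tika's code: the
class name HyperlinkFieldSketch and the method parseHyperlink are hypothetical,
and it assumes the paragraph text has already been flattened into one string,
whereas in the real document the "\u0001" and "\u0014" markers arrive as
separate character runs. Nested fields are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Tika or POI.
class HyperlinkFieldSketch {
    static final char FIELD_BEGIN = '\u0013';     // written \13 in the Javadoc
    static final char FIELD_SEPARATOR = '\u0014'; // written \14
    static final char FIELD_END = '\u0015';       // written \15
    static final char PLACEHOLDER = '\u0001';     // the extra \1 seen before \14

    private static final Pattern HYPERLINK =
            Pattern.compile("HYPERLINK\\s+\"([^\"]*)\"");

    /** Returns {address, text}, or null if no HYPERLINK field is found. */
    static String[] parseHyperlink(String paragraph) {
        int begin = paragraph.indexOf(FIELD_BEGIN);
        int sep = paragraph.indexOf(FIELD_SEPARATOR, begin + 1);
        int end = paragraph.indexOf(FIELD_END, sep + 1);
        if (begin < 0 || sep < 0 || end < 0) {
            return null;
        }
        // Control part: drop the \u0001 placeholder and surrounding spaces.
        String control = paragraph.substring(begin + 1, sep)
                .replace(String.valueOf(PLACEHOLDER), "")
                .trim();
        Matcher m = HYPERLINK.matcher(control);
        if (!m.find()) {
            return null;
        }
        // Text part: everything between the separator and the end marker.
        String text = paragraph.substring(sep + 1, end).trim();
        return new String[] { m.group(1), text };
    }

    public static void main(String[] args) {
        String paragraph =
                "\u0013 HYPERLINK \"http://example.com/\" \u0001 \u0014 text \u0015";
        String[] link = parseHyperlink(paragraph);
        System.out.println(link[0] + " -> " + link[1]);
    }
}
```

The sketch tolerates the leading space after \13 and the \1 before \14, which
are the two details that differ from the pattern quoted from the Javadoc.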

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
