Konrad Tendera created TIKA-1094:
------------------------------------
Summary: Bugged WordExtractor#handleSpecialCharacterRun method
Key: TIKA-1094
URL: https://issues.apache.org/jira/browse/TIKA-1094
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Konrad Tendera
Priority: Minor
As the javadoc says, special character runs are defined as follows:
"Can be \13..text..\15 or \13..control..\14..text..\15"
In practice there are some serious differences, which mean that e.g. hyperlinks
aren't parsed properly. I checked this using LibreOffice and Microsoft Office and
found that a paragraph containing a HYPERLINK field looks more like this:
\13 (space here)HYPERLINK "address here" \1 \14 text \15
"\u0001" and "\u0014" are separate character runs.
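
To make the observed layout concrete, here is a minimal sketch of how such a field could be split into its address and display text. It is not Tika's or POI's code: the class HyperlinkFieldSketch, its parse method and the regular expression are hypothetical names made up for illustration, and it assumes the text of all character runs in the paragraph has already been concatenated into one String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or POI.
final class HyperlinkFieldSketch {
    // Field marker characters, written \13, \14 and \15 in the javadoc.
    static final char FIELD_BEGIN = 0x13;
    static final char FIELD_SEPARATOR = 0x14;
    static final char FIELD_END = 0x15;

    // Assumed shape of the control part: HYPERLINK "address"
    private static final Pattern HYPERLINK =
            Pattern.compile("HYPERLINK\\s+\"([^\"]*)\"");

    /**
     * Returns {address, displayText}, or null when the paragraph does not
     * contain a complete begin/separator/end hyperlink field.
     */
    static String[] parse(String paragraph) {
        int begin = paragraph.indexOf(FIELD_BEGIN);
        int sep = paragraph.indexOf(FIELD_SEPARATOR, begin + 1);
        int end = paragraph.indexOf(FIELD_END, sep + 1);
        if (begin < 0 || sep < 0 || end < 0) {
            return null; // e.g. the plain "begin..text..end" form
        }
        // Control part: leading space, HYPERLINK, quoted address, 0x01.
        String control = paragraph.substring(begin + 1, sep);
        Matcher m = HYPERLINK.matcher(control);
        if (!m.find()) {
            return null; // some other field type
        }
        String text = paragraph.substring(sep + 1, end).trim();
        return new String[] { m.group(1), text };
    }

    public static void main(String[] args) {
        String paragraph = "" + FIELD_BEGIN + " HYPERLINK \"http://example.com\" "
                + (char) 0x01 + FIELD_SEPARATOR + " text " + FIELD_END;
        String[] link = parse(paragraph);
        System.out.println(link[0] + " -> " + link[1]);
    }
}
```

Because the sketch searches for the separator anywhere after the begin marker, it does not depend on the control part starting immediately after \13, nor on \1 and \14 belonging to the same character run as the HYPERLINK text.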