[ https://issues.apache.org/jira/browse/TIKA-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich resolved TIKA-1094.
-----------------------------------
    Resolution: Fixed

Marking as fixed, since the linked files are parsed into the following correct-looking content:

{code}
<body><p>To jest <a href="http://onet.pl/">jakiĆ</a> link.</p>
{code}

> Bugged WordExtractor#handleSpecialCharacterRun method
> -----------------------------------------------------
>
>                 Key: TIKA-1094
>                 URL: https://issues.apache.org/jira/browse/TIKA-1094
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Konrad Tendera
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As the javadoc says, special character runs are defined as follows:
> "Can be \13..text..\15 or \13..control..\14..text..\15"
> In practice there are some serious differences, and as a result hyperlinks,
> for example, aren't parsed properly. I checked this using LibreOffice and
> Microsoft Office and found that a paragraph containing a HYPERLINK looks
> more like this:
> \13 (space here)HYPERLINK "address here" \1 \14 text \15
> "\u0001" and "\u0014" are separate character runs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
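
For illustration, the field layout described in the report (\13 HYPERLINK "address" \1 \14 text \15) can be sketched as a small standalone Java parser. This is not the WordExtractor code and it does not use the POI or Tika APIs: the class name HyperlinkFieldSketch, the Link holder, and parseHyperlink are hypothetical. It also assumes the paragraph has already been flattened into one string, whereas the report notes that "\u0001" and "\u0014" arrive as separate character runs.

```java
import java.util.Optional;

public class HyperlinkFieldSketch {
    // Field marks as named in the report (hex values 13, 14, 15 and 1).
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';
    static final char EMBEDDED_MARK = '\u0001';

    /** Hypothetical holder for the two parts of a hyperlink field. */
    static final class Link {
        final String address;
        final String text;

        Link(String address, String text) {
            this.address = address;
            this.text = text;
        }
    }

    /**
     * Looks for the first begin..control..separator..text..end sequence and
     * returns the link if the control part is a HYPERLINK instruction.
     */
    static Optional<Link> parseHyperlink(String paragraph) {
        int begin = paragraph.indexOf(FIELD_BEGIN);
        if (begin < 0) {
            return Optional.empty();
        }
        int sep = paragraph.indexOf(FIELD_SEPARATOR, begin + 1);
        int end = paragraph.indexOf(FIELD_END, begin + 1);
        if (sep < 0 || end < 0 || sep > end) {
            return Optional.empty();
        }

        // Control part: tolerate the leading space and the trailing mark
        // that the report observed before the separator.
        String control = paragraph.substring(begin + 1, sep)
                .replace(String.valueOf(EMBEDDED_MARK), "")
                .trim();
        if (!control.startsWith("HYPERLINK")) {
            return Optional.empty();
        }

        // The address is the first double-quoted token of the instruction.
        int q1 = control.indexOf('"');
        int q2 = control.indexOf('"', q1 + 1);
        if (q1 < 0 || q2 < 0) {
            return Optional.empty();
        }

        String address = control.substring(q1 + 1, q2);
        String text = paragraph.substring(sep + 1, end).trim();
        return Optional.of(new Link(address, text));
    }

    public static void main(String[] args) {
        String paragraph = "To jest " + FIELD_BEGIN
                + " HYPERLINK \"http://onet.pl/\" " + EMBEDDED_MARK
                + FIELD_SEPARATOR + "link" + FIELD_END + ".";
        parseHyperlink(paragraph).ifPresent(
                l -> System.out.println(l.address + " -> " + l.text));
    }
}
```

A real fix would have to carry this state across character runs rather than search a single string, since each mark may be delivered in its own run.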