[jira] [Commented] (TIKA-1094) Bugged WordExtractor#handleSpecialCharacterRun method

Konrad Tendera (JIRA) Tue, 19 Mar 2013 08:49:19 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606423#comment-13606423
 ]


Konrad Tendera commented on TIKA-1094:
--------------------------------------

Two files with very simple content:
https://docs.google.com/file/d/0B_Y3ynKUuNhJNXlWanJGVGZ1Tm8/edit?usp=sharing
https://docs.google.com/file/d/0B_Y3ynKUuNhJU21ZRmZWR0E2cFE/edit?usp=sharing
                
> Bugged WordExtractor#handleSpecialCharacterRun method
> -----------------------------------------------------
>
>                 Key: TIKA-1094
>                 URL: https://issues.apache.org/jira/browse/TIKA-1094
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Konrad Tendera
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As the javadoc says, special character runs are defined as follows:
> "Can be \13..text..\15 or \13..control..\14..text..\15"
> In practice there are some serious differences, which cause e.g. hyperlinks 
> not to be parsed properly. I checked this using LibreOffice and Microsoft Office 
> and found that a paragraph containing a HYPERLINK looks more like this:
> \13 (space here)HYPERLINK "address here" \1 \14 text \15
> "\u0001" and "\u0014" are separate character runs.
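
The layout described in the report can be illustrated with a minimal sketch. This is not Tika's WordExtractor code; the class FieldRunSketch and the method parseHyperlink are hypothetical names made up for the illustration. It assumes the text of all character runs in the paragraph has already been concatenated into one string, and it does not handle nested fields.

```java
// Hypothetical helper, not part of Tika or POI: it only illustrates the
// field layout from the report: U+0013 instruction U+0014 text U+0015.
final class FieldRunSketch {
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /**
     * Returns {address, displayText}, or null when the paragraph text
     * does not contain a complete HYPERLINK field.
     */
    static String[] parseHyperlink(String paragraph) {
        int begin = paragraph.indexOf(FIELD_BEGIN);
        if (begin < 0) return null;
        int sep = paragraph.indexOf(FIELD_SEPARATOR, begin + 1);
        if (sep < 0) return null;
        int end = paragraph.indexOf(FIELD_END, sep + 1);
        if (end < 0) return null;

        // Instruction part: drop the U+0001 placeholder seen in the sample
        // files and the surrounding whitespace before looking at the keyword.
        String instruction = paragraph.substring(begin + 1, sep)
                .replace("\u0001", "")
                .trim();
        if (!instruction.startsWith("HYPERLINK")) return null;

        // The address is the first double-quoted token of the instruction.
        int firstQuote = instruction.indexOf('"');
        int lastQuote = instruction.indexOf('"', firstQuote + 1);
        if (firstQuote < 0 || lastQuote < 0) return null;

        String address = instruction.substring(firstQuote + 1, lastQuote);
        String text = paragraph.substring(sep + 1, end).trim();
        return new String[] { address, text };
    }
}
```

Because the instruction is trimmed and U+0001 is removed before the keyword check, the leading space and the extra control character mentioned in the report no longer prevent the field from being recognised as a hyperlink.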

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
