[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

Cristian Vat (JIRA) Sat, 06 Aug 2011 16:27:53 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080488#comment-13080488
 ]


Cristian Vat commented on TIKA-683:
-----------------------------------

I managed to take the original file and slim it down to (possibly) the smallest 
test case. See "testUnicodeUCNControlWordCharacterDoubling.rtf" (566 bytes).

The test file contains only one character ( \u5E74 ). I checked it with the 
latest Tika SVN, and the character comes out doubled.

The character is defined both as an RTF Unicode escape ( \uXXXX ) and as two RTF 
charset/font-specific byte escapes ( \'xx ).
The file is correct, since it does specify a Unicode skip count ( \ucN ), but the 
parser does not take it into account.
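
To illustrate the skip rule described above, here is a minimal sketch, not 
Tika's code. The class UcSkipSketch, the method decode and the flag honorUc 
are hypothetical names, and the sample fragment in main is illustrative rather 
than copied from the attachment. It assumes a flat fragment (no group scoping) 
and does not decode byte escapes through a charset; surviving bytes are shown 
as <xx> markers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: decodes a flat RTF fragment, optionally honoring the
// Unicode skip count. Not Tika's implementation.
final class UcSkipSketch {
    // Alternatives, in order: skip-count control word, Unicode control word
    // (signed decimal), hex byte escape, any other single character.
    private static final Pattern TOKEN = Pattern.compile(
        "\\\\uc(\\d+) ?|\\\\u(-?\\d+) ?|\\\\'([0-9a-fA-F]{2})|(.)",
        Pattern.DOTALL);

    static String decode(String rtf, boolean honorUc) {
        StringBuilder out = new StringBuilder();
        int ucSkip = 1;       // RTF default when no skip count was given
        int pendingSkip = 0;  // fallback characters still to be dropped
        Matcher m = TOKEN.matcher(rtf);
        while (m.find()) {
            if (m.group(1) != null) {
                ucSkip = Integer.parseInt(m.group(1));
            } else if (m.group(2) != null) {
                // The value is a signed 16-bit number; mask to get the char.
                out.append((char) (Integer.parseInt(m.group(2)) & 0xFFFF));
                pendingSkip = honorUc ? ucSkip : 0;
            } else if (pendingSkip > 0) {
                pendingSkip--; // drop one fallback byte escape or character
            } else if (m.group(3) != null) {
                // Placeholder only: a real parser would decode this byte
                // with the current font or document charset.
                out.append('<').append(m.group(3)).append('>');
            } else {
                out.append(m.group(4));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Skip count 2, then U+5E74 as decimal 24180, then two fallback bytes.
        String fragment = "\\uc2\\u24180\\'94\\'4e";
        System.out.println(decode(fragment, true));
        System.out.println(decode(fragment, false));
    }
}
```

With honorUc set to true the two fallback byte escapes are dropped, so the 
character appears once. With honorUc set to false they survive next to the 
Unicode character, which is the kind of duplication reported here.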

I checked it only with RTFEditorKit, which parses the file fine.
The doubling is most likely caused by the changes in TIKA-422, which don't take 
the \ucN control word into account and thus emit both versions of the character.
I'll try to look over the code and see what can be done.
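
For anyone who wants to repeat the RTFEditorKit comparison, a minimal sketch 
follows. The class name RtfEditorKitCheck, the helper extractText and the 
tiny inline sample are assumptions made for illustration. To check the 
attachment instead, pass a FileInputStream for it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper: extracts plain text from RTF with the JDK's Swing
// RTFEditorKit, for comparison against another parser's output.
final class RtfEditorKitCheck {
    static String extractText(InputStream rtf)
            throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(rtf, doc, 0);                   // parse RTF into the document
        return doc.getText(0, doc.getLength());  // plain text content
    }

    public static void main(String[] args) throws Exception {
        // Illustrative inline sample; replace with a stream over a real file.
        byte[] sample = "{\\rtf1\\ansi Hello}"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(extractText(new ByteArrayInputStream(sample)));
    }
}
```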

Note on the issue name: the current name isn't very accurate. The doubling could 
also occur with European characters; it all depends on how the RTF generator 
chooses to encode them. A better name would be: "RTFParser doubling 
characters in some RTF files".

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: testRTFJapanese.rtf, 
> testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
