[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080488#comment-13080488 ]
Cristian Vat commented on TIKA-683:
-----------------------------------
I managed to slim the original file down to what is (possibly) the smallest
test case. See "testUnicodeUCNControlWordCharacterDoubling.rtf" (566 bytes).
The test file contains only one character ( \u5E74 ). I checked it against the
latest Tika SVN and the character is doubled in the output.
The character is defined both as an RTF Unicode escape ( \uXXXX ) and as two RTF
charset/font-specific byte escapes ( \'xx ).
The file is valid: it specifies a Unicode skip count ( \ucN ) telling the reader
to drop the byte escapes that follow the Unicode escape, but the parser does not
take it into account.
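To illustrate the skip handling the file relies on, here is a rough sketch.
It is not Tika's code: the class UnicodeSkipSketch, the decode method and the
honorSkip flag are hypothetical names made up for this example. The fragment
in main, including the Shift_JIS fallback bytes 0x94 0x4E, is my assumption
about what such a file might contain. The sketch ignores group scoping and
destinations, and only counts byte escapes and plain characters as skippable.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class UnicodeSkipSketch {

    /** Decodes a flat RTF fragment; groups and destinations are not modelled. */
    static String decode(String rtf, Charset charset, boolean honorSkip) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream(); // bytes from hex escapes
        int ucSkip = 1; // RTF default: one fallback character per Unicode escape
        int toSkip = 0; // fallback characters still to be dropped
        int n = rtf.length();
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '\\' && i + 3 < n && rtf.charAt(i + 1) == '\'') {
                // Hex escape: backslash, apostrophe, two hex digits = one byte.
                int b = Integer.parseInt(rtf.substring(i + 2, i + 4), 16);
                i += 4;
                if (toSkip > 0) {
                    toSkip--;          // fallback byte, already covered by the Unicode escape
                } else {
                    pending.write(b);
                }
                continue;
            }
            // Anything else ends a run of hex-escaped bytes, so decode them now.
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
            if (c == '{' || c == '}') {
                i++;                   // simplification: group scoping is ignored
            } else if (c == '\\') {
                int j = i + 1;
                while (j < n && rtf.charAt(j) >= 'a' && rtf.charAt(j) <= 'z') j++;
                String word = rtf.substring(i + 1, j);
                if (word.isEmpty()) {  // control symbol: drop it
                    i = Math.min(n, i + 2);
                    continue;
                }
                int k = j;             // optional signed decimal parameter
                if (k + 1 < n && rtf.charAt(k) == '-' && Character.isDigit(rtf.charAt(k + 1))) k++;
                while (k < n && Character.isDigit(rtf.charAt(k))) k++;
                boolean hasParam = k > j;
                int param = hasParam ? Integer.parseInt(rtf.substring(j, k)) : 0;
                i = (k < n && rtf.charAt(k) == ' ') ? k + 1 : k; // a space ends the word
                if (word.equals("uc") && hasParam) {
                    ucSkip = param;    // the skip count control word
                } else if (word.equals("u") && hasParam) {
                    // Unicode escape; negative values are 16-bit signed.
                    out.append((char) (param < 0 ? param + 65536 : param));
                    if (honorSkip) toSkip = ucSkip;
                }
            } else {
                i++;
                if (toSkip > 0) toSkip--; else out.append(c);
            }
        }
        out.append(new String(pending.toByteArray(), charset));
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed fragment: skip count 2, U+5E74 in decimal, two fallback bytes.
        String rtf = "\\uc2\\u24180\\'94\\'4e";
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println(decode(rtf, sjis, true));
        System.out.println(decode(rtf, sjis, false));
    }
}
```

With honorSkip set to true the fallback bytes are dropped and the character
should appear once. With it set to false the bytes are decoded as well, which
is the kind of doubling described here.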
I only cross-checked with RTFEditorKit, which parses the file fine.
This is most likely caused by the changes in TIKA-422, which don't take the
\ucN control word into account and thus emit both versions of the character.
I'll try to look over the code and see what can be done.
Note on the issue name: the current name isn't very accurate. The doubling could
also occur with European characters; it all depends on how the RTF generator
chooses to encode them. A better name would be: "RTFParser doubling
characters in some RTF files".
> RTF Parser issues with non european characters
> ----------------------------------------------
>
> Key: TIKA-683
> URL: https://issues.apache.org/jira/browse/TIKA-683
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Nick Burch
> Attachments: testRTFJapanese.rtf,
> testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira