[
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085021#comment-13085021
]
Michael McCandless commented on TIKA-683:
-----------------------------------------
NOTE: I know very little about RTF! So please forgive/correct any
confusions below:
It looks like we need a stack to record the \ucN control chars we've
encountered, at each depth, and we must then skip N ansi chars after
each \uXXXX we see? (Similarly to how we track the charset with
charsetQueue now).
I.e., on seeing \uXXXX (possibly followed by a trailing space, which
does not count toward the skip count), we parse and keep that XXXX
unicode character, re-emitting the \uXXXX in our output data, but then
we remove the following N ansi chars.
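
To make the idea concrete, here is a rough, untested-against-Tika sketch
of the stack approach. The class and method names (UcSkipSketch,
stripFallback) are made up for illustration and are not part of
RTFParser.java. It assumes the default \ucN is 1, that a control word or
control symbol (including \'hh) counts as one skipped char, that CR/LF
do not count, and it always re-emits a single space after \uXXXX as the
delimiter:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UcSkipSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Drops the ansi fallback chars after each \\uN, tracking \\ucN per group. */
    public static String stripFallback(String rtf) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> ucStack = new ArrayDeque<>();
        ucStack.push(1);                 // assumed default when no \ucN was seen
        int pendingSkip = 0;             // fallback chars still to drop
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '{') {
                ucStack.push(ucStack.peek());   // nested group inherits \ucN
                pendingSkip = 0;
                out.append(c);
                i++;
            } else if (c == '}') {
                if (ucStack.size() > 1) ucStack.pop();
                pendingSkip = 0;                // group end resets pending skip
                out.append(c);
                i++;
            } else if (c == '\\' && i + 1 < rtf.length()) {
                int start = i;
                i++;
                if (isAsciiLetter(rtf.charAt(i))) {
                    // control word: letters, optional signed number, optional space
                    int nameStart = i;
                    while (i < rtf.length() && isAsciiLetter(rtf.charAt(i))) i++;
                    String name = rtf.substring(nameStart, i);
                    int numStart = i;
                    if (i + 1 < rtf.length() && rtf.charAt(i) == '-'
                            && isAsciiDigit(rtf.charAt(i + 1))) i++;
                    while (i < rtf.length() && isAsciiDigit(rtf.charAt(i))) i++;
                    String num = rtf.substring(numStart, i);
                    if (i < rtf.length() && rtf.charAt(i) == ' ') i++;
                    if (name.equals("u") && !num.isEmpty()) {
                        out.append("\\u").append(num).append(' ');
                        pendingSkip = ucStack.peek();
                    } else if (name.equals("uc") && !num.isEmpty()) {
                        ucStack.pop();
                        ucStack.push(Integer.parseInt(num));
                        out.append(rtf, start, i);
                    } else if (pendingSkip > 0) {
                        pendingSkip--;          // whole control word = one char
                    } else {
                        out.append(rtf, start, i);
                    }
                } else {
                    // control symbol; \'hh carries two hex digits
                    int end = (rtf.charAt(i) == '\'')
                            ? Math.min(i + 3, rtf.length()) : i + 1;
                    if (pendingSkip > 0) pendingSkip--;
                    else out.append(rtf, start, end);
                    i = end;
                }
            } else {
                if (c == '\r' || c == '\n') out.append(c);  // not counted
                else if (pendingSkip > 0) pendingSkip--;
                else out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With this, stripFallback("{\\uc1 a\\u8212 ?b}") would drop only the "?"
fallback char, and a pending skip never leaks past the closing } of the
group it started in.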
Some other things I noticed in RTFParser.java; I'm not sure if they
are really a problem in practice:
* I'm worried about how we replace \cell with \u0020\cell --
depending on the last \ucN control word, this could mean we
incorrectly skip some number of ansi chars? Changing to
{\u20}\cell would be safer since on group end the pending skip
chars are reset to 0.
* But then I also wonder if all the additional groups we are
creating (because we surround each \uXXXX with { }) are somehow
costly, eg if it causes RTFEditorKit to use more RAM / be slower /
something.
* When we look for the \ansicpgNNNN control word, I noticed we then
look up the NNNN in the FONTSET_MAP -- is that wrong? E.g., when I
look at the possible values for NNNN (at
http://latex2rtf.sourceforge.net/rtfspec_6.html) I see a bunch of
numbers that aren't in the FONTSET_MAP. We also use FONTSET_MAP
for \fcharsetNNN but the values for that control word look
correct.
* We don't seem to handle the opening charset in the RTF header (i.e.,
\ansi, \mac, \pc, \pca)?
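
For the last point, a minimal sketch of what handling the header charset
could look like. HeaderCharsetSketch and charsetFor are hypothetical
names, not existing Tika code. The code page choices follow my reading
of the RTF spec (\mac is Apple Macintosh, \pc is IBM code page 437, \pca
is IBM code page 850); treating \ansi as windows-1252 is an assumption,
since the real ANSI code page normally comes from \ansicpgNNNN. The
returned strings are Java charset names and may not be installed in
every JVM:

```java
public class HeaderCharsetSketch {

    /** Maps a header control word (without the backslash) to a Java charset name. */
    public static String charsetFor(String controlWord) {
        switch (controlWord) {
            case "mac":
                return "x-MacRoman";    // Apple Macintosh
            case "pc":
                return "IBM437";        // IBM PC code page 437
            case "pca":
                return "IBM850";        // IBM PC code page 850
            case "ansi":
            default:
                return "windows-1252";  // assumed fallback, see note above
        }
    }
}
```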
> RTF Parser issues with non european characters
> ----------------------------------------------
>
> Key: TIKA-683
> URL: https://issues.apache.org/jira/browse/TIKA-683
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Nick Burch
> Attachments: TIKA-683.patch, testRTFJapanese.rtf,
> testUnicodeUCNControlWordCharacterDoubling.rtf
>
>
> As reported on user@ in "non-West European languages support":
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters