[ https://issues.apache.org/jira/browse/TIKA-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920365#action_12920365 ]
Cristian Vat commented on TIKA-422:
-----------------------------------
Clarification:
The previous patch added extra spaces after or between special (encoded)
characters, so that "Übersicht" was transformed into "Ü bersicht".
This happened because some characters are encoded using the hex notation "\'hh".
No space should be added after such a text run, even though the character
values are lower than 255.
As stated in the specification, RTF is 7-bit, so every byte value above 127
is written as \'hh in the file.
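To illustrate the point about \'hh escapes, here is a minimal sketch (not the code from the attached patch) that decodes hex-escaped bytes in a text run and appends them without inserting any extra space. The class name RtfHexEscapes and the method decode are hypothetical, and the sketch assumes the code page is already known and that the run contains no other RTF control words.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Tika or the patch.
public class RtfHexEscapes {

    /** Decodes \'hh escapes in a plain text run using the given code page. */
    public static String decode(String run, Charset codePage) {
        StringBuilder out = new StringBuilder();
        // Consecutive escaped bytes are collected and decoded together,
        // so multi-byte code pages would also be handled.
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < run.length()) {
            if (run.startsWith("\\'", i) && i + 4 <= run.length()) {
                // Two hex digits follow the \' marker (malformed digits
                // would throw NumberFormatException in this sketch).
                pending.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, codePage);
                out.append(run.charAt(i));
                i++;
            }
        }
        flush(pending, out, codePage);
        return out.toString();
    }

    // Appends the decoded bytes directly, with no separator added.
    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset cs) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }
}
```

With this sketch, decode("\\'dcbersicht", Charset.forName("windows-1252")) yields "Übersicht" with no space after the "Ü".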
Also, regarding the charset conversion:
I tried some Czech documents that had caused problems before, and they are now
extracted correctly when the RTF file was created with WordPad.
However, RTF files produced by Microsoft Word still show some issues.
I hope to investigate further tomorrow.
> Wrong charset conversion in some RTF documents.
> -----------------------------------------------
>
> Key: TIKA-422
> URL: https://issues.apache.org/jira/browse/TIKA-422
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Reporter: Piotr Bartosiewicz
> Attachments: RTFParser.patch, RTFParser.patch, test-windows-1250.rtf
>
>
> The RTF parser uses javax.swing.text.rtf, but it sucks.
> It doesn't support the '\ansicpg' tag (quoting the RTF file format specification:
> "This keyword represents the default ANSI code page used to perform the
> Unicode to ANSI conversion when writing RTF text").
> Unfortunately, Windows WordPad saves non-ASCII characters using the \ansicpg
> code page instead of the Unicode characters that javax.swing.text.rtf supports.
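As a rough sketch of what honoring \ansicpg could look like, the snippet below reads the code page number from the RTF header and maps it to a Java Charset. The class AnsiCodePage and the method detect are hypothetical names, and the "windows-N" naming is an assumption that only holds for the Windows 125x code pages; anything else falls back to the supplied default.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika or the patch.
public class AnsiCodePage {

    // Matches the \ansicpgN control word and captures the code page number.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    /** Returns the charset named by \ansicpg, or the fallback if absent or unsupported. */
    public static Charset detect(String rtfHeader, Charset fallback) {
        Matcher m = ANSICPG.matcher(rtfHeader);
        if (m.find()) {
            // Assumption: code page N corresponds to the Java charset "windows-N".
            String name = "windows-" + m.group(1);
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        return fallback;
    }
}
```

For a header such as {\rtf1\ansi\ansicpg1250\deff0, this returns the windows-1250 charset on a JRE that ships it, which could then be used to decode the \'hh bytes in the document body.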