[
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095904#comment-13095904
]
Jukka Zitting commented on TIKA-683:
------------------------------------
+1, I'm eager to see us replace the javax.swing dependency with something we can
directly fix and improve.
The org.apache.tika.sax.SafeContentHandler class already does some sanitization
of SAX events, so that might be a good place to also check that tags are
correctly nested. Though as Uwe said, ideally the generator of the SAX events
would already take care of producing valid output.
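To illustrate the kind of nesting check meant here, the following is a rough
sketch only. NestingCheckHandler and its isWellNested helper are hypothetical
names, not existing Tika code, and the sketch uses only the JDK's org.xml.sax
classes rather than anything from org.apache.tika.sax. It keeps a stack of open
element names and fails as soon as an end tag does not match the innermost
open element.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of Tika: rejects SAX event streams
// whose start/end element events are not properly nested.
class NestingCheckHandler extends DefaultHandler {

    // Qualified names of the elements that are currently open,
    // innermost element first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // poll() returns null when nothing is open, which also counts as a mismatch.
        String expected = open.poll();
        if (!qName.equals(expected)) {
            throw new SAXException(
                    "Closing <" + qName + "> but innermost open element is " + expected);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed elements at end of document: " + open);
        }
    }

    // Convenience method for trying the check on a flat list of events:
    // "+name" opens an element, "-name" closes one.
    static boolean isWellNested(String... events) {
        NestingCheckHandler handler = new NestingCheckHandler();
        try {
            handler.startDocument();
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    handler.startElement("", name, name, new AttributesImpl());
                } else {
                    handler.endElement("", name, name);
                }
            }
            handler.endDocument();
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

A real version would presumably forward the events to a wrapped handler, and
might repair the stream instead of throwing; this sketch only detects the
problem.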
PS. I'd rather use a separate .java file for the ExtractRTFText class than have
it as a static inner class inside RTFParser. We can keep it package-private if
we don't want to expose it directly to downstream clients.
> RTF Parser issues with non european characters
> ----------------------------------------------
>
> Key: TIKA-683
> URL: https://issues.apache.org/jira/browse/TIKA-683
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Nick Burch
> Assignee: Michael McCandless
> Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch,
> TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf,
> testUnicodeUCNControlWordCharacterDoubling.rtf,
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters