[ 
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095904#comment-13095904
 ] 

Jukka Zitting commented on TIKA-683:
------------------------------------

+1, I'm eager to see us replace the javax.swing dependency with something we 
can directly fix and improve.

The org.apache.tika.sax.SafeContentHandler class already does some sanitization 
of SAX events, so that might be a good place to also check that tags are 
correctly nested. Though as Uwe said, ideally the generator of the SAX events 
would already take care of producing valid output.
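
To illustrate the kind of nesting check meant here, a rough sketch follows. It 
is not existing Tika code: NestingCheckHandler is a hypothetical name, and it 
uses only the JDK's org.xml.sax classes rather than anything in 
org.apache.tika.sax. It assumes we would want to fail fast on badly nested 
tags; silently repairing them would be another option.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical helper (not part of Tika): forwards SAX events to a
 * downstream handler and rejects end tags that are not properly nested.
 */
class NestingCheckHandler extends XMLFilterImpl {

    // Qualified names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    NestingCheckHandler(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        open.push(qName);
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        String expected = open.peek();
        if (!qName.equals(expected)) {
            throw new SAXException("Badly nested end tag </" + qName + ">, expected "
                    + (expected == null ? "no end tag" : "</" + expected + ">"));
        }
        open.pop();
        super.endElement(uri, localName, qName);
    }

    @Override
    public void endDocument() throws SAXException {
        // Any element still open at this point was never closed.
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed element <" + open.peek() + ">");
        }
        super.endDocument();
    }
}
```

A parser would wrap its output handler once, for example 
new NestingCheckHandler(handler), and emit events through the wrapper, so a 
sequence such as <p><b></p></b> fails at the first misplaced end tag.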

PS. I'd rather use a separate .java file for the ExtractRTFText class than have 
it as a static inner class inside RTFParser. We can keep it package-private if 
we don't want to expose it directly to downstream clients.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, 
> TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, 
> testUnicodeUCNControlWordCharacterDoubling.rtf, 
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
