[
https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906512#action_12906512
]
kirby zhou commented on XERCESC-1936:
-------------------------------------
Has anyone confirmed this?
> ICUTransService and IconvGNUTransService cannot handle huge files.
> ------------------------------------------------------------------
>
> Key: XERCESC-1936
> URL: https://issues.apache.org/jira/browse/XERCESC-1936
> Project: Xerces-C++
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 2.8.0, 3.1.1
> Environment: RHEL-5.5
> glibc-2.5-49.el5_5.2
> libicu-3.6-5.11.4
> Reporter: kirby zhou
>
> If a huge file is passed to XMLReader, it calls TransService multiple times
> and splits the file content into several fragments.
> Unfortunately, a fragment can end with an incomplete multi-byte character,
> and neither ICUTransService nor IconvGNUTransService handles that case:
> ICUTransService does not handle U_TRUNCATED_CHAR_FOUND, and
> IconvGNUTransService does not handle EINVAL.
> Both 2.8.0 and 3.1.1 have the same bug.
> For example, create two XML files like this:
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<2;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/small.xml
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<100000;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/big.xml
> # small.xml and big.xml differ only in the number of repetitions.
> ]# samples/SAXPrint -x=gbk ~/small.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> 中文汉字A中文汉字A
> </data>
> # with icu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
> Message: char 0x6C49 is not representable in 'gbk' encoding
> # with iconvgnu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
> Message: invalid multi-byte sequence
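
To illustrate the EINVAL case from the description, here is a minimal sketch of one way a caller could cope with a fragment that ends in the middle of a multi-byte character. It uses plain POSIX iconv() directly, not the Xerces-C++ transcoder classes, and the names ChunkedConverter, feed and has_pending are hypothetical. The idea is to treat EINVAL as "keep the trailing bytes and prepend them to the next fragment" rather than as a fatal error. It is not the actual fix for this issue.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Xerces-C++: converts input that arrives
// in arbitrary fragments and carries an incomplete trailing multi-byte
// sequence over to the next fragment.
class ChunkedConverter {
public:
    ChunkedConverter(const char* to, const char* from)
        : cd_(iconv_open(to, from)) {
        if (cd_ == (iconv_t)-1) throw std::runtime_error("iconv_open failed");
    }
    ~ChunkedConverter() { iconv_close(cd_); }
    ChunkedConverter(const ChunkedConverter&) = delete;
    ChunkedConverter& operator=(const ChunkedConverter&) = delete;

    // Converts one fragment and appends the result to `out`.
    void feed(const std::string& chunk, std::string& out) {
        // Prepend the bytes left over from the previous fragment.
        std::string input = pending_ + chunk;
        pending_.clear();

        char* in = &input[0];
        size_t in_left = input.size();
        char buf[4096];

        while (in_left > 0) {
            char* dst = buf;
            size_t dst_left = sizeof(buf);
            size_t rc = iconv(cd_, &in, &in_left, &dst, &dst_left);
            int err = errno;  // save before any other call can change it
            assert(dst_left <= sizeof(buf));
            out.append(buf, sizeof(buf) - dst_left);

            if (rc != (size_t)-1) continue;  // all input consumed, loop ends
            if (err == E2BIG) continue;      // output buffer full, go again
            if (err == EINVAL) {
                // Incomplete multi-byte sequence at the end of the input:
                // `in` points at its first byte. Keep it for the next call.
                pending_.assign(in, in_left);
                break;
            }
            // EILSEQ or anything else is a genuine error.
            throw std::runtime_error("invalid multi-byte sequence");
        }
    }

    // True if the last fragment ended in the middle of a character.
    bool has_pending() const { return !pending_.empty(); }

private:
    iconv_t cd_;
    std::string pending_;
};
```

A caller would construct one ChunkedConverter per input stream, for example ChunkedConverter conv("UTF-16LE", "GBK"), assuming the local iconv supports those encoding names, and call feed() once per fragment. If has_pending() is still true after the final fragment, the input really is truncated and should be reported as an error.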