[
https://issues.apache.org/jira/browse/DAFFODIL-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576375#comment-16576375
]
Michael Beckerle commented on DAFFODIL-1979:
--------------------------------------------
See PR https://github.com/apache/incubator-daffodil/pull/93
> UTF8 decoder doesn't handle 3-byte and 4-byte correctly
> -------------------------------------------------------
>
> Key: DAFFODIL-1979
> URL: https://issues.apache.org/jira/browse/DAFFODIL-1979
> Project: Daffodil
> Issue Type: Bug
> Components: Back End
> Affects Versions: 2.2.0
> Reporter: Michael Beckerle
> Assignee: Michael Beckerle
> Priority: Major
> Fix For: 2.2.0
>
>
> The decoder is classifying some valid characters as "overlong" and erroring out.
> The PNG schema on the DFDLSchemas GitHub has 1 test that runs into this bug on
> 3-byte Devanagari script characters.
> These are 6 Devanagari characters: e0 a4 b6 e0 a5 80 e0 a4 b0 e0 a5 8d e0 a4 b7
> e0 a4 95
> Should be: शीर्षक
> But they are coming out as all substitution characters.
> In 3-byte UTF-8, at least one of the bits marked M below must be non-zero.
> Notice that one of them is in the second byte. This second byte wasn't
> being tested.
> 1110MMMM 10Mxxxxx 10xxxxxx
> In 4-byte UTF-8, the bits of which at least one must be non-zero are:
> 11110MMM 10MMxxxx 10xxxxxx 10xxxxxx
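
The M-bit rule described in the issue can be sketched in a few lines of Python.
This is only an illustration of the check, not the code from the PR: the name
is_overlong and the SAMPLE constant are hypothetical, and the function assumes
it is handed the first two bytes of a sequence whose lead byte has already been
read.

```python
# Illustrative sketch only; is_overlong and SAMPLE are hypothetical names,
# not identifiers taken from Daffodil.

def is_overlong(b0: int, b1: int) -> bool:
    """Return True if a 3- or 4-byte UTF-8 sequence starting b0, b1 is overlong."""
    if (b0 & 0xF0) == 0xE0:
        # 3-byte form 1110MMMM 10Mxxxxx: overlong only if the low 4 bits of
        # byte 0 AND bit 0x20 of byte 1 are all zero.
        return (b0 & 0x0F) == 0 and (b1 & 0x20) == 0
    if (b0 & 0xF8) == 0xF0:
        # 4-byte form 11110MMM 10MMxxxx: overlong only if the low 3 bits of
        # byte 0 AND bits 0x30 of byte 1 are all zero.
        return (b0 & 0x07) == 0 and (b1 & 0x30) == 0
    # 1- and 2-byte forms are outside the scope of this sketch.
    return False


# The six Devanagari characters quoted in the issue.
SAMPLE = bytes.fromhex("e0a4b6e0a580e0a4b0e0a58de0a4b7e0a495")

# Each character is 3 bytes; check the first two bytes of every triple.
flags = [is_overlong(SAMPLE[i], SAMPLE[i + 1]) for i in range(0, len(SAMPLE), 3)]
print(flags)
```

Every lead byte in the sample is e0, so its four M bits are all zero; the
sequences are valid only because the M bit in the second byte (a4 or a5) is set.
A check that looks at the first byte alone would reject all six characters,
which matches the behaviour reported above.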
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)