[jira] [Commented] (DAFFODIL-639) unicodeByteOrderMark feature

Michael Beckerle (JIRA) Tue, 23 Oct 2018 09:59:23 -0700


    [ 
https://issues.apache.org/jira/browse/DAFFODIL-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660965#comment-16660965
 ]


Michael Beckerle commented on DAFFODIL-639:
-------------------------------------------

The DFDL Workgroup is discussing (Oct 2018) whether all of this BOM 
functionality should be optional. If so, we're unlikely to implement it at all.


> unicodeByteOrderMark feature
> ----------------------------
>
>                 Key: DAFFODIL-639
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-639
>             Project: Daffodil
>          Issue Type: New Feature
>          Components: Back End, DFDL Language
>            Reporter: Michael Beckerle
>            Priority: Minor
>
> This is not a property. The unicodeByteOrderMark is a member of the Infoset 
> Document Item (aka the root element). 
> It depends on the dfdl:encoding property, which can be a runtime expression; 
> hence, this must be computed in an Evaluatable which in turn evaluates the 
> encodingEv.
> Likely an Evaluatable[Option[ByteOrder]] is the type. 
> If no encoding property is defined this should be a constant None. 
> If the encoding property is defined and known to NOT be one of UTF-8, UTF-16, 
> or UTF-32, then this should be a constant None. 
> When unparsing, the value will either have been set from parsing, or can be 
> set from an API call. (New API method on Infoset needed.)
> The API call is always allowed, but the value is ignored by the unparser 
> unless the encoding is UTF-8, UTF-16, or UTF-32. 
> When the encoding evaluates to UTF-8, then the unicodeByteOrderMark will be 
> determined by the first 3 bytes being:
> * 0xEF 0xBF 0xBE - ByteOrder.LittleEndian - 3 bytes are consumed. (Note: 
> strictly speaking, this shouldn't occur, but it will if a naive UTF-8 encoder 
> encodes a little-endian BOM into a 3-byte UTF-8 sequence. To ensure such data 
> will round trip between UTF-8 and UTF-16 (LE, via BOM), we match this 
> sequence and choose LittleEndian byte order.)
> * 0xEF 0xBB 0xBF - ByteOrder.BigEndian - 3 bytes are consumed
> * anything else - no bytes are consumed, and the unicodeByteOrderMark is not 
> set (has no value)
> When unparsing, if unicodeByteOrderMark is not set, then no byte order mark 
> is output. 
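
The UTF-8 parse rule above could be sketched roughly as follows. This is only 
an illustration, not Daffodil code: the object name, the ByteOrder stand-in, 
BomResult, and parseBom are all hypothetical names made up for this sketch.

```scala
object Utf8BomSketch {
  // Hypothetical stand-in for the ByteOrder type named in the description.
  sealed trait ByteOrder
  case object BigEndian extends ByteOrder
  case object LittleEndian extends ByteOrder

  // mark = None means the unicodeByteOrderMark is not set (has no value).
  final case class BomResult(mark: Option[ByteOrder], consumed: Int)

  private def bytes(values: Int*): List[Byte] = values.map(_.toByte).toList

  // U+FEFF encoded as UTF-8.
  private val bigEndianBom = bytes(0xEF, 0xBB, 0xBF)
  // U+FFFE (a byte-swapped BOM) encoded as UTF-8 by a naive encoder.
  private val littleEndianBom = bytes(0xEF, 0xBF, 0xBE)

  // Inspect the first 3 bytes; consume them only when they match a pattern.
  def parseBom(data: Array[Byte]): BomResult = {
    val head = data.take(3).toList
    if (head == bigEndianBom) BomResult(Some(BigEndian), 3)
    else if (head == littleEndianBom) BomResult(Some(LittleEndian), 3)
    else BomResult(None, 0)
  }
}
```

Data shorter than 3 bytes falls into the "anything else" case here, so no 
bytes are consumed and the mark stays unset.
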
> For UTF-16,
> * 0xFE 0xFF - ByteOrder.BigEndian - 2 bytes are consumed
> * 0xFF 0xFE - ByteOrder.LittleEndian - 2 bytes are consumed
> * anything else - parse error.
> When unparsing, if encoding is UTF-16, and unicodeByteOrderMark is not set - 
> unparse error.
> UTF-32 works like UTF-16, except the byte patterns are 00 00 FE FF for 
> BigEndian, and FF FE 00 00 for LittleEndian, and 4 bytes are consumed.
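
A minimal sketch of the UTF-16 and UTF-32 rules, under stated assumptions. 
All names here (the object, Bom, parseBom, unparseBom) are hypothetical and 
not part of Daffodil. Errors are modelled as Left(message) rather than as real 
parse/unparse errors, and writing out the matching pattern when the mark is 
set is an assumption; the description only states the error case.

```scala
object Utf16Utf32BomSketch {
  // Hypothetical stand-in for the ByteOrder type named in the description.
  sealed trait ByteOrder
  case object BigEndian extends ByteOrder
  case object LittleEndian extends ByteOrder

  final case class Bom(bytes: List[Byte], order: ByteOrder)

  private def bytes(values: Int*): List[Byte] = values.map(_.toByte).toList

  // Byte patterns from the description, keyed by encoding name.
  private val patterns: Map[String, List[Bom]] = Map(
    "UTF-16" -> List(
      Bom(bytes(0xFE, 0xFF), BigEndian),
      Bom(bytes(0xFF, 0xFE), LittleEndian)),
    "UTF-32" -> List(
      Bom(bytes(0x00, 0x00, 0xFE, 0xFF), BigEndian),
      Bom(bytes(0xFF, 0xFE, 0x00, 0x00), LittleEndian)))

  // Right((order, bytesConsumed)) on a match; Left models the parse error.
  def parseBom(encoding: String, data: Array[Byte]): Either[String, (ByteOrder, Int)] =
    patterns.get(encoding) match {
      case None => Left(s"not handled by this sketch: $encoding")
      case Some(candidates) =>
        candidates.collectFirst {
          case Bom(b, order) if data.take(b.length).toList == b => (order, b.length)
        }.toRight(s"parse error: no $encoding byte order mark found")
    }

  // Left models the unparse error raised when the mark is not set.
  def unparseBom(encoding: String, mark: Option[ByteOrder]): Either[String, Array[Byte]] =
    (patterns.get(encoding), mark) match {
      case (None, _) => Left(s"not handled by this sketch: $encoding")
      case (Some(_), None) =>
        Left(s"unparse error: unicodeByteOrderMark is not set for $encoding")
      case (Some(candidates), Some(order)) =>
        // Assumption: a set mark is written out as the matching pattern.
        candidates.collectFirst { case Bom(b, o) if o == order => b.toArray }
          .toRight(s"no $encoding pattern for $order")
    }
}
```

Looking patterns up by encoding name keeps the UTF-16 little-endian pattern 
(FF FE) from being confused with the start of the UTF-32 little-endian one 
(FF FE 00 00).
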
> Recommended: package this code for reuse, assuming it needs to be used as a 
> library for reading/decoding strings generally. It's possible that the 
> above runtime errors, raised when the byte order is not known, will be 
> augmented in the future by a mode where each individual text string is 
> examined for a byte order mark at its start. There may also be a need for 
> heuristic UTF-16 byte-order determination, that is, looking at the bytes 
> for the characters and determining whether they make more sense as 
> big-endian or little-endian. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
