[
https://issues.apache.org/jira/browse/DAFFODIL-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660965#comment-16660965
]
Michael Beckerle commented on DAFFODIL-639:
-------------------------------------------
The DFDL Workgroup is discussing (Oct 2018) whether all of this BOM
functionality should be optional. If so, we're unlikely to implement it at all.
> unicodeByteOrderMark feature
> ----------------------------
>
> Key: DAFFODIL-639
> URL: https://issues.apache.org/jira/browse/DAFFODIL-639
> Project: Daffodil
> Issue Type: New Feature
> Components: Back End, DFDL Language
> Reporter: Michael Beckerle
> Priority: Minor
>
> This is not a property. The unicodeByteOrderMark is a member of the Infoset
> Document Item (aka the root element).
> It depends on the dfdl:encoding property, which can be a runtime expression;
> hence, this must be computed in an Evaluatable, which in turn evaluates the
> encodingEv.
> The type is likely Evaluatable[Option[ByteOrder]].
> If no encoding property is defined, this should be a constant None.
> If the encoding property is defined and known to NOT be one of UTF-8, UTF-16,
> or UTF-32, then this should be a constant None.
> When unparsing, the value will either have been set from parsing, or can be
> set from an API call. (New API method on Infoset needed.)
> The API call is always allowed, but the value is ignored by the unparser
> unless the encoding is UTF-8, UTF-16, or UTF-32.
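
The encoding rule above could be sketched roughly as follows. This is only an
illustration, not Daffodil code: the object name BomApplicability, the methods
usesBom and effectiveBom, the local ByteOrder type, and the case-insensitive
matching of encoding names are all assumptions made for the example.

```scala
import java.util.Locale

object BomApplicability {
  // Stand-in for whatever ByteOrder type the runtime actually uses (assumption).
  sealed trait ByteOrder
  case object BigEndian extends ByteOrder
  case object LittleEndian extends ByteOrder

  // The only encodings for which the description gives the BOM a meaning.
  private val bomEncodings = Set("UTF-8", "UTF-16", "UTF-32")

  // True when an encoding is defined and is one of UTF-8, UTF-16, UTF-32.
  // Case-insensitive matching is an assumption of this sketch.
  def usesBom(encoding: Option[String]): Boolean =
    encoding.exists(e => bomEncodings.contains(e.toUpperCase(Locale.ROOT)))

  // Unparse-side view: a value set through the API is accepted, but it is
  // ignored (treated as None) unless the encoding is a BOM-bearing one.
  def effectiveBom(
    encoding: Option[String],
    fromInfoset: Option[ByteOrder]
  ): Option[ByteOrder] =
    if (usesBom(encoding)) fromInfoset else None
}
```

In a real implementation this decision would presumably live inside the
Evaluatable so it can be made at runtime when dfdl:encoding is an expression.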
> When the encoding evaluates to UTF-8, then the unicodeByteOrderMark will be
> determined by the first 3 bytes being:
> * 0xEF 0xBF 0xBE - ByteOrder.LittleEndian - 3 bytes are consumed (note:
> strictly speaking, this shouldn't occur, but it will if a naive UTF-8 encoder
> encodes a little-endian BOM into a 3-byte UTF-8 sequence. To ensure such data
> will round trip between UTF-8 and UTF-16 (LE - via BOM), we match this
> sequence and choose LittleEndian byte order)
> * 0xEF 0xBB 0xBF - ByteOrder.BigEndian - 3 bytes are consumed
> * anything else - no bytes are consumed, and the unicodeByteOrderMark is not
> set (has no value)
> When unparsing, if unicodeByteOrderMark is not set, then no byte order mark
> is output.
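
A minimal sketch of the UTF-8 rules above, assuming the first bytes of the data
are available as an array. The names Utf8Bom, Detection, detect and
unparsePrefix are hypothetical. The description only states what is output
when the value is not set; emitting the matching 3-byte sequence when it is
set is an assumption based on the round-trip remark.

```scala
object Utf8Bom {
  // Stand-in for the runtime's ByteOrder type (assumption).
  sealed trait ByteOrder
  case object BigEndian extends ByteOrder
  case object LittleEndian extends ByteOrder

  // Detected byte order (if any) and the number of bytes to consume.
  final case class Detection(byteOrder: Option[ByteOrder], consumed: Int)

  def detect(data: Array[Byte]): Detection = {
    // Mask to unsigned values so the bytes compare against 0xEF and friends.
    val head = data.take(3).map(_ & 0xFF).toList
    head match {
      case List(0xEF, 0xBB, 0xBF) => Detection(Some(BigEndian), 3)    // U+FEFF
      case List(0xEF, 0xBF, 0xBE) => Detection(Some(LittleEndian), 3) // U+FFFE
      case _                      => Detection(None, 0) // nothing consumed
    }
  }

  // Bytes to emit before the data when unparsing UTF-8.
  // Not set => no byte order mark, as the description requires.
  // The two Some cases are an assumption made for round-tripping.
  def unparsePrefix(bom: Option[ByteOrder]): Array[Byte] = bom match {
    case Some(BigEndian)    => Array(0xEF, 0xBB, 0xBF).map(_.toByte)
    case Some(LittleEndian) => Array(0xEF, 0xBF, 0xBE).map(_.toByte)
    case None               => Array.empty[Byte]
  }
}
```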
> For UTF-16,
> * 0xFE 0xFF - ByteOrder.BigEndian - 2 bytes are consumed
> * 0xFF 0xFE - ByteOrder.LittleEndian - 2 bytes are consumed
> * anything else - parse error.
> When unparsing, if the encoding is UTF-16 and unicodeByteOrderMark is not set,
> this is an unparse error.
> UTF-32 works like UTF-16, except the byte patterns are 0x00 0x00 0xFE 0xFF for
> ByteOrder.BigEndian and 0xFF 0xFE 0x00 0x00 for ByteOrder.LittleEndian.
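
The UTF-16 and UTF-32 cases differ only in their byte patterns, so one
table-driven sketch can cover both. The names WideBom, marks, parse and
unparse are hypothetical, and modelling parse and unparse errors as Left
values is a simplification for the example rather than how Daffodil reports
errors.

```scala
object WideBom {
  // Stand-in for the runtime's ByteOrder type (assumption).
  sealed trait ByteOrder
  case object BigEndian extends ByteOrder
  case object LittleEndian extends ByteOrder

  // Byte order marks per encoding, as listed in the description.
  private val marks: Map[String, List[(List[Int], ByteOrder)]] = Map(
    "UTF-16" -> List(
      List(0xFE, 0xFF) -> BigEndian,
      List(0xFF, 0xFE) -> LittleEndian),
    "UTF-32" -> List(
      List(0x00, 0x00, 0xFE, 0xFF) -> BigEndian,
      List(0xFF, 0xFE, 0x00, 0x00) -> LittleEndian)
  )

  // Right((byteOrder, bytesConsumed)) on a match, Left(message) otherwise.
  def parse(encoding: String, data: Array[Byte]): Either[String, (ByteOrder, Int)] = {
    val head = data.take(4).map(_ & 0xFF).toList // unsigned view of the start
    marks.getOrElse(encoding, Nil)
      .collectFirst { case (mark, order) if head.startsWith(mark) => (order, mark.length) }
      .toRight(s"parse error: no $encoding byte order mark at start of data")
  }

  // Right(bytes to emit) when the byte order is set, Left(message) when it is not.
  def unparse(encoding: String, bom: Option[ByteOrder]): Either[String, Array[Byte]] =
    bom
      .flatMap { order =>
        marks.getOrElse(encoding, Nil).collectFirst {
          case (mark, o) if o == order => mark.map(_.toByte).toArray
        }
      }
      .toRight(s"unparse error: unicodeByteOrderMark is not set for $encoding")
}
```

Because the lookup is keyed on the encoding, the fact that the UTF-32
little-endian mark begins with the UTF-16 little-endian mark causes no
ambiguity here.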
> Recommended: package this code for reuse, assuming it needs to be used as a
> library for reading/decoding strings generally. It is possible that the
> runtime errors above, raised when the byte order is not known, will be
> augmented in the future by a mode where each individual text string is
> examined, at fine granularity, for a byte order mark at its start. There may
> also be a need for UTF-16 heuristic byte-order determination, that is,
> looking at the bytes for the characters and determining whether they make
> more sense as big-endian or little-endian.
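
To make the heuristic idea concrete, here is one very simple possibility. It
is not something the issue specifies. It assumes mostly Latin-range text,
where the high byte of each UTF-16 code unit is zero, and just compares how
many zero bytes sit at even versus odd offsets. The names Utf16Heuristic and
guess are hypothetical.

```scala
object Utf16Heuristic {
  // Stand-in for the runtime's ByteOrder type (assumption).
  sealed trait ByteOrder
  case object BigEndian extends ByteOrder
  case object LittleEndian extends ByteOrder

  // Guess the byte order of BOM-less UTF-16 data, or None if undecided.
  def guess(data: Array[Byte]): Option[ByteOrder] = {
    // Big-endian Latin text has its zero (high) bytes at even offsets,
    // little-endian Latin text has them at odd offsets.
    val zerosAtEven = data.indices.count(i => i % 2 == 0 && data(i) == 0)
    val zerosAtOdd  = data.indices.count(i => i % 2 == 1 && data(i) == 0)
    if (zerosAtEven > zerosAtOdd) Some(BigEndian)
    else if (zerosAtOdd > zerosAtEven) Some(LittleEndian)
    else None // no evidence either way
  }
}
```

A production heuristic would need to cope with non-Latin scripts and
surrogate pairs; this only shows the shape of the approach.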