Comments inline.

> 
> This memo describes a proposed feature for expressing data stream pre/post 
> processing operations.
> 
> Most of the discussion here will use parsing as context, but where the 
> unparsing is not clearly symmetric, unparsing will also be described.
> 
> New DFDL schema annotations are shown in the "daf:" namespace so that it is 
> clear which properties are DFDL standard and which are the new extensions.
> 
> 
> The core concept is a cluster of new properties.
> 
> * streamEncoding (literal string or DFDL expression)
> * streamLengthKind (can be explicit, delimited, pattern, endOfParent, 
> prefixed) 
> * streamLength - used for streamLengthKind 'explicit'
> * streamLengthUnits (bits or bytes)
> * streamLengthPattern - used for streamLengthKind 'pattern'
> * streamTerminator - (literal string or DFDL expression) - used for 
> streamLengthKind 'delimited' - neither used nor allowed for other length 
> kinds (TBD: asymmetric with terminator on a non-delimited element)
> * streamEscapeSchemeRef - used for streamLengthKind 'delimited' to escape 
> the streamTerminator when necessary.
> 
> Those properties are valid on the DFDL annotation elements dfdl:format, 
> dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group.
> 
> An additional non-format property:
> 
> * streamTransform
> 
> This cannot appear on dfdl:format. Only on dfdl:element, dfdl:simpleType, 
> dfdl:sequence, dfdl:choice, dfdl:group.
> 
> Specifying the streamTransform property puts a stream transform into use. The 
> streamTransform property is specifically not allowed on dfdl:format because 
> it is not sensible to put a stream transformation into effect across a 
> lexical scope. Stream transforms apply to the dynamic scope of the Term they 
> are associated with.
> 
> (This might not work out. It may be of value to define streamTransform in a 
> format, if that format is named, and only referenced from the term that 
> defines the dynamic scope where that stream transform is to be used. If we 
> allow streamTransform on dfdl:format annotations, there are just certain 
> situations where we would want SDE errors to be detected, such as if 
> streamTransform is in lexical scope over a file.)
>

What about making this similar to escape schemes with something like
dfdl:defineStreamTransform/dfdl:streamTransformRef? It seems kind of
similar in usage to escape schemes, and maybe that helps with the
scoping issue?

> A data stream is conceptually a stream of bytes. It can be an input stream 
> for parsing, an output stream for unparsing.
> Use of the term "stream" here is consistent with Java's use of stream as in 
> InputStream and OutputStream. These are sources and sinks of bytes. If one 
> wants to decode characters from them, one must do so by specifying the 
> encoding explicitly.
> 
> A stream transform is a layering to create one stream of bytes from another. 
> An underlying stream is encapsulated by a transformation to create an 
> overlying stream.
> 
> When parsing, reading from the overlying stream causes reading of data from 
> the underlying stream, which data is then transformed and becomes the bytes 
> of the overlying stream returned from the read.
> 
> The stream properties apply to the underlying stream data and indicate how to 
> identify its bounds/length, and if a stream transform is textual, what 
> encoding is used to interpret the underlying bytes.
> 
> Some transformations are naturally binary bytes to bytes. Data 
> decompress/compress are the typical example here. When parsing, the overlying 
> stream's bytes are the result of decompression of the underlying stream's 
> bytes.
> 
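
As an illustrative sketch of such a bytes-to-bytes layering using Python's
standard library (this is not Daffodil's transform API, just the layering
concept):

```python
import gzip
import io

# Underlying stream: gzip-compressed bytes (a binary bytes-to-bytes transform).
underlying = io.BytesIO(gzip.compress(b"payload bytes to be parsed"))

# Overlying stream: reading from it decompresses the underlying bytes on demand.
overlying = gzip.GzipFile(fileobj=underlying, mode="rb")

result = overlying.read()
print(result)  # b'payload bytes to be parsed'
```
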
> If a transform requires text, then a stream encoding must be defined. For 
> example, base64 is a transform that creates bytes from text. Hence, a stream 
> encoding is needed to convert the underlying stream of bytes into text, then 
> the base64 decoding occurs on that text, which produces the bytes of the 
> overlying stream.
> 
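
A sketch of that two-step decode in Python (the payload bytes and the utf-16
streamEncoding here are hypothetical, chosen to match the later examples):

```python
import base64

# Hypothetical underlying stream content: base64 text stored in utf-16 bytes.
payload = b"\xde\xad\xbe\xef"
underlying_bytes = base64.b64encode(payload).decode("ascii").encode("utf-16")

# Step 1: the streamEncoding (utf-16 here) turns underlying bytes into text.
text = underlying_bytes.decode("utf-16")

# Step 2: base64-decoding that text yields the overlying stream's bytes.
overlying_bytes = base64.b64decode(text)

print(overlying_bytes == payload)  # True
```
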
> We think of some transforms as text-to-text. Line folding/unfolding is one 
> such. Lines of text that are too long are wrapped by inserting a line-ending 
> and a space. As a DFDL stream transform this line folding transform requires 
> an encoding. The underlying bytes are decoded into characters according to 
> the encoding. Those characters are divided into lines, and the line unfolding 
> (for parsing) is done to create longer lines of data, the resulting data is 
> then encoded from characters back into bytes using the same encoding.
> 
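
The decode/unfold/re-encode pipeline can be sketched as follows, assuming a
fold is exactly a CRLF followed by one space (the actual fold convention would
be defined by the foldedLines transform):

```python
encoding = "utf-8"  # the streamEncoding in effect

def unfold_stream(underlying: bytes) -> bytes:
    # Decode underlying bytes into characters using the stream encoding.
    text = underlying.decode(encoding)
    # Remove each inserted line-ending + space (the fold) to restore long lines.
    unfolded = text.replace("\r\n ", "")
    # Re-encode the characters back into the bytes of the overlying stream.
    return unfolded.encode(encoding)

print(unfold_stream(b"a long line that wa\r\n s folded"))
```
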
> (There may be opportunities to shortcut these transformations if the 
> overlying stream is the data stream for an element with scannable text 
> representation using the same character set encoding.)
> 
> DFDL can describe a mixture of character set decoding/encoding and binary 
> value parsing/unparsing against the same underlying data representation; 
> hence, the underlying data stream concept is always one of bytes.
> 
> (TBD: maybe it has to be bits? E.g., in mil-std-2045 headers, the VMF payload 
> data can be compressed. I don't know that this payload data always begins on 
> a byte boundary.)

A VMF message has two parts, an application header and user data. The
spec says, "The application header shall always be a multiple of 8 bits.
If an application header is not a multiple of 8 bits, it shall be zero
filled so that it becomes a multiple of 8 bits.". So the user data part
(the part that could be compressed) must always start on a byte
boundary. Similarly the user data field is also always filled to a byte
boundary. The supported compression algorithms are LZW and GZIP, which
both only work on bytes. So as far as MIL-STD-2045/VMF is concerned, bits
should not be necessary, which makes things much easier.

> Daffodil parsing begins with a default standard data input stream. Unparsing 
> begins with a default standard output stream.
> 
> When a DFDL schema wants to describe, say, base64 decoding, the DFDL 
> annotations might look like this:
> 
> <element name="foo" daf:streamTransform="base64">
>   <complexType>
>     <sequence>
>       ....
>     </sequence>
>   </complexType>
> </element>
> 
> This annotation means: when parsing element foo, take whatever data stream is 
> in effect, layer a base64 data stream on it, and use that until the end of 
> element foo. The streamEncoding property would be taken from the lexically 
> enclosing format. 
> 
> In this example, when element foo is being parsed, the current data input 
> stream is augmented by being encapsulated in a base64 transformer. This 
> transformer takes the data stream, decodes it to characters using the 
> streamEncoding, then processes the resulting text converting base64 to binary 
> data.
> 
> The APIs for defining the base64 or other transformers enable one to do these 
> transformations in a streaming manner, on demand as data is pulled from the 
> resulting data stream of bytes. Of course it is possible to just convert the 
> entire data object, but we want to enable streaming behavior in case 
> stream-encoded objects are large.
> 
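
One way to picture that pull-style behavior in Python (a sketch; the class
name and chunking policy are invented for illustration, not Daffodil's actual
transformer API):

```python
import base64
import io

class Base64DecodingStream(io.RawIOBase):
    """Overlying stream: pulls base64 text from the underlying stream in
    chunks and serves decoded bytes on demand. Chunk reads are kept at a
    multiple of 4 base64 characters so each chunk decodes independently."""

    def __init__(self, underlying, encoding="us-ascii"):
        self.underlying = underlying
        self.encoding = encoding
        self.decoded = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Pull and decode underlying data only until the caller's buffer
        # can be satisfied, or the underlying stream is exhausted.
        while len(self.decoded) < len(b):
            chunk = self.underlying.read(4096)  # 4096 is a multiple of 4
            if not chunk:
                break
            self.decoded += base64.b64decode(chunk.decode(self.encoding))
        n = min(len(b), len(self.decoded))
        b[:n], self.decoded = self.decoded[:n], self.decoded[n:]
        return n

raw = b"some large payload" * 100
stream = io.BufferedReader(Base64DecodingStream(io.BytesIO(base64.b64encode(raw))))
round_tripped = stream.read()
print(round_tripped == raw)  # True
```
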
> We have just seen how the daf:streamEncoding property is used by element foo 
> as part of the data stream transformation.
> 
> Let's consider how streamLength works.
> 
> There are now two lengths we have to describe. One is the length of the data 
> that is to be transformed. The second is the length of the parsed element 
> taken from the result of the transformation.
> 
> One may have a base64 encoded region with a streamLength of 1000 bytes; 
> within that, once decoded, one will have only 750 or so bytes available. That 
> data is limited by the 750-byte length of the decoded data. At the time 
> parsing begins, neither of these numbers, 1000 nor 750, may be known. 
> 
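
The 1000/750 figures follow from base64's 4:3 expansion (each 3 payload bytes
become 4 base64 characters), ignoring line breaks and assuming a single-byte
stream encoding; for instance:

```python
import base64

decoded = bytes(750)                 # 750 payload bytes
encoded = base64.b64encode(decoded)  # the base64 text in the underlying stream
print(len(encoded))  # 1000
```
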
> <dfdl:defineFormat name="fooStreamFormat">
>   <dfdl:format streamEncoding="utf-16" streamLengthKind="explicit"/>
> </dfdl:defineFormat>
> 
> This data stream will decode utf-16 characters from the underlying data 
> stream, then base64 decode that text to get a stream of bytes.
> 
> <dfdl:defineFormat name="fooFormat">
>   <dfdl:format ref="tns:fooStreamFormat" encoding="utf-8" 
> byteOrder="bigEndian"/>
> </dfdl:defineFormat>
> 
> Then the type 
> 
> <element name="len" type="xs:int".../>
> <element name="foo" dfdl:ref="tns:fooFormat" type="tns:fooType" 
> dfdl:initiator="foo:"
>        daf:streamLength="{ ../len }" daf:streamTransform="base64"/>
> 
> Note how the property daf:streamLength is supplied where the expression is 
> relevant, but the other properties controlling the stream processing are 
> expressed reusably.
> 
> In this example, the dfdl:initiator for foo will be decoded as utf-8 
> characters from the byte stream produced by the base64 transform. However, 
> that base64 text was itself obtained by a utf-16 decode of the underlying 
> byte stream. 
> 
> For the unparse direction, this len element needs a dfdl:outputValueCalc. The 
> calculation needs the length of the base64 encoded data.
> 
> This would be expressed as
> 
> <element name="len" type="xs:int" dfdl:outputValueCalc="{ 
> daf:streamLength(../foo, 'bytes') }"/>
> 
> This function daf:streamLength is much like dfdl:valueLength and 
> dfdl:contentLength, except that it accesses the
> underlying data stream representation. The units are 'bits', 'bytes' or 
> 'characters'. If 'characters' is specified, then the value returned is the
> number of characters in the data stream's encoding of the data. In the 
> example above, this would be the number of utf-16 characters
> in the underlying stream before base64 decoding takes place.

Is this last sentence correct? I would expect it to output the length of
the base64 encoded data, and I would expect valueLength to return the
number of utf-16 characters from before the data was base64 encoded?

> ('characters' may not be needed.)
> 
> If the units are specified as 'bytes', then the length in bytes of the 
> underlying data stream, prior to transformation, is provided.
> 
> ('bits' may or may not be needed, or if provided perhaps we get away with it 
> just being like 'bytes' * 8 and require lengths to be multiple of a byte.)
> 
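
For a multi-byte streamEncoding such as the utf-16 used above, 'characters'
and 'bytes' genuinely differ; a sketch (using utf-16-be here so byte-order-mark
bytes don't complicate the count):

```python
b64_text = "3q2+7w=="                        # 8 base64 characters
stream_bytes = b64_text.encode("utf-16-be")  # each character occupies 2 bytes
print(len(b64_text), len(stream_bytes))      # 8 characters vs 16 bytes
```
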
> Let's look at an example of two interacting data stream transforms.
> 
> <xs:sequence
>   daf:streamEncoding="utf-8" daf:streamTransform="foldedLines" 
> daf:streamLengthKind="delimited">

streamLengthKind is delimited, but no streamTerminator is defined? How
does it know when to stop transforming folded lines? Does it look at
parent terminating markup? In which case, what is the purpose of
streamTerminator? Maybe this is just an omission? Or maybe
streamLengthKind should be endOfParent?

Related, how does parent terminating markup interact with a delimited
length stream. My assumption based on what you've said is that such
terminating markup is ignored and only applied after the transform? This
makes sense in the line folding case where we want to ignore terminating
markup until after the line folding is removed.

In that case, what if a stream is terminated by parent terminating
markup? Do you duplicate the delimiter in daf:streamTerminator? For
example, let's say we have an unbounded comma-separated array of base64
encoded utf-16 data. I would expect it to look like this:

<xs:sequence dfdl:separator="," dfdl:separatorPosition="infix">
  <xs:element name="utfString" type="xs:string" maxOccurs="unbounded"
    dfdl:encoding="utf-16" dfdl:occursCountKind="implicit"
    daf:streamTransform="base64"
    daf:streamEncoding="us-ascii"
    daf:streamLengthKind="delimited"
    daf:streamTerminator=","/>
</xs:sequence>

This raises some questions/issues:

1) separatorPosition is infix, so how does the base64 transform know
to stop for the last element that isn't followed by a comma if it
ignores parent terminating markup? Does streamTerminator need to be
modified to include all parent terminating markup? That seems doable but
difficult and might make reuse hard. So maybe my assumption was wrong
that parent terminating markup is ignored? However, I imagine there are
some cases where we do want to ignore parent terminating markup, like in
the line folding case. Maybe we need different delimited
daf:streamLengthKinds? One that ignores parent terminating markup and
one that doesn't?

2) Is it expected that the streamTerminator is not consumed by a
delimited streamTransform? And that it is the responsibility of the
surrounding data to consume it? That is inconsistent with
dfdl:terminator, but might make sense. So in the above example, the
transform will not consume the comma separators and stop short of it?
This makes sense to me, as otherwise both the utfString and the
surrounding sequence will want to consume the separator. But maybe the
parent terminator markup thing means the streamTerminator isn't set to a
comma?

>   ...
>   ... presumably everything here is textual, and utf-8 because foldedLines 
> only applies sensibly to text.
>   ...
>   <xs:sequence daf:streamEncoding="us-ascii" daf:streamTransform="base64" 
> daf:streamLengthKind="delimited" daf:streamTerminator="{ ../marker }">
>       ...
>       ... everything here is parsed against the bytes obtained from base64 
> decoding
>       ... which is itself decoding the output of the foldedLines transform
>       ... above. Base64 requires only us-ascii, which is a subset of utf-8.
>       ...
>   </xs:sequence>
> </xs:sequence>
> 
> Summary
> * allows stacking transforms one on top of another. So you can have base64 
> encoded compressed data as the payload representation of
> a child element within a larger element.
> * allows specifying properties of the underlying data stream separately from 
> the properties of the logical data.
> * scopes the transforms over a term (model-group or element)
> * prevents inadvertent lexical scoping of a streamTransform from a lexically 
> enclosing top level format annotation.
> 
> 
> Implementation Notes:
> 
> Introduction of a stream transform basically appears in the Term grammar as a 
> combinator that surrounds the contained Term contents.
> 

I really like this proposal, seems like the correct approach. Some
things that might be worth adding:

1) How are errors handled in a stream (e.g. trying to gunzip, but it's
not a valid gzip stream)? Are they just ProcessingErrors, and do things
backtrack as usual using standard PoCs?

2) What about options for a transform? For example, you might want to
specify that a gzip stream should use something like --best or --fast to
favor compression size vs. speed. Or what variant of base64 should be
used. Options might also be used to describe how errors should be
handled specific to a transform. For example, base64 can ignore garbage
characters when decoding, but that might want to be a processing error
in some cases.

I guess this could be a single option with space separated key/value
pairs, e.g.

  daf:streamTransformOptions="base64_ignore_garbage=yes
base64_variant=rfc1421"

That's very extensible, but might not be consistent with the rest of
DFDL. Maybe we need specific options for each stream transform, e.g.

  daf:streamTransformBase64IgnoreGarbage="yes"
  daf:streamTransformBase64Variant="rfc1421"
  ..

- Steve
