Updated proposal attached.

________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Thursday, January 4, 2018 9:28:51 AM
To: Mike Beckerle
Subject: Re: Please review & discuss - draft proposal for how to do base64, 
foldedLines, etc.

<-- snip -->

> 2) What about options for a transform? For example, you might want to
> specify a gzip stream to do something like --best or --fast to favor
> compression size vs speed. Or what variation of base64 should be used.
> Might also be used to describe how errors should be handled specific to a
> transform. For example, base64 can ignore garbage characters when
> decoding, but that might want to be a processing error in some cases.
>
> I guess this could be a single option with space separated key/value
> pairs, e.g.
>
>    daf:streamTransformOptions="base64_ignore_garbage=yes
> base64_variant=rfc1421"
>
> That's very extensible, but might not be consistent with the rest of
> DFDL. Maybe we need specific options for each stream transform, e.g.
>
>    daf:streamTransformBase64IgnoreGarbage="yes"
>    daf:streamTransformBase64Variant="rfc1421"
>    ..
>
> MikeB: My suggestion would be to make these parameters part of the algorithm
> name for now. E.g.,
> daf:streamTransform="base64Best" or 
> daf:streamTransform="base64_ignore_garbage".
>
> We're going to need a way to specify many of these stream transforms. 
> Specifying
> gzip with options
> and naming it something new better not be very hard. So perhaps that is good
> enough for now.
>

My only (minor) concern with this is that if something had multiple
options, the combinations of names could expand quickly. But probably
not worth worrying about until that actually happens--it may not be an
issue in practice.

Everything else above sounds good.

(revised 2018-01-04)

This memo describes a proposed feature for expressing data stream pre/post 
processing operations.

Most of the discussion here will use parsing as context, but where the 
unparsing is not clearly symmetric, unparsing will also be described.

New DFDL schema annotations are shown in the "daf:" namespace so as to make 
clear which constructs are standard DFDL and which are the new extensions.

The core concept is a new annotation element daf:stream.

This is similar to dfdl:element. It can hold the same properties, except those 
for array behavior (occursCountKind, etc.).

It also has one additional property: transform=<transformName>

The transform names are NCNames - all reserved. In the future this may become 
extensible, allowing QNames to be used.

The initial transform names will include "base64", "lineFolding", "compress", 
and perhaps others.

A daf:stream annotation describes a stream transformation. Important properties 
for a stream transformation include:

* encoding (literal string or DFDL expression)
* lengthKind (can be implicit, explicit, delimited, pattern, endOfParent, 
prefixed) 
* length - used for lengthKind 'explicit'
* lengthPattern - used for lengthKind 'pattern'
* terminator - (literal string or DFDL expression) - used for lengthKind 
'delimited'; not used nor allowed for other length kinds (TBD: asymmetric with 
terminator on a non-delimited element)
* initiator - (literal string or DFDL expression) - used for lengthKind 
'delimited'; not used nor allowed for other length kinds 

The lengthKind 'implicit' means that the transform algorithm itself determines 
when the encoded data ends.

Unspecified properties are not inherited from any other place: not from the 
default dfdl:format annotation, nor from any other defined stream format. 

Some properties allowed on dfdl:element annotations are not allowed on 
daf:stream annotations. These restrictions may be removed in the future if the 
capabilities they would provide prove to be needed, but initially we expect 
that the following properties allowed on dfdl:element would not be allowed on 
daf:stream:

* escapeSchemeRef - delimited streams are assumed to not need escape schemes.
* lengthUnits (assumed always to be bytes)
* alignment (assumed to be byte aligned)
* alignmentUnits (assumed to be byte aligned)

An additional non-format property:

* daf:streamRef=QName

can appear on dfdl:element, dfdl:sequence, or dfdl:choice annotations. It 
allows convenient use of a named daf:stream annotation.

A named stream annotation is created via

* <daf:defineStream name=NCName >
    <daf:stream .../>
 </daf:defineStream>

The daf:defineStream can only appear at top level (as an annotation of the 
xs:schema element.) The named stream is in the schema's target namespace.

Specifying the streamRef property puts a stream transform into use. The 
streamRef property is specifically not allowed on dfdl:format because it is not 
sensible to put a stream transformation into effect across a lexical scope. 
Stream transforms apply to the dynamic scope of the Term they are associated 
with.

(This might not work out. It may be of value to define streamRef in a format, 
if that format is named, and only referenced from the term that defines the 
dynamic scope where that stream transform is to be used. If we allow streamRef 
on dfdl:format annotations, there are just certain situations where we would 
want SDE errors to be detected, such as if streamTransform is in lexical scope 
over a file.)

A data stream is conceptually a stream of bytes. It can be an input stream for 
parsing or an output stream for unparsing.
Use of the term "stream" here is consistent with Java's use of stream, as in 
InputStream and OutputStream. These are sources and sinks of bytes. If one 
wants to decode characters from them, one must specify the encoding 
explicitly.

A stream transform is a layering to create one stream of bytes from another. An 
underlying stream is encapsulated by a transformation to create an overlying 
stream.

When parsing, reading from the overlying stream causes reading of data from 
the underlying stream; that data is then transformed and becomes the bytes of 
the overlying stream returned from the read.
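The layering can be sketched in Python (illustrative only, not Daffodil's API). Here the "transform" is base64 decoding, done eagerly for brevity; a real layer would transform incrementally as bytes are pulled:

```python
import base64
import io

# A minimal sketch of stream layering: reading from the overlying stream
# pulls bytes from the underlying stream, transforms them, and returns
# the transformed result as the overlying stream's bytes.
def overlying_stream(underlying):
    return io.BytesIO(base64.b64decode(underlying.read()))

underlying = io.BytesIO(b"aGVsbG8=")   # base64 text for b"hello"
overlying = overlying_stream(underlying)
result = overlying.read()
print(result)  # b'hello'
```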

The stream properties apply to the underlying stream data and indicate how to 
identify its bounds/length, and if a stream transform is textual, what encoding 
is used to interpret the underlying bytes.

Some transformations are naturally binary bytes to bytes. Data 
decompress/compress are the typical example here. When parsing, the overlying 
stream's bytes are the result of decompression of the underlying stream's bytes.

If a transform requires text, then a stream encoding must be defined. For 
example, base64 is a transform that creates bytes from text. Hence, a stream 
encoding is needed to convert the underlying stream of bytes into text, then 
the base64 decoding occurs on that text, which produces the bytes of the 
overlying stream.
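A minimal Python sketch of this two-step decode (the utf-16-stored base64 text is a made-up example):

```python
import base64

# Sketch of the pipeline above: the underlying bytes are first decoded
# as text using the stream encoding (utf-16 here), and base64 decoding
# then runs over that text to produce the overlying bytes.
underlying_bytes = "aGVsbG8=".encode("utf-16")   # base64 text stored as utf-16
base64_text = underlying_bytes.decode("utf-16")  # step 1: bytes -> text
overlying_bytes = base64.b64decode(base64_text)  # step 2: text -> decoded bytes
print(overlying_bytes)  # b'hello'
```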

We think of some transforms as text-to-text. Line folding/unfolding is one 
such. Lines of text that are too long are wrapped by inserting a line-ending 
and a space. As a DFDL stream transform this line folding transform requires an 
encoding. The underlying bytes are decoded into characters according to the 
encoding. Those characters are divided into lines, and the line unfolding (for 
parsing) is done to create longer lines of data, the resulting data is then 
encoded from characters back into bytes using the same encoding.
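A Python sketch of the fold/unfold pair, assuming iCalendar-style folding where a line break plus one space marks a fold:

```python
# Folding breaks a long line by inserting a newline plus a space;
# unfolding (the parse direction) removes each newline-plus-space pair
# to restore the original longer line.
def fold(text, width=76):
    out = []
    while len(text) > width:
        out.append(text[:width])
        text = " " + text[width:]   # continuation lines begin with a space
    out.append(text)
    return "\n".join(out)

def unfold(text):
    return text.replace("\r\n ", "").replace("\n ", "")

line = "UID:" + "0" * 100
assert unfold(fold(line)) == line
```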

(There may be opportunities to shortcut these transformations if the overlying 
stream is the data stream for an element with scannable text representation 
using the same character set encoding.)

DFDL can describe a mixture of character set decoding/encoding and binary value 
parsing/unparsing against the same underlying data representation; hence, the 
underlying data stream concept is always one of bytes.

(Note: bytes suffices even for mil-std-2045 which can hold a compressed VMF 
payload. This payload element is always byte aligned even in mil-std-2045, a 
very bit-oriented format.)

Daffodil parsing begins with a default standard data input stream. Unparsing 
begins with a default standard output stream.

When a DFDL schema wants to describe, say, base64 decoding, the DFDL 
annotations might look like this:

<annotation><appinfo source="http://www.ogf.org/dfdl/">
  <daf:defineStream name="compressed">
    <daf:stream transform="gzip" lengthKind="implicit" />
  </daf:defineStream>
</appinfo></annotation>

<element name="foo" daf:streamRef="tns:compressed">
  <complexType>
    <sequence>
      ....
    </sequence>
  </complexType>
</element>

This annotation means: when parsing element foo, take whatever data stream is 
in effect, layer a gzip data stream on it, and use that until the end of the 
gzipped data - in this case until the gzip transform itself determines that the 
compressed data has ended.

The APIs for defining the gzip, base64, or other transformers enable one to do 
these transformations in a streaming manner, on demand as data is pulled from 
the resulting data stream of bytes. Of course it is possible to just convert 
the entire data object, but we want to enable streaming behavior in case 
stream-encoded objects are large.
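As a sketch of the streaming behavior, here is Python's gzip layer over an in-memory byte stream, read in small chunks rather than converted all at once (illustrative, not the proposal's transformer API):

```python
import gzip
import io

# gzip.GzipFile decompresses incrementally as data is pulled from it,
# so the whole compressed object never needs to be expanded up front.
underlying = io.BytesIO(gzip.compress(b"x" * 10000))
overlying = gzip.GzipFile(fileobj=underlying, mode="rb")

total = 0
while True:
    chunk = overlying.read(1024)   # pull 1 KiB at a time, on demand
    if not chunk:
        break
    total += len(chunk)
print(total)  # 10000
```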

We have just seen how the daf:streamRef property is used by element foo as 
part of the data stream transformation.

Let's consider how a stream's length works.

There are now two lengths to describe. One is the length of the data stream 
that is to be transformed. The second is the length of the parsed element 
taken from the result of parsing the stream.

One may have a base64-encoded region with a streamLength of 1000 bytes; within 
that, once decoded, one will have only 750 or so bytes available. That data is 
limited by the 750-byte length of the decoded data. At the time parsing 
begins, neither of these numbers (1000 nor 750) may be known. 
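The 1000/750 relationship is just the base64 expansion ratio: every 3 payload bytes become 4 encoded characters. A quick check in Python:

```python
import base64

# 750 decoded bytes correspond to 750 * 4/3 = 1000 bytes of base64 text
# (no padding needed, since 750 is divisible by 3).
payload = bytes(750)
encoded = base64.b64encode(payload)
print(len(encoded))  # 1000
```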

<daf:defineStream name="fooStream">
  <daf:stream transform="base64" encoding="utf-16" lengthKind="explicit" 
length="{ ../len }"/>
</daf:defineStream>

This data stream will decode utf-16 characters on the underlying data stream, 
then base64 decode that to get a stream of bytes. The length is determined by 
evaluating the expression at the point of use.

<dfdl:defineFormat name="fooFormat">
  <dfdl:format daf:streamRef="tns:fooStream" encoding="utf-8" 
byteOrder="bigEndian"/>
</dfdl:defineFormat>

Then the element declarations:

<element name="len" type="xs:int".../>
<element name="foo" dfdl:ref="tns:fooFormat" type="tns:fooType" 
dfdl:initiator="foo:"/>

In this example, the dfdl:initiator for foo will be decoded as utf-8 
characters from the byte stream produced by the base64 transform. However, 
that base64 text was itself obtained by a utf-16 decode of the underlying byte 
stream. 

For the unparse direction, this len element needs a dfdl:outputValueCalc. The 
calculation needs the length of the base64 encoded data.

This would be expressed as

<element name="len" type="xs:int" dfdl:outputValueCalc="{ 
daf:streamLength(../foo) }"/>

This function daf:streamLength is much like dfdl:valueLength and 
dfdl:contentLength, except that it accesses the
underlying data stream representation. The units are always bytes. 
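A hedged sketch of the arithmetic such an outputValueCalc implies for base64 (the helper name is illustrative, not a proposed function):

```python
import base64
import math

# For base64 with padding, n decoded bytes occupy 4 * ceil(n / 3) bytes
# in the underlying stream -- the value daf:streamLength would report.
def base64_stream_length(decoded_len):
    return 4 * math.ceil(decoded_len / 3)

# cross-check against an actual base64 encoding
for n in (1, 2, 3, 750):
    assert base64_stream_length(n) == len(base64.b64encode(bytes(n)))
print(base64_stream_length(750))  # 1000
```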

Let's look at an example of two interacting data stream transforms.

<daf:defineStream name="foldedLines">
  <daf:stream transform="foldedLines" lengthKind="delimited"/>
</daf:defineStream>

<daf:defineStream name="base64">
  <daf:stream transform="base64" encoding="us-ascii" lengthKind="delimited" 
terminator='{ ./marker }'/>
  <!-- note expression above is ./marker, not ../marker -->
</daf:defineStream>

<xs:sequence daf:streamRef="tns:foldedLines">
  ...
  ... presumably everything here is textual, and utf-8 because foldedLines only 
applies sensibly to text.
  ...
  <xs:element name="marker" type="xs:string" .../>
  <xs:sequence daf:streamRef="tns:base64">
      ...
      ... everything here is parsed against the bytes obtained from base64 
decoding
      ... which is itself decoding the output of the foldedLines transform
      ... above. Base64 requires only us-ascii, which is a subset of utf-8.
      ...
  </xs:sequence>
</xs:sequence>

Summary
* allows stacking transforms one on top of another. So you can have base64 
encoded compressed data as the payload representation of
a child element within a larger element.
* allows specifying properties of the underlying data stream separately from 
the properties of the logical data.
* scopes the transforms over a term (model-group or element)
* prevents inadvertent lexical scoping of a stream transform from a lexically 
enclosing top level format annotation.


Implementation Notes:

Introduction of a stream transform basically appears in the Term grammar as a 
combinator that surrounds the contained Term contents.

Concrete Example:

Consider this VCALENDAR Data:

BEGIN:VCALENDAR
PRODID:
VERSION:1.0
BEGIN:VEVENT
DTSTART:20170903T170000Z
DTEND:20170903T173000Z
LOCATION:test location
UID:040000008200E00074C5B7101A82E0080000000010156B50B224D301000000000000000
        01000000083A43200A4E43F4E800BE12703B99BF0
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:=
 Text that will require line folding: Lorem ipsum dolor sit amet, consecte=
 tur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore=
 magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco=
 laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor i=
 n reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla par=
 iatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui =
 officia deserunt mollit anim id est laborum.=0D=0A=0D=0A =0D=0A=0D=0A=0D==
 =0A
SUMMARY:test subject
PRIORITY:3
END:VEVENT
END:VCALENDAR

We want to create a schema that describes this. 

In the above there are two behaviors that require use of stream transforms. 
First is the UID. This has been broken to a maximum line length of 76 
characters by way of the folded-lines transformation.

The second is the DESCRIPTION which uses a transformation called 
QUOTED-PRINTABLE which both achieves short line lengths, and also enables 
embedding of CR, LF, and other characters at the ends of lines.
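Python's standard quopri module implements this transform and can illustrate both behaviors, soft line breaks and =XX escapes:

```python
import quopri

# A "=" at end of line is a soft line break (removed on decode), and
# =XX escapes such as =0D=0A become the bytes they name (CR LF here).
encoded = b"Text that will require line fold=\ning: Lorem ipsum=0D=0A"
decoded = quopri.decodestring(encoded)
print(decoded)  # b'Text that will require line folding: Lorem ipsum\r\n'
```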

The result is that we want this XML Infoset:

<VCalendar>
  <ProdID>-//Microsoft Corporation//Outlook 15.0 MIMEDIR//EN</ProdID>
  <Version>1.0</Version>
  <VEvent>
    <DTStart>20170903T170000Z</DTStart>
    <DTEnd>20170903T173000Z</DTEnd>
    <Location>test location</Location>
    
<UID>040000008200E00074C5B7101A82E0080000000010156B50B224D30100000000000000001000000083A43200A4E43F4E800BE12703B99BF0</UID>
    <Description>
      <Encoding>QUOTED-PRINTABLE</Encoding>
      <QP/>
      <Value>Text that will require line folding: Lorem ipsum dolor sit amet, 
consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et 
dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco 
laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in 
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia 
deserunt mollit anim id est laborum.&#xE00D;
&#xE00D;
 &#xE00D;
&#xE00D;
&#xE00D;
</Value>
    </Description>
    <Summary>test subject</Summary>
    <Priority>3</Priority>
  </VEvent>
</VCalendar>

Notice the CRLFs at the end. I've represented the CRs as remapped to PUA E00D 
entities.

The DFDL schema for this follows, including the specification of the stream 
transform behaviors.

<xs:schema ....>

<dfdl:format separatorPosition="infix" lengthKind="delimited" encoding="utf-8"
  occursCountKind="parsed" separator="" sequenceKind="ordered"/>

<daf:defineStream name="folded">
  <daf:stream transform="foldedLines" lengthKind="delimited" 
encoding="us-ascii"/>
  <!-- delimited here means to enclosing terminating markup, as no terminator 
is defined. -->
</daf:defineStream>

<daf:defineStream name="qp">
  <daf:stream transform="quotedPrintable" lengthKind="pattern"
     lengthPattern="[\s\S]*?(?=(?<!=)\n)"/>
  <!-- QPs are terminated by a newline that is not preceded by an =. This final 
newline is not consumed as part of the content. -->
  <!-- Alternatively, the QP transform itself can determine the length by 
searching for this final newline (but leaving it there).
       In which case the lengthKind would be "implicit" -->
</daf:defineStream>
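As a sketch of the boundary rule described in the comments above: the region ends at the first newline not preceded by "=", and the lookahead leaves that newline unconsumed. The pattern below uses [\s\S] so the match can span the soft line breaks inside the region (an assumption about the intended behavior):

```python
import re

# Lazy match up to the first newline NOT preceded by "="; the lookahead
# keeps that final newline out of the matched content.
qp_region = re.compile(r"[\s\S]*?(?=(?<!=)\n)")
data = "first part=\nsecond part\nSUMMARY:test subject"
match = qp_region.match(data).group()
print(repr(match))  # 'first part=\nsecond part'
```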

<xs:element name="VCalendar" dfdl:initiator="BEGIN:VCALENDAR%NL;" 
dfdl:terminator="END:VCALENDAR%NL; END:VCALENDAR">
  <xs:complexType>
    <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
      <xs:element name="ProdID" type="xs:string" dfdl:initiator="PRODID:" 
minOccurs="0" daf:streamRef="tns:folded"/>
      <xs:element name="Version" type="xs:string" dfdl:initiator="VERSION:" 
minOccurs="0" />
      <xs:element name="VEvent" maxOccurs="unbounded" minOccurs="0" 
dfdl:occursCountKind="parsed"
        dfdl:initiator="BEGIN:VEVENT%NL;" dfdl:terminator="END:VEVENT">
        <xs:complexType>
          <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
            <xs:element name="DTStart" type="xs:string" 
dfdl:initiator="DTSTART:" />
            <xs:element name="DTEnd" type="xs:string" dfdl:initiator="DTEND:" />
            <xs:element name="Location" type="xs:string" 
dfdl:initiator="LOCATION:" minOccurs="0"  daf:streamRef="tns:folded"/>
            <xs:element name="UID" type="xs:string" dfdl:initiator="UID:" 
minOccurs="0"  daf:streamRef="tns:folded"/>
            <xs:element name="Description" dfdl:initiator="DESCRIPTION:" 
minOccurs="0">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="Encoding" type="xs:string" 
dfdl:initiator="ENCODING=" dfdl:terminator=":" minOccurs="0" />
                  <xs:choice dfdl:choiceDispatchKey="{ if 
(fn:exists(./Encoding)) then ./Encoding else '' }">

                    <!-- we inspect the value of the Encoding element and 
decide what branch of the choice
                         based on it -->
                         
                    <xs:sequence dfdl:choiceBranchKey="QUOTED-PRINTABLE"
                      dfdl:separator="" dfdl:sequenceKind="unordered">
                      <!--
                        Each branch starts with a distinct dummy element to 
satisfy the UPA rules of XML Schema
                        -->
                      <xs:element name="QP" type="xs:string" 
dfdl:inputValueCalc="{ '' }" />

                      <!--
                        Here notice that the streamRef for the qp data is 
scoped to just this inner element.
                        -->
                        
                      <xs:element name="Value" type="xs:string" 
daf:streamRef="tns:qp" />
                    </xs:sequence>

                   <!-- repeat the above pattern for the choice branches for 
the various encodings -->
                   
                  </xs:choice>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
            <xs:element name="Summary" type="xs:string" 
dfdl:initiator="SUMMARY:" minOccurs="0"  daf:streamRef="tns:folded"/>
            <xs:element name="Priority" type="xs:string" 
dfdl:initiator="PRIORITY:" minOccurs="0" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>


</xs:schema>


Clarifications:

When a stream transform is specified on an element, the transform applies to 
the entire EnclosedElement region of the grammar, so the transform applies to 
any framing.
On a model group, the transform applies to the entire Sequence region or 
entire Choice region of the grammar, so it is inclusive of all framing.
To exclude framing such as an initiator/terminator from transformation, 
encapsulate the content in an xs:sequence that carries the initiator and 
terminator, and place the transformation only on the SequenceContent.



Example of multi-layer transformation:

Here's some CSV data

last,first,middle,DOB
smith,robert,brandon,1988-03-24
johnson,john,henry,1986-01-23
jones,arya,cat,1986-02-19

Here's that data gzipped, then base64 encoded.

H4sICBqITloAA3NpbXBsZUNTVi5jc3YALclBCoAgEIXhvWeZgbSI3Eb7zjCmoWEjjG66fQZt3g/v
y1QbnEn63sn7HGDbV1Xv1CJIcUEaOCH2hUHbZcFhRDOpq0Su/foKMbA8n844aDRjVw4VSB6Cg9ov
BrVVL2G135RuAAAA
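The layering can be checked in the forward direction with a short Python sketch over a subset of the CSV above; the inverse is exactly what the parser's layered streams perform:

```python
import base64
import gzip

# Forward direction: gzip the CSV, then base64-encode the compressed
# bytes. Parsing applies the inverse layers: base64 decode, then gunzip.
csv_data = b"last,first,middle,DOB\nsmith,robert,brandon,1988-03-24\n"
encoded = base64.b64encode(gzip.compress(csv_data))

decoded = gzip.decompress(base64.b64decode(encoded))
assert decoded == csv_data
```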

The schema that describes the CSV data without the stream transforms is this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
xmlns:fn="http://www.w3.org/2005/xpath-functions"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:ex="http://example.com"
  targetNamespace="http://example.com" elementFormDefault="unqualified">

  <xs:include schemaLocation="built-in-formats.xsd" />

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format ref="ex:daffodilTest1" separator="" initiator=""
        terminator="" leadingSkip='0' textTrimKind="none" initiatedContent="no"
        alignment="implicit" alignmentUnits="bits" trailingSkip="0" 
ignoreCase="no"
        separatorPosition="infix" occursCountKind="implicit"
        emptyValueDelimiterPolicy="both" representation="text" 
textNumberRep="standard"
        lengthKind="delimited" encoding="ASCII" />
    </xs:appinfo>
  </xs:annotation>

    <xs:element name="file" type="ex:fileType"/>

    <!-- broke this up to provide some reusable types and groups here -->

    <xs:complexType name="fileType">
      <xs:group ref="ex:fileTypeGroup"/>
    </xs:complexType>
    
    <xs:group name="fileTypeGroup">
      <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="postfix">
        <xs:element name="header" minOccurs="0" maxOccurs="1"
          dfdl:occursCountKind="implicit">
          <xs:complexType>
            <xs:sequence dfdl:separator=",">
              <xs:element name="title" type="xs:string" maxOccurs="unbounded" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="record" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence dfdl:separator=",">
              <xs:element name="item" type="xs:string" maxOccurs="unbounded"
                dfdl:occursCount="{ fn:count(../../header/title) }"
                dfdl:occursCountKind="expression" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:group>

</xs:schema>

We can annotate this schema with additional stream transform information to 
enable it to describe the base64 encoded, compressed data.

One easy way to do this is by modifying the complex type definition for 
fileType to this:

<xs:complexType name="fileType">
  <xs:sequence daf:streamRef="ex:base64">
    <xs:sequence daf:streamRef="ex:gzip">
      <xs:group ref="ex:fileTypeGroup"/>
    </xs:sequence>
  </xs:sequence>
</xs:complexType>

Along with that we need the definitions of these named stream formats:

<daf:defineStream name="base64">
   <daf:stream transform="base64" lengthKind="implicit" />
</daf:defineStream>

<daf:defineStream name="gzip">
   <daf:stream transform="gzip" lengthKind="implicit"/>
</daf:defineStream>

These transforms, with lengthKind 'implicit', are assumed to be 
"self-delimiting", meaning they know how much data to consume. 

If there were extra bytes of data added on the end, then not all data would be 
consumed by gzip; there would be leftover data. (Whether or not that is an 
error is application dependent: it depends on the API used and whether it 
assumes the whole stream will be consumed or not.)
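Python's zlib shows this leftover-data behavior concretely (gzip is self-delimiting, so trailing bytes land in unused_data):

```python
import gzip
import zlib

# A streaming decompressor stops at the end of the gzip member; any
# trailing bytes remain unconsumed and are exposed as unused_data.
payload = gzip.compress(b"hello") + b"EXTRA BYTES"
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect gzip framing
out = d.decompress(payload)
print(out, d.unused_data)  # b'hello' b'EXTRA BYTES'
```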


