Proposal for Implementing Base64, folded lines, quoted-printable, compress/decompress, etc.

Mike Beckerle Thu, 16 Nov 2017 18:40:59 -0800

This email is to start a discussion of features that would let DFDL express 
more data formats, particularly those that apply some form of algorithmic 
encoding (as opposed to charset encoding) to part or all of the data.



IETF data formats make extensive use of base64 encoding of binary data for 
inclusion in textual data.

In addition, the textual formats make use of line folding: a line longer than 
72 characters is continued on the next line by beginning that line with a 
space (or tab? not sure).
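
To make the folding idea concrete, here is a minimal Python sketch. It assumes 
that a continuation line starts with exactly one space or tab and that this 
character is dropped when unfolding; some formats keep the whitespace instead, 
so this is one possible convention, not a statement about any particular 
standard. The names fold and unfold and the 72-character limit are illustrative.

```python
import re


def unfold(text: str) -> str:
    """Join folded lines: drop each line break that is followed by a space
    or tab, together with that one whitespace character (an assumption)."""
    return re.sub(r"\r?\n[ \t]", "", text)


def fold(line: str, limit: int = 72) -> str:
    """Split a long line into pieces of at most `limit` characters, marking
    each continuation with a leading space that counts toward the limit."""
    pieces = [line[:limit]]
    rest = line[limit:]
    while rest:
        pieces.append(" " + rest[: limit - 1])
        rest = rest[limit - 1:]
    return "\r\n".join(pieces)
```

Unfolding what fold produced gives back the original line, which is the 
round-trip property a transfer encoding of this kind would need.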


There are many other schemes where part of a data representation has to be 
algorithmically decoded before the DFDL parsing can process it.


A good example comes from the MIL-STD-2045 message header format. This header 
has flags that indicate whether the message contents are compressed, and 
with what compression algorithm. Parsing needs to choose among several 
algorithms based on values computed from the data. Unparsing similarly must 
determine which compression algorithm to use to compress the message contents.


Our plan in implementing this feature in Daffodil is to gain experience with 
it and, at such time as we're satisfied with it, to propose the feature for 
inclusion in a future revision of the DFDL standard.


Perhaps there is a better name, but for this email we'll use the property 
dfdl:transferEncoding. This term comes from MIME where data can be transported 
encoded in a content transfer encoding designed to protect binary data from 
corruption, etc.


What is proposed is:


dfdl:transferEncoding takes a whitespace separated list of transfer encoding 
names. The empty string means no transfer encoding will be used. An expression 
can be used to evaluate to the whitespace separated list, or to the empty 
string.


A transfer encoding name identifies a transfer encoding algorithm. This 
algorithm can be

  *   bytes to bytes - example compress
  *   bytes to text - TBD (needed?)
  *   text to bytes - example base64, AIS
  *   text to text - TBD (needed?)


The whitespace-separated list must consist of compatible transfer encoding 
algorithms. The first named algorithm is applied first. So, assuming these 
identifiers are valid, dfdl:transferEncoding="base64 zip" would mean the data 
is text, which is converted from text to bytes by the base64 decoder, and then 
from bytes to bytes by the unzip decoder. The inverse happens when unparsing.
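
The ordering and compatibility rules can be sketched in a few lines of Python. 
This is only an illustration under stated assumptions: the registry, the 
function names, and the use of zlib for an encoding called "compress" are all 
hypothetical and are not Daffodil code.

```python
import base64
import zlib

# Hypothetical registry: name -> (input kind, output kind, decoder, encoder).
# The names, and zlib standing in for "compress", are assumptions.
REGISTRY = {
    "base64": ("text", "bytes",
               lambda s: base64.b64decode(s),
               lambda b: base64.b64encode(b).decode("ascii")),
    "compress": ("bytes", "bytes", zlib.decompress, zlib.compress),
}


def check_compatible(names):
    """Each algorithm's output kind must match the next one's input kind."""
    for first, second in zip(names, names[1:]):
        if REGISTRY[first][1] != REGISTRY[second][0]:
            raise ValueError(f"{first!r} cannot feed {second!r}")


def parse_side(transfer_encoding, data):
    names = transfer_encoding.split()
    check_compatible(names)
    for name in names:              # first named algorithm is applied first
        data = REGISTRY[name][2](data)
    return data


def unparse_side(transfer_encoding, data):
    names = transfer_encoding.split()
    check_compatible(names)
    for name in reversed(names):    # inverse order when unparsing
        data = REGISTRY[name][3](data)
    return data
```

With this sketch the empty string leaves the data untouched, and a list such 
as "compress base64" is rejected because bytes output cannot feed a decoder 
that expects text.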


When a DFDL element has a dfdl:transferEncoding, then the length of that 
element is the length of the transfer-encoded representation of the data.


For example: an element of complex type can have a prefixed length indicating 
it is 16457 bytes long. If its transfer encoding specifies zip compression, 
then these 16457 bytes would be unzipped and the result would be larger. For 
example, it could expand to 50873 bytes. The content of the complex type would 
then be parsed from these 50873 bytes.
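
A small Python sketch of the length rule follows. The 4-byte big-endian prefix, 
the use of zlib, and the function name are assumptions made for illustration 
only; the point is that the prefix counts the encoded bytes, not the decoded 
content.

```python
import struct
import zlib


def parse_prefixed_compressed(stream: bytes, offset: int = 0):
    """Return (decoded content, offset just past the element)."""
    # The prefix gives the length of the ENCODED representation.
    (length,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    encoded = stream[start:start + length]
    # The decoded content is typically longer than `length`.
    content = zlib.decompress(encoded)
    return content, start + length
```

The returned offset depends only on the prefixed length, so whatever follows 
the element is found at the right place regardless of how much the content 
expands.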


The implementation of transfer encodings generally involves Daffodil's parser 
and unparser combinators.

Consider parsing first. The combinator would take action before and after 
parsing the content of the element. In the before action, the Daffodil 
DataInputStream would be encapsulated by another implementation of 
DataInputStream; except that this encapsulating stream would implement the 
transfer encoding decoder algorithm, reading data from the underlying 
DataInputStream. Multiple transfer encodings would result in multiple such 
encapsulations layered one upon the other.


After the content is parsed, the combinator's after action is to 
unencapsulate the DataInputStream, returning to the original DataInputStream, 
from which some data will have been consumed.


The position of the original DataInputStream must be precise: exactly the 
position after the last bit of the transfer-encoded data.
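
The encapsulation and the position requirement can be sketched as follows. 
This is not Daffodil's DataInputStream API: the class DecompressingLayer and 
its methods are hypothetical, Python's io.BytesIO stands in for the underlying 
stream, zlib stands in for the decoder, and positions are tracked in bytes 
rather than bits.

```python
import io
import zlib


class DecompressingLayer:
    """Hypothetical stand-in for an encapsulating input stream."""

    def __init__(self, underlying, chunk_size=64):
        self.underlying = underlying
        self.chunk_size = chunk_size
        self.inflater = zlib.decompressobj()

    def read_all(self) -> bytes:
        """Pull chunks from the underlying stream until the decoder is done."""
        out = bytearray()
        while not self.inflater.eof:
            chunk = self.underlying.read(self.chunk_size)
            if not chunk:
                raise EOFError("compressed data ended early")
            out += self.inflater.decompress(chunk)
        return bytes(out)

    def close(self):
        """Unencapsulate: give back bytes read past the encoded data."""
        overshoot = len(self.inflater.unused_data)
        self.underlying.seek(-overshoot, io.SEEK_CUR)
        return self.underlying
```

Because the layer reads in chunks, it can overshoot the end of the encoded 
data. The close step repositions the original stream so that the next element 
starts exactly where the transfer-encoded data ended.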


Some formats will require nested elements such that an outer element having a 
transfer encoding specified can have a text dfdl:encoding property specifying 
the text charset used in the transfer-encoded representation. The inner nested 
element can then have a different dfdl:encoding property - which is used to 
interpret the decoded data as text.  For example suppose you have a large text 
string in UTF-8. This can be compressed to get bytes, and those bytes base64 
encoded into the US-ASCII charset. This would be expressed by something like


<element name="outer" dfdl:encoding="us-ascii" dfdl:transferEncoding="base64 compress">
   <complexType>
     <sequence>
       <element name="inner" type="xs:string" dfdl:encoding="utf-8" dfdl:lengthKind="delimited"/>
   ....
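
The two roles of dfdl:encoding in this example can be traced by hand in 
Python. This is a sketch under assumptions: zlib stands in for "compress", and 
the sample text is made up.

```python
import base64
import zlib

text = "Grüße, 世界! " * 20                   # value of the inner element

# Unparse direction: UTF-8 text -> compressed bytes -> base64 characters.
inner_bytes = text.encode("utf-8")            # inner dfdl:encoding
compressed = zlib.compress(inner_bytes)       # "compress" (zlib is an assumption)
on_the_wire = base64.b64encode(compressed)    # every byte is a US-ASCII character

# Parse direction: the outer encoding reads the wire bytes as US-ASCII text,
outer_text = on_the_wire.decode("us-ascii")
# base64 and then compress are undone in the listed order,
decoded = zlib.decompress(base64.b64decode(outer_text))
# and the inner encoding interprets the result as UTF-8 text.
assert decoded.decode("utf-8") == text
```

The outer charset only ever sees base64 characters, while the non-ASCII 
characters of the original string exist only after both transfer encodings 
have been undone.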


About extensibility


It was a goal for this set of transfer encodings to be readily extensible. This 
is because many formats have specific encodings particular to them. AIS has 
one, ASN.1 BER has one (so-called "object" encoding), and there are a wide 
variety of compression algorithms.


However, it is probably best to build some of these transfer encoders/decoders 
first, and then consider what is necessary to specify one without access to 
Daffodil internal classes and data structures.


About MIME names for encodings.


TBD: identifiers like base64 mean different things in different contexts. In 
the XML world it is just an algorithm for creating a single long string of 
characters. (Much like how hexBinary means a single long string of hex digits).

But in IETF Internet Message Format, base64 means a particular syntax with 
lines of a specific length. An IMF base64 encoded binary has a block structure 
with human-tolerable line-lengths (max 72) and a specific introduction and 
termination to indicate the start/end.


Perhaps use QNames so that ietf:base64 or mime:base64 can provide the 
distinctions using normal namespace qualification.
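
The difference between the two meanings shows up even within one standard 
library. In the Python sketch below, b64encode produces a single unbroken 
string, while encodebytes produces MIME-style output broken into lines. The 
76-character line length is simply what that Python helper uses; other formats 
use other limits and may add their own start/end markers.

```python
import base64

data = bytes(range(256))

one_string = base64.b64encode(data).decode("ascii")     # no line breaks
mime_style = base64.encodebytes(data).decode("ascii")   # newline after every 76 chars

assert "\n" not in one_string
assert all(len(line) <= 76 for line in mime_style.splitlines())
# Both decode to the same bytes; only the surface syntax differs.
assert base64.b64decode(one_string) == base64.b64decode(mime_style) == data
```

A decoder for the line-structured form has to know about the line breaks, 
which is why a single unqualified name like base64 may not be enough.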


TBD: parameters to transfer encoding algorithms.


We may need some way to express these. Perhaps a URL-style thing like

dfdl:transferEncoding='compress?method=bz2'
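
If the URL-style syntax were adopted, splitting a property value into names and 
parameters could look like the Python sketch below. The function name and the 
result shape are hypothetical, and the sketch assumes parameter values contain 
no whitespace, since whitespace already separates the list items.

```python
from urllib.parse import parse_qsl


def parse_transfer_encoding(value: str):
    """Split a whitespace-separated list into (name, params) pairs."""
    result = []
    for item in value.split():
        # Everything before the first "?" is the name, the rest is a query.
        name, _, query = item.partition("?")
        result.append((name, dict(parse_qsl(query))))
    return result
```

For example, "base64 compress?method=bz2" would yield base64 with no 
parameters, followed by compress with method set to bz2.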


...mike beckerle
