This email is to start a discussion of features that would enable DFDL to
express more data formats, particularly those that apply some form of
algorithmic encoding (as opposed to charset encoding) to part or all of the
data.
IETF data formats make extensive use of base64 encoding of binary data for
inclusion in textual data.
In addition, the textual formats make use of line folding: a line longer than
about 72 characters is continued on the next line, which begins with a space
(or a tab; RFC 5322, for one, allows either).
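To make the folding rule concrete, here is a minimal Python sketch of
unfolding. It assumes the iCalendar-style convention (RFC 5545), where the line
break and the single whitespace character after it are both removed; other
formats, such as RFC 5322 headers, remove only the line break. The function
name unfold is made up for illustration.

```python
import re

def unfold(text: str) -> str:
    """Join folded continuation lines back onto the line they extend."""
    # A line break followed by one space or tab marks a continuation.
    # Assumption: the break AND that one whitespace character are dropped.
    return re.sub(r"\r?\n[ \t]", "", text)

folded = "DESCRIPTION:a long value that was fol\r\n ded onto a second line\r\nSUMMARY:x\r\n"
print(unfold(folded))
```

A DFDL transfer encoding for folded text would apply this kind of step before
the ordinary parse sees the data, and the inverse (re-folding) when unparsing.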
There are many other schemes where part of a data representation has to be
algorithmically decoded before the DFDL parsing can process it.
A good example comes from the MIL-STD-2045 message header format. This header
has flags that indicate whether the message contents are to be compressed, and
with what compression algorithm. Parsing needs to choose among several
algorithms based on values computed from the data. Unparsing similarly must
determine which compression algorithm to use to compress the message contents.
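As a rough illustration of this kind of data-dependent choice, the Python
sketch below selects a decompressor from a flag that precedes the payload. The
one-byte flag, its values (0, 1, 2), and the algorithms chosen are all
hypothetical; they are not taken from MIL-STD-2045.

```python
import bz2
import zlib

# Hypothetical flag values -> (decoder for parsing, encoder for unparsing).
ALGORITHMS = {
    0: (lambda b: b, lambda b: b),        # no compression
    1: (zlib.decompress, zlib.compress),
    2: (bz2.decompress, bz2.compress),
}

def parse_message(data: bytes) -> tuple[int, bytes]:
    """Read the flag byte, then decode the rest with the matching algorithm."""
    flag, payload = data[0], data[1:]
    decoder, _ = ALGORITHMS[flag]
    return flag, decoder(payload)

def unparse_message(flag: int, contents: bytes) -> bytes:
    """Write the flag byte, then the contents encoded with the same algorithm."""
    _, encoder = ALGORITHMS[flag]
    return bytes([flag]) + encoder(contents)
```

In DFDL terms, the flag would be an earlier element, and the property value
would be computed from it by an expression.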
Our plan is to implement this feature in Daffodil, gain experience with it,
and, at such time as we're satisfied with it, propose the feature for
inclusion in a future revision of the DFDL standard.
Perhaps there is a better name, but for this email we'll use the property
dfdl:transferEncoding. This term comes from MIME where data can be transported
encoded in a content transfer encoding designed to protect binary data from
corruption, etc.
What is proposed is:
dfdl:transferEncoding takes a whitespace-separated list of transfer encoding
names. The empty string means no transfer encoding will be used. An expression
can be used to evaluate to the whitespace-separated list, or to the empty
string.
A transfer encoding name identifies a transfer encoding algorithm. Described
in the parsing (decoding) direction, this algorithm can be
* bytes to bytes - example compress
* bytes to text - TBD (needed?)
* text to bytes - example base64, AIS
* text to text - TBD (needed?)
The whitespace-separated list must consist of compatible transfer encoding
algorithms, meaning the output kind (text or bytes) of each algorithm matches
the input kind of the next. The first named algorithm is applied first when
parsing. So, assuming these identifiers are valid,
dfdl:transferEncoding="base64 zip" would mean the data is text, which will be
converted from text to bytes by the base64 decoder, and then from bytes to
bytes by the unzip decoder. The inverse happens when unparsing.
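The ordering rule can be sketched in a few lines of Python. This is only an
illustration, not Daffodil code: the CODECS table and the function names are
invented here, zlib stands in for "zip", and text is represented as ASCII
bytes for simplicity.

```python
import base64
import zlib

# name -> (input kind, output kind, decoder, encoder), in the parsing direction.
# Assumption: "zip" is mapped to zlib purely as a stand-in.
CODECS = {
    "base64": ("text", "bytes", base64.b64decode, base64.b64encode),
    "zip": ("bytes", "bytes", zlib.decompress, zlib.compress),
}

def check_compatible(prop: str) -> None:
    """Each algorithm's output kind must match the next one's input kind."""
    names = prop.split()
    for first, second in zip(names, names[1:]):
        if CODECS[first][1] != CODECS[second][0]:
            raise ValueError(f"{first} output does not match {second} input")

def decode(prop: str, data: bytes) -> bytes:
    """Parsing: apply the decoders left to right."""
    check_compatible(prop)
    for name in prop.split():
        data = CODECS[name][2](data)
    return data

def encode(prop: str, data: bytes) -> bytes:
    """Unparsing: apply the encoders right to left (the inverse order)."""
    check_compatible(prop)
    for name in reversed(prop.split()):
        data = CODECS[name][3](data)
    return data
```

With this, encode("base64 zip", data) compresses first and base64-encodes
second, decode("base64 zip", ...) undoes that, and the empty string leaves the
data untouched.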
When a DFDL element has a dfdl:transferEncoding, then the length of that
element is the length of the transfer-encoded representation of the data.
For example: an element of complex type can have a prefixed length indicating
it is 16457 bytes long. If its transfer encoding specifies zip compression,
then these 16457 bytes would be unzipped and the result would be larger. For
example, it could expand to 50873 bytes. The content of the complex type would
then be parsed from these 50873 bytes.
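A small Python sketch of this length rule follows. It assumes a 4-byte
big-endian length prefix and uses zlib as the compression algorithm; both are
choices made only for the example, as is the function name parse_prefixed.

```python
import struct
import zlib

def parse_prefixed(stream: bytes, offset: int = 0) -> tuple[bytes, int]:
    """Return (decoded content, offset just past the encoded element)."""
    # The prefix gives the length of the ENCODED (compressed) representation.
    (encoded_len,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    encoded = stream[start:start + encoded_len]
    # The decoded content is usually longer; it is what the inner parse sees.
    content = zlib.decompress(encoded)
    return content, start + encoded_len

content = b"some repetitive message body. " * 1000
encoded = zlib.compress(content)
wire = struct.pack(">I", len(encoded)) + encoded + b"NEXT"
decoded, end = parse_prefixed(wire)
print(len(encoded), "encoded bytes ->", len(decoded), "decoded bytes")
```

The returned offset depends only on the encoded length, so whatever follows
the element (here the bytes NEXT) is found in the right place regardless of
how much the content expands.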
The implementation of transfer encodings generally involves Daffodil's parser
and unparser combinators.
Consider parsing first. The combinator would take action before and after
parsing the content of the element. In the before action, the Daffodil
DataInputStream would be encapsulated by another implementation of
DataInputStream; except that this encapsulating stream would implement the
transfer encoding decoder algorithm, reading data from the underlying
DataInputStream. Multiple transfer encodings would result in multiple such
encapsulations layered one upon the other.
After the content is parsed, the action taken after by the combinator is to
unencapsulate the DataInputStream, returning to the original DataInputStream,
from which some data will have been consumed.
The position of the original DataInputStream must be precise and exactly the
position after the last bit of the transfer-encoded data.
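The position requirement is the subtle part, because a decoder that reads in
chunks will normally read past the end of the encoded data. The Python sketch
below shows one way to handle that, using zlib's record of unused input to
rewind the underlying stream. It works at byte rather than bit granularity,
uses io.BytesIO in place of Daffodil's DataInputStream, and the function name
is invented for the example.

```python
import io
import zlib

def read_compressed_element(underlying: io.BytesIO, chunk_size: int = 16) -> bytes:
    """Decode one zlib-compressed element, leaving `underlying` just past it."""
    decoder = zlib.decompressobj()
    out = bytearray()
    while not decoder.eof:
        chunk = underlying.read(chunk_size)
        if not chunk:
            raise ValueError("underlying stream ended inside the encoded data")
        out += decoder.decompress(chunk)
    # The last chunk may extend past the compressed data; zlib keeps those
    # extra bytes in unused_data, so step the underlying stream back over them.
    underlying.seek(-len(decoder.unused_data), io.SEEK_CUR)
    return bytes(out)

payload = zlib.compress(b"inner content " * 10)
stream = io.BytesIO(payload + b"NEXT")
inner = read_compressed_element(stream)
```

After the call, the stream sits exactly at the first byte following the
compressed data, so parsing of the next element can continue from there.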
Some formats will require nested elements such that an outer element having a
transfer encoding specified can have a text dfdl:encoding property specifying
the text charset used in the transfer-encoded representation. The inner nested
element can then have a different dfdl:encoding property - which is used to
interpret the decoded data as text. For example, suppose you have a large text
string in UTF-8. This can be compressed to get bytes, and those bytes base64
encoded into the US-ASCII charset. This would be expressed by something like:
<element name="outer" dfdl:encoding="us-ascii"
         dfdl:transferEncoding="base64 compress">
  <complexType>
    <sequence>
      <element name="inner" type="xs:string" dfdl:encoding="utf-8"
               dfdl:lengthKind="delimited"/>
      ....
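The two charsets in this example play different roles, which the following
Python sketch tries to show. It is an illustration only; zlib stands in for
the "compress" algorithm, and the sample string is made up.

```python
import base64
import zlib

text = "a large text string with non-ASCII characters: é ü ñ. " * 20

# Unparsing direction: the inner element is encoded as UTF-8, then the
# transfer encodings are applied in reverse order of the property value.
inner_bytes = text.encode("utf-8")          # inner dfdl:encoding="utf-8"
compressed = zlib.compress(inner_bytes)     # "compress" (zlib as a stand-in)
wire = base64.b64encode(compressed)         # "base64", yields US-ASCII text

# Parsing direction: the outer representation is read as US-ASCII text,
# base64-decoded, decompressed, and only then interpreted as UTF-8.
outer_text = wire.decode("us-ascii")        # outer dfdl:encoding="us-ascii"
decoded = zlib.decompress(base64.b64decode(outer_text)).decode("utf-8")
```

The outer charset governs only the base64 characters on the wire; the inner
charset governs the text recovered after both decoding steps.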
About extensibility
It is a goal for this set of transfer encodings to be readily extensible. This
is because many formats have specific encodings particular to them. AIS has
one, ASN.1 BER has one (so-called "object" encoding), and there are a wide
variety of compression algorithms.
However, it is probably best to build some of these transfer encoders/decoders
first, and then consider what is necessary to specify one without access to
Daffodil internal classes and data structures.
About MIME names for encodings.
TBD: identifiers like base64 mean different things in different contexts. In
the XML world it is just an algorithm for creating a single long string of
characters (much like how hexBinary means a single long string of hex digits).
But in IETF Internet Message Format, base64 means a particular syntax with
lines of a specific length. An IMF base64-encoded binary has a block structure
with human-tolerable line lengths (max 76 characters per RFC 2045) and a
specific introduction and termination to indicate the start/end.
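Python's standard library happens to expose both flavors, which makes the
difference easy to see. In the sketch below, base64.b64encode produces the
single long string, while base64.encodebytes produces MIME-style output broken
into lines of at most 76 characters. The sample data is arbitrary.

```python
import base64

data = bytes(range(256))

# XML-style: one unbroken run of base64 characters.
single = base64.b64encode(data)

# MIME-style: the same characters, with a newline after every 76 of them.
wrapped = base64.encodebytes(data)

print(len(single), "characters on one line")
print(len(wrapped.splitlines()), "lines in the wrapped form")
```

Both forms decode to the same bytes, but a parser that expects one form will
not necessarily accept the other, which is why a single name "base64" is
ambiguous.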
Perhaps use QNames so that ietf:base64 or mime:base64 can provide the
distinctions using normal namespace qualification.
TBD: parameters to transfer encoding algorithms.
We may need some way to express these. Perhaps a URL-style thing like
dfdl:transferEncoding='compress?method=bz2'
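If the URL-style syntax were adopted, splitting a property value into names
and parameters would be straightforward. The Python sketch below is one
possible reading of that syntax, not a proposal for the exact grammar; the
function name and the choice to keep only the last value of a repeated
parameter are assumptions.

```python
from urllib.parse import parse_qs

def parse_transfer_encoding(value: str) -> list[tuple[str, dict[str, str]]]:
    """Split a property value into (name, parameters) pairs, in order."""
    result = []
    for item in value.split():
        # Everything after the first "?" is treated as a query string.
        name, _, query = item.partition("?")
        # Assumption: if a parameter is repeated, the last value wins.
        params = {key: values[-1] for key, values in parse_qs(query).items()}
        result.append((name, params))
    return result

print(parse_transfer_encoding("base64 compress?method=bz2"))
```

One consequence of this reading is that parameter values cannot contain
whitespace unless it is percent-encoded, since whitespace separates the
encoding names.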
...mike beckerle