I've been working on the implementation of a bucketing data
structure/algorithm for the DataInputStream to better support
streaming data for a while now. It's mostly complete, but I've found
that the biggest hurdle to integrating it into our DataInputStream and
having something clean and maintainable is the Java CharsetDecoders.
While the Java CharsetDecoder API is pretty nice for general usage,
for our specific use case, it causes lots of headaches and
difficulties. Some of the issues that make things difficult for our
use are listed below:
1. We must convert our DataInputStream to one or more
ByteBuffers. This isn't difficult, but does make for more complex
code and lots of edge cases to handle. It also sometimes requires
copying large chunks of data to create ByteBuffers, which should be
avoided.
2. ByteBuffers make it difficult to track how many bytes were decoded,
for example with non-byte size encodings or when replacement
characters are substituted for encoding errors. It is especially
difficult/inefficient to figure out how many bytes represented the
match of a regular expression. Knowing how many bytes decoded to a
character is necessary to maintain our internal bit position state.
3. CharsetDecoders can contain private state. This means we often need
to be conservative about how decoders are reset to ensure previous
state does not affect decoding.
4. It is difficult to decode just a single character without messy
edge cases due to surrogate pairs.
5. CharsetDecoders have a rigid implementation. Many methods are
final or private, making it difficult/impossible to make changes to
decode behavior.
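To make issues 2 and 4 concrete, here is a minimal sketch using only the standard java.nio API. The object and method names (SingleCharDecodeDemo, decodeOne) are made up for illustration and are not Daffodil code. It asks a stock UTF-8 decoder for exactly one Char and reports how many bytes were consumed, which is the bookkeeping we currently have to do by inspecting buffer positions.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.StandardCharsets

object SingleCharDecodeDemo {
  // Try to decode exactly one Char from UTF-8 bytes with a stock
  // java.nio decoder. Returns (overflowed, bytesConsumed, charsProduced).
  def decodeOne(bytes: Array[Byte]): (Boolean, Int, Int) = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
    val in = ByteBuffer.wrap(bytes)
    val out = CharBuffer.allocate(1) // room for a single Char only
    val result = decoder.decode(in, out, true)
    // The only way to learn how many bytes were used is to look at
    // the buffer positions after the call.
    (result.isOverflow, in.position(), out.position())
  }

  def main(args: Array[String]): Unit = {
    // 'A' is one byte in UTF-8 and one Char.
    println(decodeOne("A".getBytes(StandardCharsets.UTF_8)))
    // U+1F600 is four bytes in UTF-8 but needs two Chars (a surrogate pair),
    // so it cannot fit in a one-Char output buffer.
    println(decodeOne("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8)))
  }
}
```

On the OpenJDK implementations I have looked at, the second call reports overflow without consuming any input, so the caller has to retry with a larger buffer and then deal with getting two Chars back.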
For these reasons, I propose that, as part of the streaming/bucketing
changes, we also create our own decoders. The benefits from this are:
1. Decoders can be made to read directly from the underlying
DataInputStream source. This means there is no need for
ByteBuffers, or worrying about overflow or underflow. Decoders just
attempt to get data just like the rest of our DataInputStream
numeric methods, and either there are enough bits to decode or
there aren't. There aren't any complications about potentially more
data allowing things to decode properly.
2. Since Decoders have direct access to the DataInputStream, they can
also update information such as bit position after a successful
decode of a character. This avoids the need to keep track of byte
buffer positions and calculate how many bytes were read.
3. Greatly simplifies decoding non-byte size charsets. Rather than
having to keep track of bit offsets and limits in a byte oriented
Java CharsetDecoder, these decoders can just ask for N bits from
the DataInputStream using existing numeric logic and convert them
to a character.
4. The most common operation is decoding of a single character for
delimiter scanning. We can design our decoders to make it easy to
decode a single character, and simplify edge cases like surrogate
pairs. The decoders can also take the dfdl:utf16Width property into
account when deciding what a 'single character' means in a UTF-16 charset.
5. We can make any decoder state exportable into Marks, removing
the need for frequent reset() calls and integrating cleanly into our
backtracking system.
6. Completely controlled by us, so we can extend and manipulate the
API as necessary.
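As a rough illustration of benefits 1 through 3, here is a minimal sketch of a decoder for a 7-bit packed ASCII charset that pulls bits straight from its source. BitSource is a hypothetical stand-in for the bit-level reads our DataInputStream provides, Option stands in for Maybe, and SevenBitAsciiDecoder is an invented name; none of this is the real Daffodil API.

```scala
// Hypothetical stand-in for the bit-level reads of DataInputStream.
final class BitSource(bytes: Array[Byte]) {
  var bitPosition: Long = 0L
  private val bitLimit: Long = bytes.length.toLong * 8

  // Read n bits (1 to 32), most significant bit first. Returns None and
  // leaves bitPosition unchanged when fewer than n bits remain.
  def getBits(n: Int): Option[Int] = {
    require(n >= 1 && n <= 32, "n must be between 1 and 32")
    if (bitPosition + n > bitLimit) None
    else {
      var value = 0
      var i = 0
      while (i < n) {
        val pos = bitPosition + i
        val bit = (bytes((pos >> 3).toInt) >> (7 - (pos & 7).toInt)) & 1
        value = (value << 1) | bit
        i += 1
      }
      bitPosition += n // advance only after a successful read
      Some(value)
    }
  }
}

// Hypothetical decoder: each character is simply the next 7 bits.
object SevenBitAsciiDecoder {
  def decode(src: BitSource): Option[Char] = src.getBits(7).map(_.toChar)
}

object BitDecodeDemo {
  def main(args: Array[String]): Unit = {
    // 'H' (1001000) and 'i' (1101001) packed into 14 bits plus 2 pad bits.
    val src = new BitSource(Array(0x91.toByte, 0xA4.toByte))
    println(SevenBitAsciiDecoder.decode(src)) // first character
    println(SevenBitAsciiDecoder.decode(src)) // second character
    println(SevenBitAsciiDecoder.decode(src)) // only 2 bits left
    println(src.bitPosition)
  }
}
```

The decoder itself has no buffers, no overflow/underflow handling, and no bit-offset bookkeeping; the bit position falls out of the read itself.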
The biggest drawback is really that we now need to implement and
maintain decoders, and that we no longer get standard Java decoders
for free. If someone needs a decoder, we need to implement it.
Fortunately, DFDL v1.0 only requires ASCII, ISO-8859-1, and UTF
encodings, and a handful of others should cover the vast majority of
our use cases.
As for a simple API, I imagine something like this:
trait DaffodilCharsetDecoder {

  // Decode a single character from the DataInputStream. If there is
  // not enough data or there is an encoding error, then return
  // Nope. Note that if dfdl:encodingErrorPolicy="replace" and there
  // is an encoding error, this will return the replacement
  // character rather than Nope. The bitPosition should be set to
  // the end of the decoded character.
  def decode(dis: DataInputStream, fmt: FormatInfo): Maybe[Char]

  // Starting at a given offset, decodes up to chars.length - offset
  // characters into the chars array. The lengths array must be the
  // same size as the chars array. Each entry in the lengths array
  // is the number of bytes read up to that char. This is useful in
  // cases like regular expression scanning where a large block of
  // characters is decoded first and then matched against. By
  // determining how many characters were matched, we can quickly
  // determine how many bytes represented the matched characters.
  // Returns the number of characters decoded. The bitPosition
  // should be set to the end of the last decoded character.
  def decode(dis: DataInputStream, fmt: FormatInfo, offset: Int,
             chars: Array[Char], lengths: Array[Int]): Int

  // Create a copy of any internal state and return it. May return
  // null if no internal state is maintained.
  def getState(): AnyRef

  // Restore a copy of internal state. When restored, it is safe to
  // assume that the state of the DataInputStream has been restored to
  // that at the time getState was called. state may be null if no
  // internal state is maintained.
  def setState(state: AnyRef): Unit
}
So one method to decode a single char for delimiter scanning, one
method to decode a block of characters along with their lengths,
and a getter and setter for the state if needed.
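To show how the lengths array would be used for regular expression scanning, here is a minimal sketch. It assumes each lengths entry is a cumulative byte count up to and including the char at that index, which is my reading of the comment above. MatchLengthDemo, cumulativeLengths and matchedBytes are hypothetical names, and cumulativeLengths only simulates what the block decode would fill in, for UTF-8 text without surrogate pairs.

```scala
import java.nio.CharBuffer
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

object MatchLengthDemo {
  // Stand-in for the block decode: cumulative UTF-8 byte counts per Char.
  // Assumes BMP-only text (no surrogate pairs).
  def cumulativeLengths(chars: Array[Char]): Array[Int] = {
    val lengths = new Array[Int](chars.length)
    var total = 0
    for (i <- chars.indices) {
      total += chars(i).toString.getBytes(StandardCharsets.UTF_8).length
      lengths(i) = total
    }
    lengths
  }

  // Number of bytes covered by a regex match anchored at the start of
  // chars, or None if the pattern does not match there.
  def matchedBytes(pattern: Pattern, chars: Array[Char],
                   lengths: Array[Int]): Option[Int] = {
    val m = pattern.matcher(CharBuffer.wrap(chars))
    if (!m.lookingAt()) None
    else if (m.end() == 0) Some(0) // empty match consumes no bytes
    else Some(lengths(m.end() - 1)) // one array lookup, no re-encoding
  }

  def main(args: Array[String]): Unit = {
    val chars = "\u00e9=1;rest".toCharArray // U+00E9 is two bytes in UTF-8
    val lengths = cumulativeLengths(chars)
    println(matchedBytes(Pattern.compile("[^;]*;"), chars, lengths))
  }
}
```

Once the match end is known, the byte count is a single array lookup rather than a re-encode of the matched text.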
One potential issue with this is the use of Char's. Some care may
still need to be taken to handle surrogate pairs, though it should be
cleaner since we have a function dedicated to returning a single char.
One alternative is to change the decode methods to return Int
code points rather than Char's. This completely removes special casing
of surrogate pairs, but would likely require changing much of
Daffodil to use Int code points instead of Char's, with new functions
to convert arrays of code points to strings when being displayed or
added to the infoset. This would also double the storage for strings (4
bytes per char instead of 2). So I'm not sure it's worth the effort to
completely be rid of surrogate pairs.
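For reference, here is a minimal sketch of the Char versus code point difference and of the kind of conversion functions the Int alternative would need. CodePointDemo and its methods are hypothetical names; the calls themselves are standard JDK APIs.

```scala
object CodePointDemo {
  // Split a String into Int code points (one Int per character,
  // regardless of surrogate pairs).
  def toCodePoints(s: String): Array[Int] = s.codePoints().toArray()

  // Convert Int code points back to a String, e.g. for display or for
  // adding to the infoset.
  def codePointsToString(cps: Array[Int]): String =
    new String(cps, 0, cps.length)

  def main(args: Array[String]): Unit = {
    val s = "a\uD83D\uDE00" // 'a' followed by U+1F600
    println(s.length)               // number of Chars
    println(toCodePoints(s).length) // number of code points
    println(codePointsToString(toCodePoints(s)) == s)
  }
}
```

With Chars, the second character occupies two array slots that must be kept together; with code points it is a single Int, at the cost of 4 bytes for every character.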
Thoughts on the general idea/approach?