Re: Avoiding use of Java CharsetDecoders

Mike Beckerle Wed, 06 Jun 2018 10:06:33 -0700

I am easily convinced that the complexity of maintaining maybe 10 or 20 of our 
own charset decoder/encoders is less than the current complexity of 
force-fitting Java's CharsetDecoder/Encoder into what we need.



I also think this architecture has the potential to be much faster than the 
existing one which has lots of overhead due to the way it has to bend-over 
backwards to accomodate Java's CharsetEncoder/Decoders.


I think the requirements for dfdl:encodingErrorPolicy and non-byte-wide charset 
encodings, and non-byte-aligned character code units, and the need for precise 
maintenance of the bit-position because non-character data can be arbitrarily 
mixed with binary data,... these things taken together do motivate a 
fundamental change in the I/O architecture around charsets.


A few thoughts:


IBMers in the DFDL workgroup have suggested a desire for DFDL 
character-fallback/substitution features that work like the ICU libraries do 
when decoding. This was needed by a customer of theirs, I think they said in 
Japan, so dealing with a Japanese-oriented charset encoding probably.  I don't 
think this disqualifies your proposal in any way, it's really isolated to what 
happens when a substitution character is to be produced and the way the 
charsets are defined for the new Daffodil-specific decoder/encoders. Just one 
more small thing to think about when we design how charsets are 
defined/specified for the new daffodil-specific encoders/decoders.


In your proposed API the method that decodes a single character would 
traditionally be called "read()", but my personal preference would be for it to 
be named decodeOne(...) to make it clear it decodes a single character.


The state access and restore methods should return/take a DecoderState which 
can just be a marker trait, but not an AnyRef.


We use regex libraries. Those operate on CharSequence, which has an API that 
guarantees finite-ness of the sequence. I.e., it always has a finite length.  
So one thing to consider is making DataInputStream implement CharSequence. A 
temporary setRegexLengthLimit(N) method might be needed to avoid the regex 
library misbehaving by pre-buffering the entire input. But lots of overheads 
and complexity might be avoided or better-isolated this way.


I think the surrogate-pairs problem is solved by the ability to save and 
restore the state of a decoder. I think quite literally there would be no code 
for surrogate pairs outside of the definition of the UTF-8 encoder/decoder.  I 
really like having this issue not impacting the code so much given what an 
utterly obscure corner case it is.


-mike beckerle







________________________________
From: Steve Lawrence <[email protected]>
Sent: Wednesday, June 6, 2018 12:02:59 PM
To: [email protected]
Subject: Avoiding use of Java CharsetDecoders

I've been working on the implementation of a bucketing data
structure/algorithm for the DataInputStream to better support
streaming data for a while now. It's mostly complete, but I've found
that the biggest hurdle to integrating it into our DataInputStream and
having something clean and maintainable is the Java ChartsetDecoders.
While the Java CharsetDecoder API is pretty nice for general usage,
for our specific use case, it causes lots of headaches and
difficulties. Some of the issues that make things difficult for our
use are listed below:

1. We must convert our DataInputStream to one or more
   ByteBuffers. This isn't difficult, but does make for more complex
   code and lots of edge cases to handle. And sometimes it requires
   copying large chunks of data to create ByteBuffers, which should be
   avoided.

2. ByteBuffers make it difficult to track how many bytes were decoded,
   as in the case with non-byte size encodings, or with replacement
   characters for encoding errors. And is especially
   difficult/inefficient when trying to figure out how many bytes
   represented the match of a regular expression. Knowing how many
   bytes decoded to a character is necessary to maintain our internal
   bit position state.

3. CharsetDecoders can contain private state. This means we often need
   to be conservative about how decoders are reset to ensure previous
   state does not affect decoding.

4. It is difficult to decode just a single character without messy
   edge cases due to surrogate pairs.

5. CharsetDecoders have a rigid implementation. Many methods are
   final or private, making it difficult/impossible to make changes to
   decode behavior.

Because of these reasons, I propose as part of the streaming/bucketing
changes, we also create our own decoders. The benefits from this are:

1. Decoders can be made to read directly from the underlying
   DataInputStream source. This means there is no need for
   ByteBuffers, or worrying about overflow or underflow. Decoders just
   attempt to get data just like the rest of our DataInputStream
   numeric methods, and either there are enough bits to decode or
   there aren't. There isn't any complications about potentially more
   data allowing things to decode properly.

2. Since Decoders have direct access the DataInputStream, they can
   also update information such as bit position after a successful
   decode of a character. This avoids the need to keep track of byte
   buffer positions and calculate how many bytes were read.

3. Greatly simplifies decoding non-byte size charsets. Rather than
   having to keep track of bit offsets and limits in a byte oriented
   Java CharsetDecoder, these decoders can just ask for N bits from
   the DataInputStream using existing numeric logic and convert them
   to a character.

4. The most common operation is decoding of a single character for
   delimiter scanning. We can design our decoders to make it easy to
   decode a single character, and simplify edge cases like surrogate
   pairs. This can also have knowledge about the dfdl:utf16Width
   property and what a 'single character' means in a UTF-16 charset.

5. We can make any decoder state be exportable into Marks, removing
   the need for frequent resets() and cleanly integrate into our
   backtracking system.

6. Completely controlled by us, so we can extend and manipulate the
   API as necessary.

The biggest drawback is really that we now need to implement and
maintain decoders, and that we no longer get standard Java decoders
for free. If someone needs a decoder, we need to implement it.
Fortunately, DFDL v1.0 only requires ASCII, ISO-8859-1, and UTF
encodings, and a handful of others should cover the vast majority of
our use cases.

As far as a simple API, I imagined something like this:

  class DaffodilCharsetDecoder {

    // Decode a single character from the DataInputStream. If there is
    // not enough data or there is an encoding error, then return
    // Nope. Note that if dfdl:encodingErrorPolicy="replace" and there
    // is an encoding error, this will return the replacement
    // character rather than Nope. This should set the bitPosition to
    // the end. The bitPosition should be set to where the end of
    // decoding the character finished.
    def decode(dis: DataInputStream, fmt: FormatInfo): Maybe[Char]

    // Staring at a given offset, decodes up to chars.length - offset
    // characters into the chars array. The lengths array must be the
    // same size as the chars array. Each index in the lengths array
    // is the number of bytes read up to that char. This is useful in
    // cases like regular expression scanning where a large block of
    // characters are decoded first and then matched against. By
    // determining how many characters were matched, we can quickly
    // determine how many bytes represented the matched characters.
    // Returns the number of characters decoded. The bitPosition
    // should be set to where the end of decoding the character
    // finished.
    def decode(dis: DataInputStream, fmt: FormatInfo, offset: Int,
chars: Array[Char], lengths: Array[Int]): Int

    // Create a copy of any internal state and return it. May return
    // null if no internal state is maintained
    def getState(): AnyRef

    // Restore a copy of internal state. When restored, it is safe to
    // assume that state of the DataInputStream has been restored to
    // that at the time getState was called. state may be null if no
    // internal state is maintained.
    def setState(state: AnyRef): Unit
  }

So one method to decode a single char for delimiter scanning, one
method to decode a block of characters with bit lengths of the decode,
and getter/setters for the state if needed.

One potential issue with this is the use of Char's. Some care may
still need to be taken to handle surrogate pairs, though it should be
cleaner since we have a function dedicated to returning a single char.
One alternative is to change the decode methods to return Int
code points rather than Char's. This completely removes special casing
of surrogate pairs, but would likely require changing much of
Daffodil to use Int code points instead of Char's, with new functions
to convert arrays of code points to strings when being displayed or
added to the infoset. This would also double the storage for strings 4
bytes per char instead of 2). So I'm not sure it's worth the effort to
completely be rid of surrogate pairs.

Thoughts on the general idea/approach?

Re: Avoiding use of Java CharsetDecoders

Reply via email to