I am easily convinced that the complexity of maintaining maybe 10 or 20 of our own charset decoder/encoders is less than the current complexity of force-fitting Java's CharsetDecoder/Encoder into what we need.
I also think this architecture has the potential to be much faster than the existing one which has lots of overhead due to the way it has to bend-over backwards to accomodate Java's CharsetEncoder/Decoders. I think the requirements for dfdl:encodingErrorPolicy and non-byte-wide charset encodings, and non-byte-aligned character code units, and the need for precise maintenance of the bit-position because non-character data can be arbitrarily mixed with binary data,... these things taken together do motivate a fundamental change in the I/O architecture around charsets. A few thoughts: IBMers in the DFDL workgroup have suggested a desire for DFDL character-fallback/substitution features that work like the ICU libraries do when decoding. This was needed by a customer of theirs, I think they said in Japan, so dealing with a Japanese-oriented charset encoding probably. I don't think this disqualifies your proposal in any way, it's really isolated to what happens when a substitution character is to be produced and the way the charsets are defined for the new Daffodil-specific decoder/encoders. Just one more small thing to think about when we design how charsets are defined/specified for the new daffodil-specific encoders/decoders. In your proposed API the method that decodes a single character would traditionally be called "read()", but my personal preference would be for it to be named decodeOne(...) to make it clear it decodes a single character. The state access and restore methods should return/take a DecoderState which can just be a marker trait, but not an AnyRef. We use regex libraries. Those operate on CharSequence, which has an API that guarantees finite-ness of the sequence. I.e., it always has a finite length. So one thing to consider is making DataInputStream implement CharSequence. A temporary setRegexLengthLimit(N) method might be needed to avoid the regex library misbehaving by pre-buffering the entire input. But lots of overheads and complexity might be avoided or better-isolated this way. I think the surrogate-pairs problem is solved by the ability to save and restore the state of a decoder. I think quite literally there would be no code for surrogate pairs outside of the definition of the UTF-8 encoder/decoder. I really like having this issue not impacting the code so much given what an utterly obscure corner case it is. -mike beckerle ________________________________ From: Steve Lawrence <[email protected]> Sent: Wednesday, June 6, 2018 12:02:59 PM To: [email protected] Subject: Avoiding use of Java CharsetDecoders I've been working on the implementation of a bucketing data structure/algorithm for the DataInputStream to better support streaming data for a while now. It's mostly complete, but I've found that the biggest hurdle to integrating it into our DataInputStream and having something clean and maintainable is the Java ChartsetDecoders. While the Java CharsetDecoder API is pretty nice for general usage, for our specific use case, it causes lots of headaches and difficulties. Some of the issues that make things difficult for our use are listed below: 1. We must convert our DataInputStream to one or more ByteBuffers. This isn't difficult, but does make for more complex code and lots of edge cases to handle. And sometimes it requires copying large chunks of data to create ByteBuffers, which should be avoided. 2. ByteBuffers make it difficult to track how many bytes were decoded, as in the case with non-byte size encodings, or with replacement characters for encoding errors. And is especially difficult/inefficient when trying to figure out how many bytes represented the match of a regular expression. Knowing how many bytes decoded to a character is necessary to maintain our internal bit position state. 3. CharsetDecoders can contain private state. This means we often need to be conservative about how decoders are reset to ensure previous state does not affect decoding. 4. It is difficult to decode just a single character without messy edge cases due to surrogate pairs. 5. CharsetDecoders have a rigid implementation. Many methods are final or private, making it difficult/impossible to make changes to decode behavior. Because of these reasons, I propose as part of the streaming/bucketing changes, we also create our own decoders. The benefits from this are: 1. Decoders can be made to read directly from the underlying DataInputStream source. This means there is no need for ByteBuffers, or worrying about overflow or underflow. Decoders just attempt to get data just like the rest of our DataInputStream numeric methods, and either there are enough bits to decode or there aren't. There isn't any complications about potentially more data allowing things to decode properly. 2. Since Decoders have direct access the DataInputStream, they can also update information such as bit position after a successful decode of a character. This avoids the need to keep track of byte buffer positions and calculate how many bytes were read. 3. Greatly simplifies decoding non-byte size charsets. Rather than having to keep track of bit offsets and limits in a byte oriented Java CharsetDecoder, these decoders can just ask for N bits from the DataInputStream using existing numeric logic and convert them to a character. 4. The most common operation is decoding of a single character for delimiter scanning. We can design our decoders to make it easy to decode a single character, and simplify edge cases like surrogate pairs. This can also have knowledge about the dfdl:utf16Width property and what a 'single character' means in a UTF-16 charset. 5. We can make any decoder state be exportable into Marks, removing the need for frequent resets() and cleanly integrate into our backtracking system. 6. Completely controlled by us, so we can extend and manipulate the API as necessary. The biggest drawback is really that we now need to implement and maintain decoders, and that we no longer get standard Java decoders for free. If someone needs a decoder, we need to implement it. Fortunately, DFDL v1.0 only requires ASCII, ISO-8859-1, and UTF encodings, and a handful of others should cover the vast majority of our use cases. As far as a simple API, I imagined something like this: class DaffodilCharsetDecoder { // Decode a single character from the DataInputStream. If there is // not enough data or there is an encoding error, then return // Nope. Note that if dfdl:encodingErrorPolicy="replace" and there // is an encoding error, this will return the replacement // character rather than Nope. This should set the bitPosition to // the end. The bitPosition should be set to where the end of // decoding the character finished. def decode(dis: DataInputStream, fmt: FormatInfo): Maybe[Char] // Staring at a given offset, decodes up to chars.length - offset // characters into the chars array. The lengths array must be the // same size as the chars array. Each index in the lengths array // is the number of bytes read up to that char. This is useful in // cases like regular expression scanning where a large block of // characters are decoded first and then matched against. By // determining how many characters were matched, we can quickly // determine how many bytes represented the matched characters. // Returns the number of characters decoded. The bitPosition // should be set to where the end of decoding the character // finished. def decode(dis: DataInputStream, fmt: FormatInfo, offset: Int, chars: Array[Char], lengths: Array[Int]): Int // Create a copy of any internal state and return it. May return // null if no internal state is maintained def getState(): AnyRef // Restore a copy of internal state. When restored, it is safe to // assume that state of the DataInputStream has been restored to // that at the time getState was called. state may be null if no // internal state is maintained. def setState(state: AnyRef): Unit } So one method to decode a single char for delimiter scanning, one method to decode a block of characters with bit lengths of the decode, and getter/setters for the state if needed. One potential issue with this is the use of Char's. Some care may still need to be taken to handle surrogate pairs, though it should be cleaner since we have a function dedicated to returning a single char. One alternative is to change the decode methods to return Int code points rather than Char's. This completely removes special casing of surrogate pairs, but would likely require changing much of Daffodil to use Int code points instead of Char's, with new functions to convert arrays of code points to strings when being displayed or added to the infoset. This would also double the storage for strings 4 bytes per char instead of 2). So I'm not sure it's worth the effort to completely be rid of surrogate pairs. Thoughts on the general idea/approach?
