BitsCharsetDecoder.scala has a section for handling decode errors, but the logic is commented out and replaced with a NotYetImplemented assertion. I think it's just a matter of having this section throw an encoding exception and having the calling code handle it appropriately.
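A rough sketch of that shape, to make the idea concrete. It uses the standard java.nio decoder as a stand-in, not Daffodil's BitsCharsetDecoder, and the names DecodeResult, Decoded, ParseError, decodeStrict, and parseString are hypothetical, made up for illustration only:

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical result type, standing in for however a parser reports failure.
sealed trait DecodeResult
final case class Decoded(text: String) extends DecodeResult
final case class ParseError(message: String) extends DecodeResult

object DecodeErrorSketch {
  // Assumption: java.nio is used here only as a stand-in decoder.
  // REPORT makes decode() throw instead of substituting a replacement char.
  def decodeStrict(bytes: Array[Byte]): String = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    decoder.decode(ByteBuffer.wrap(bytes)).toString
  }

  // The caller catches the encoding exception and turns it into a
  // parse error, so the parser can backtrack rather than crash.
  def parseString(bytes: Array[Byte]): DecodeResult =
    try Decoded(decodeStrict(bytes))
    catch {
      case e: CharacterCodingException =>
        ParseError("decode error: " + e.getClass.getSimpleName)
    }
}
```

Calling DecodeErrorSketch.parseString on valid UTF-8 bytes gives a Decoded value, while a stray 0xFF byte gives a ParseError. The lookahead case is not covered by this sketch.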
There are five callers of the decode() method in InputSourceDataInputStream, so I suspect those will all need to be updated to handle the exception and do the right thing, which might just be to let it bubble up to the parsers.

However, I think there are some subtleties that make decoder scanning more difficult. For example, delimiter scanning performs lookahead, which I don't think should immediately cause a parse error. I think it should only cause a parse error when an invalid character is actually read. So the InputSourceDataInputStreamCharIterator logic probably becomes a bit more complex in order to handle lookahead decode errors correctly. I haven't put too much thought into this though.

And then it's a matter of ensuring that the parsers that end up decoding characters also handle that parse error and start to backtrack, since I think many of them currently just assume that a call to an IO function that decodes characters will always succeed.

So I don't think it's going to be particularly difficult, but there are probably some subtleties in some cases, and we really need to inspect the parsers to make sure they are handling it correctly.

I agree the unparsing should not be too difficult for the reasons you've provided.

- Steve

On 10/3/18 5:15 PM, Mike Beckerle wrote:
> Turns out IBM DFDL implements only encodingErrorPolicy='error', and Daffodil
> only encodingErrorPolicy='replace'.
>
> That means for any data where there are encoding errors the two
> implementations will not behave the same.
>
> For compatibility testing, this will be problematic.
>
> The I/O layer was recently revised for parsing to use our own decoders.
>
> Not sure anything changed about encoders.
>
> How hard is implementing parse-time encodingErrorPolicy='error', in Daffodil,
> which just raises a parse error if a decode error occurs?
>
> I know for unparsing, if we're using java encoders, the implementation of
> encodingErrorPolicy='error' just requires initializing all encoders to have
> malformed and unmapped error handlers that throw. Then catching this throw
> and converting to an unparse error is all that is required. This has little
> or no performance implications as unparse errors are fatal.
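
For the unparse side described in the quoted message, a minimal sketch of the approach, assuming plain java.nio encoders. The UnparseError class and the encodeStrict helper are hypothetical names for illustration, not Daffodil's actual API, and US-ASCII is chosen only because it makes unmappable characters easy to trigger:

```scala
import java.nio.CharBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical exception type standing in for an unparse error.
final class UnparseError(msg: String, cause: Throwable) extends Exception(msg, cause)

object EncodeErrorSketch {
  def encodeStrict(text: String): Array[Byte] = {
    // Both handlers are set to REPORT so that encode() throws on bad input
    // instead of silently writing a replacement byte.
    val encoder = StandardCharsets.US_ASCII.newEncoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
      val buf = encoder.encode(CharBuffer.wrap(text))
      val out = new Array[Byte](buf.remaining())
      buf.get(out)
      out
    } catch {
      // Catch the throw and convert it to the (hypothetical) unparse error.
      case e: CharacterCodingException =>
        throw new UnparseError("cannot encode text as US-ASCII", e)
    }
  }
}
```

Here encodeStrict("abc") returns the three ASCII bytes, while a string containing a non-ASCII character such as "é" raises UnparseError.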
