Re: how hard is encodingErrorPolicy='error' to implement?

Mike Beckerle Mon, 08 Oct 2018 09:31:32 -0700

Yeah, I recall the reason we do 'replace' is that lookahead decoding must not
raise errors.

What I did in the old code, and I think it had unit tests for this, was to
return short on a decode error: the caller was delivered only the characters
up to the error, as if no error had occurred. The caller then has to consume
all chars up to the error and call again, insisting on more data, before the
error is thrown rather than masked as just being part of lookahead. That was
the intent anyway. The notion here is that an I/O call that asks for N bytes
of data always has to be prepared to accept fewer than N on return. In the
case of decode errors and encodingErrorPolicy="error", the call returns fewer
than N (but at least one), up to the point of the error, and masks the error.
If the I/O layer cannot return even one non-error character, it propagates
the error as a decode error.
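
To make the "return short" idea concrete, here is a minimal sketch. It assumes
a plain java.nio UTF-8 decoder rather than Daffodil's own decoders, and the
names ShortReadDecoder and decodeUpTo are made up for illustration; this is
not the old code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Daffodil API.
final class ShortReadDecoder {
    // Decodes at most maxChars characters from 'in'.
    // On a decode error, returns the characters before the error (a short
    // read) and leaves in.position() at the first bad byte. The error is
    // thrown only when not even one character could be decoded.
    static String decodeUpTo(ByteBuffer in, int maxChars)
            throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(maxChars);
        CoderResult result = dec.decode(in, out, true);
        out.flip();
        if (result.isError() && out.length() == 0) {
            // Nothing valid to deliver: propagate as a decode error.
            result.throwException();
        }
        // Either a full read or a short read that masks the error for now.
        return out.toString();
    }
}
```

With the bytes 'a', 'b', 0xFF, 'c', the first call delivers "ab" and the
second call, starting at the 0xFF byte, throws.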

The DFDL spec isn't specific enough here about the exact requirement. It
doesn't make clear that one must not issue spurious errors when pre-fetching
and pre-decoding happens to run past the end of the text, encounters binary
data, and so hits decode errors. I've sent email to the DFDL workgroup for
clarification of this.

It does specify that, for asserts/discriminators with testKind pattern, the
regular expression can result in scanning for characters, and that decode
errors can occur and are handled as per encodingErrorPolicy. So a regex for
such an assert/discriminator must be designed with decode errors in mind.
However, if the resulting pattern match of the regex is much shorter than what
was buffered and pre-decoded (for efficiency reasons), the DFDL spec is again
unclear about whether an encoding error should be issued or not.

For other parsing situations, the DFDL spec does say specifically that if
lengthUnits='bytes' and there aren't enough bytes to hold the representation
of a character, then on parse the bytes are skipped, and on unparse they're
filled with the fillByte. For lengthUnits='characters', such fragments of
characters are errors subject to encodingErrorPolicy.

________________________________
From: Steve Lawrence <[email protected]>
Sent: Monday, October 8, 2018 7:26:28 AM
To: [email protected]; Mike Beckerle
Subject: Re: how hard is encodingErrorPolicy='error' to implement?

BitsCharsetDecoder.scala has a section for handling decode errors, but
has logic commented out and replaced with a NotYetImplemented assertion.
I think it's just a matter of having this section throw an encoding
exception and the caller code handling it appropriately.

There are five callers in InputSourceDataInputStream of the decode()
method, so I suspect those will all need to be updated to handle the
exception and do the right thing, which might just be to let it
bubble up to the parsers.

However, I think there are some subtleties that make decoder scanning
more difficult. For example, delimiter scanning performs lookahead which
I don't think should immediately cause a parser error. I think it should
only cause a parse error when an invalid character is actually read. So
the InputSourceDataInputStreamCharIterator logic probably becomes a bit
more complex to handle lookahead decode errors correctly. I haven't put
too much thought into this though.
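
One way the lookahead behavior described above could look, as a rough sketch
only: pre-decode eagerly, remember whether an error follows the good
characters, and raise it only on a consuming read. This uses java.nio's UTF-8
decoder, and DeferredErrorChars, peek, and next are hypothetical names, not
the InputSourceDataInputStreamCharIterator API. IllegalStateException stands
in for whatever would become the parse error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;

// Hypothetical illustration, not Daffodil code.
final class DeferredErrorChars {
    private final String decoded;      // characters pre-decoded without error
    private final boolean errorAfter;  // true if a decode error follows them
    private int pos = 0;

    DeferredErrorChars(ByteBuffer in) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        // UTF-8 never yields more than one char per input byte.
        CharBuffer out = CharBuffer.allocate(in.remaining());
        CoderResult result = dec.decode(in, out, true);
        out.flip();
        decoded = out.toString();
        errorAfter = result.isError();
    }

    // Lookahead: never throws. Returns -1 if no valid char is at that offset.
    int peek(int offset) {
        int i = pos + offset;
        return i < decoded.length() ? decoded.charAt(i) : -1;
    }

    // Consuming read: the only place the decode error surfaces.
    char next() {
        if (pos < decoded.length()) {
            return decoded.charAt(pos++);
        }
        if (errorAfter) {
            throw new IllegalStateException("decode error at char offset " + pos);
        }
        throw new NoSuchElementException("end of data");
    }
}
```

A delimiter scanner would use only peek(), so running into binary data after
the text just ends the scan; a parser that actually reads the bad character
through next() gets the error.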

And then it's a matter of ensuring the parsers that end up decoding
characters also handle that parse error and start to backtrack, since I
think many of them currently just assume a call to an IO function that
decodes characters will always succeed.

So I don't think it's going to be particularly difficult, but there are
probably some subtleties in some cases, and we really need to inspect
parsers to make sure they are handling it correctly.

I agree the unparsing should not be too difficult for the reasons you've
provided.

- Steve

On 10/3/18 5:15 PM, Mike Beckerle wrote:
> Turns out IBM DFDL implements only encodingErrorPolicy='error', and Daffodil 
> only encodingErrorPolicy='replace'.
>
>
> That means for any data where there are encoding errors the two 
> implementations will not behave the same.
>
> For compatibility testing, this will be problematic.
>
>
> The I/O layer was recently revised for parsing to use our own decoders.
>
>
> Not sure anything changed about encoders.
>
>
> How hard is implementing parse-time encodingErrorPolicy='error', in Daffodil, 
> which just raises a parse error if a decode error occurs?
>
>
> I know for unparsing, if we're using java encoders, the implementation of 
> encodingErrorPolicy='error' just requires initializing all encoders to have 
> malformed and unmapped error handlers that throw. Then catching this throw 
> and converting to an unparse error is all that is required. This has little 
> or no performance implications as unparse errors are fatal.
>
>
>
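
For reference, a minimal sketch of the unparse-side approach described in the
quoted message above, assuming standard java.nio.charset encoders. StrictEncode
and UnparseError are hypothetical names used only for this illustration; they
are not Daffodil classes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration, not Daffodil code.
final class StrictEncode {
    // Stand-in for whatever the runtime uses to signal an unparse error.
    static final class UnparseError extends RuntimeException {
        UnparseError(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    static byte[] encode(String text, Charset cs) {
        // REPORT makes the encoder throw instead of substituting a
        // replacement for malformed or unmappable characters.
        CharsetEncoder enc = cs.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer bb = enc.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[bb.remaining()];
            bb.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Convert the encoder's exception into an unparse error.
            throw new UnparseError("cannot encode text as " + cs.name(), e);
        }
    }
}
```

Encoding "abc" as US-ASCII succeeds, while a string containing a character
outside US-ASCII raises UnparseError instead of silently writing '?'.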
