Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Looks like you are right; the JsonParser is more than a simple tokenizer.
We're not the first to try this:
http://stackoverflow.com/questions/37511496/recover-from-malformed-json-with-jackson
(no answer)
I tried an experiment and found that you are on the right track: the way
you are using the JsonParser can be extended to ignore input until the start of
the next object. A quick demonstration:
    private static void recover(JsonParser parser) throws IOException {
        for ( ; ; ) {
            JsonToken token;
            try {
                token = parser.nextToken();
            } catch( JsonParseException e ) { continue; }  // skip malformed input
            if ( token == null ) return;                   // end of input
            if ( token != JsonToken.END_OBJECT ) { continue; }
            // Saw "}": stop only if the very next token is "{".
            token = parser.nextToken();
            if ( token == null ) return;
            if ( token == JsonToken.START_OBJECT ) { return; }
        }
    }
Basically, we keep reading tokens, and ignoring errors, until we
successfully find the } { pair.
As we discussed before, to use the above in Drill, we have to discard the
partly-built record and start reading the next record, assuming the parser is
positioned **after** the START_OBJECT ("{") token, which recover() has already
consumed. That should be simple.
Still, to do proper recovery, we have to discard the partly-built JSON
record. I've not looked into how to do that. If we don't do that, we return the
bogus partly-built record. Worse, if we recover by trying to build a new
record, we create more partly-built records, but with a different schema,
possibly triggering a schema change event when not really necessary.
Any ideas for how to solve that problem?
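One direction, purely as a sketch: stage the fields of the current record in a
temporary buffer and publish them only once the closing END_OBJECT is seen. On
a parse error the buffer is dropped, so the partial record never reaches the
output and cannot affect the schema. The code below is plain Java and does not
reflect how Drill's writers actually work; StagedRecordWriter and its methods
put(), commit(), rollback() and records() are hypothetical names used only for
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical staging writer (not a Drill class): fields become visible
// to the caller only after commit().
class StagedRecordWriter {
    // Fields of the record that is still being parsed.
    private final Map<String, Object> pending = new LinkedHashMap<>();
    // Records that were parsed completely.
    private final List<Map<String, Object>> committed = new ArrayList<>();

    // Call for each field while the record is being parsed.
    void put(String name, Object value) {
        pending.put(name, value);
    }

    // Call on END_OBJECT: the record is complete, so publish a copy of it.
    void commit() {
        committed.add(new LinkedHashMap<>(pending));
        pending.clear();
    }

    // Call on a parse error, before recovering: drop the partial record.
    void rollback() {
        pending.clear();
    }

    // Completed records only; a rolled-back record never appears here.
    List<Map<String, Object>> records() {
        return committed;
    }
}
```

The read loop would call rollback() in its JsonParseException handler, then
recover(), and then resume parsing fields as if START_OBJECT had just been
read.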