Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/518 Thanks much for your contribution! Sorry it is taking a while to review the code. I've read over the code changes a couple of times. The options part is fine. However, the parser part seems a bit too complex and may not handle all cases of interest. I worry that the code may only handle the error shown in the test input file: a missing close quote: { "foo" : "bar } { "foo" : "mumble" } To do a more general recovery, we have to have some known good recovery point. Pure JSON is not suited to recovery. Fortunately, the JSON we read is not true JSON, rather it is a file of JSON objects. The file-of-json-object structure offers two possible recovery points. If we require that each JSON record start on a new line, then we can use newline as our recovery point: { "foo" :: 10 } { "foo" : 20 } In the above, the first record is badly formed. But, we know where the next record begins because of the newline separator. It seems that the code may be using the above approach. But, since the only test case is a missing close quote, the parser will "naturally" read all tokens up to EOL. But, the implementation seems to not handle the double-colon example: the paser will fail on the second token. When resuming, the parser will see another colon and fail again. Maybe the parser would eventually fail enough to find the EOL. But, we can construct cases where things would fail: { "foo" :: 10, "x": { "bar", 20, "mumble" } } If we don't discard tokens to the newline, the parser may decide that { "bar" ... starts a new valid record, with the result of causing a schema change when not needed. The message here is that, on error, we must discard tokens to EOL, we can't just try to resume parsing at the point of failure. If, however, we can't count on the newline, then we need some other syntactic trick. Perhaps we can look for "}/s*{" (that is, close bracket, optional white space, open bracket): { "foo" :: { "foo" : 20 } { "foo" : 30 } Here, we'd actually discard both the first and second records because we'd only find the recovery patttern between the second and third records. (The "} {" pattern can never occur inside a JSON object, it would have to be "}, {" instead.) Using EOL as delimiter is easiest: we don't need lookahead. But, we require that a newline separate records. (Newlines within records can be handled, but we'll omit that for now...) Using "}/s*{" as the recovery is a bit harder: we need to read ahead by one token, then push the token back on the input. If we do the above, then the reader algorithm should look something like this: while ( more records to read ) { try { read the record } catch ( parse exception ) { read and discard tokens until we get to an EOL (or the "}/s*{") reset the value vector pointer back one step to discard any partially loaded values. } } The above can be shown to be correct by working through a simple state machine and set of examples. Alternatively, it might help to include here (or in comments in code) an explanation of the proposed recovery mechanism so it is easier for reviewers to verify correctness.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---