Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518
  
    Thanks much for your contribution! Sorry it is taking a while to review the 
code.
    
    I've read over the code changes a couple of times. The options part is 
fine. However, the parser part seems a bit too complex and may not handle all 
the cases of interest. I worry that the code may handle only the error shown in 
the test input file: a missing close quote:
    
    { "foo" : "bar }
    { "foo" : "mumble" }
    
    To do a more general recovery, we need some known-good recovery point. 
Pure JSON is not well suited to recovery. Fortunately, the JSON we read is not 
true JSON; rather, it is a file of JSON objects. The file-of-JSON-objects 
structure offers two possible recovery points.
    
    If we require that each JSON record start on a new line, then we can use 
newline as our recovery point:
    
    { "foo" :: 10 }
    { "foo" : 20 }
    
    In the above, the first record is badly formed. But, we know where the next 
record begins because of the newline separator.
    
    It seems that the code may be using the above approach. But, since the only 
test case is a missing close quote, the parser will "naturally" read all tokens 
up to EOL. The implementation seems not to handle the double-colon example, 
however: the parser will fail on the second colon. When resuming, the parser 
will see that colon again and fail again. Maybe the parser would eventually fail 
enough times to reach the EOL. But we can construct cases where things would fail:
    
    { "foo" :: 10, "x": { "bar", 20, "mumble" } }
    
    If we don't discard tokens to the newline, the parser may decide that { 
"bar" ... starts a new valid record, causing a schema change when none is 
needed.
    
    The message here is that, on error, we must discard tokens to EOL; we can't 
just try to resume parsing at the point of failure.
    
    If, however, we can't count on the newline, then we need some other 
syntactic trick. Perhaps we can look for "}\s*{" (that is, close brace, 
optional whitespace, open brace):
    
    { "foo" :: 
    { "foo" : 20 }
    { "foo" : 30 }
    
    Here, we'd actually discard both the first and second records, because we'd 
only find the recovery pattern between the second and third records. (The "} 
{" pattern can never occur inside a JSON object; there it would have to be "}, 
{" instead.)
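    To make the idea concrete, here is a small Python sketch (an illustration 
only, not Drill code; among other simplifications it ignores braces that appear 
inside string literals) that splits raw input at "}\s*{" boundaries:

```python
import re

# A "} {" boundary (close brace, optional whitespace, open brace) can only
# appear between two top-level records, never between the tokens of a single
# well-formed object, so it serves as a recovery point.
BOUNDARY = re.compile(r"\}\s*\{")

def split_records(text):
    """Split a file-of-JSON-objects blob at "}\\s*{" boundaries."""
    records = []
    start = 0
    for m in BOUNDARY.finditer(text):
        # The "}" ends the current record; the "{" starts the next one.
        records.append(text[start:m.start() + 1])
        start = m.end() - 1
    records.append(text[start:])
    return records
```

    Run against the three-record example above, this yields two chunks: the 
first contains both the bad record and the good second record (which is 
therefore lost along with it), and the second chunk contains only the third 
record.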
    
    Using EOL as delimiter is easiest: we don't need lookahead. But, we require 
that a newline separate records. (Newlines within records can be handled, but 
we'll omit that for now...)
    
    Using "}\s*{" as the recovery point is a bit harder: we need to read ahead 
by one token, then push that token back on the input.
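    The read-ahead-and-push-back step might look like this Python sketch 
(again an illustration only; the token stream and names here are hypothetical, 
and real JSON tokens would carry more structure than bare strings):

```python
class PushbackTokens:
    """Token stream with one token of pushback, so the scanner can read
    ahead to check for the "}" "{" recovery pattern and then return the
    "{" to the stream, since it belongs to the next record."""

    def __init__(self, tokens):
        self._it = iter(tokens)
        self._pushed = None

    def next(self):
        if self._pushed is not None:
            tok, self._pushed = self._pushed, None
            return tok
        return next(self._it)

    def push_back(self, tok):
        assert self._pushed is None, "only one token of lookahead"
        self._pushed = tok


def skip_to_recovery(stream):
    """Discard tokens until "}" immediately followed by "{" is found;
    push the "{" back so it starts the next record."""
    tok = stream.next()
    while True:
        if tok == "}":
            nxt = stream.next()
            if nxt == "{":
                stream.push_back(nxt)
                return
            tok = nxt  # nxt may itself be "}", so re-examine it
        else:
            tok = stream.next()
```

    A "}" inside a record (a nested object) is followed by "," or another "}", 
never directly by "{", so the scan correctly skips past nested structures in 
the bad record.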
    
    If we do the above, then the reader algorithm should look something like 
this:
    
        while ( more records to read ) {
            try {
                read the record
            } catch ( parse exception ) {
            read and discard tokens until we get to an EOL (or the "}\s*{" pattern)
                reset the value vector pointer back one step to discard any 
partially loaded values.
            }
        }
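    Assuming one record per line, the loop above might look like the following 
Python sketch (an illustration of the control flow only; Drill's reader is 
Java and works against a token stream and value vectors rather than whole 
lines):

```python
import json

def read_records(lines):
    """Read a file of JSON objects, one per line, skipping bad records.
    On a parse error we discard everything up to EOL (here, the rest of
    the line) rather than resuming at the point of failure."""
    records, errors = [], 0
    for line in lines:
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # Recovery point: the newline. The whole bad line is dropped,
            # which plays the role of resetting the value vector pointer
            # to discard any partially loaded values.
            errors += 1
    return records, errors
```

    With the double-colon input above, the first line is discarded as one unit 
and reading resumes cleanly at the second record.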
    
    The above can be shown to be correct by working through a simple state 
machine and set of examples.
    
    Alternatively, it might help to include here (or in comments in code) an 
explanation of the proposed recovery mechanism so it is easier for reviewers to 
verify correctness.


