[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

paul-rogers Mon, 12 Sep 2016 20:39:05 -0700

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518
  
    Upon reflection, it seems that newline is not an adequate marker to 
separate JSON records. Many of our samples have internal newlines. If a newline 
appears inside the JSON record, then we are subject to the same incorrect 
recovery as illustrated with the "a, x, bar, y" example in the earlier comment.
    
    Further, if the JSON tokenizer is like most, it probably discards 
whitespace, not returning EOL as a token.
    
    So, it seems that the best (or only) option is to scan for the "} {" pair. 
This requires two specific improvements:
    
    * A "token discarder" that uses a state machine to look for the "} {" 
pairs, and
    * An indirection around the get-token method so we can push the "{" token 
back onto the input.
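
    A minimal sketch of those two pieces, to make the idea concrete. It is not
Drill's actual reader code: the `Tok` enum, `PushbackTokens`, and
`skipToNextRecord` are hypothetical names standing in for the real tokenizer
and its token types, and the input is a plain list of tokens rather than a
stream.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class RecoverySketch {
    // Hypothetical token kinds; a real JSON tokenizer distinguishes many more.
    enum Tok { START_OBJECT, END_OBJECT, OTHER }

    // Indirection around the get-token method so a token can be pushed back.
    static class PushbackTokens {
        private final Iterator<Tok> source;
        private final Deque<Tok> pushed = new ArrayDeque<>();

        PushbackTokens(List<Tok> tokens) {
            this.source = tokens.iterator();
        }

        // Returns the next token, or null at end of input.
        Tok next() {
            if (!pushed.isEmpty()) {
                return pushed.pop();
            }
            return source.hasNext() ? source.next() : null;
        }

        // The pushed-back token is returned by the following call to next().
        void pushBack(Tok t) {
            pushed.push(t);
        }
    }

    // "Token discarder": a two-state machine that drops tokens until it sees
    // END_OBJECT immediately followed by START_OBJECT (the "} {" pair).
    // The "{" is pushed back so normal parsing resumes at the next record.
    // Returns the number of tokens discarded.
    static int skipToNextRecord(PushbackTokens in) {
        boolean sawEndObject = false;
        int discarded = 0;
        Tok t;
        while ((t = in.next()) != null) {
            if (sawEndObject && t == Tok.START_OBJECT) {
                in.pushBack(t);
                return discarded;
            }
            sawEndObject = (t == Tok.END_OBJECT);
            discarded++;
        }
        return discarded; // reached end of input without finding "} {"
    }
}
```

    Note that if the tokenizer does not report commas, the same "} {" pair can
also show up between objects inside an array, so this sketch can resume in the
middle of a record in that case.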
    
    These changes, along with the pseudo-code shown earlier, may provide as good 
a solution as we can get. (I phrase it that way because some errors will still 
cause two records to be discarded, as explained earlier.) Combine that with the 
options and error reporting from the original pull request and we are probably 
pretty close.



