Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/518 Upon reflection, it seems that newline is not an adequate marker to separate JSON records. Many of our samples have internal newlines. If a newline appears inside the JSON record, then we are subject to the same incorrect recovery as illustrated with the "a, x, bar, y" example in the earlier comment. Further, if the JSON tokenizer is like most, it probably discards whitespace, not returning EOL as a token. So, it seems that the best (or only) option is to scan for the "} {" pair. This requires two specific improvements: * A "token discarder" that uses a state machine to look for the "} {" pairs, and * An indirection around the get-token method so we can push the "{" token back onto the input. These changes, along with the pseudo-code shown earlier may provide as good a solution as we can get. (Phrased that way because some errors will cause two records to be discarded, as explained earlier.) Combine that with the options and error reporting from the original pull request and we are probably pretty close.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---