Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Upon reflection, it seems that newline is not an adequate marker to
separate JSON records. Many of our samples have internal newlines. If a newline
appears inside the JSON record, then we are subject to the same incorrect
recovery as illustrated with the "a, x, bar, y" example in the earlier comment.
Further, if the JSON tokenizer is like most, it probably discards
whitespace, not returning EOL as a token.
So, it seems that the best (or only) option is to scan for the "} {" pair.
This requires two specific improvements:
* A "token discarder" that uses a state machine to look for the "} {"
pairs, and
* An indirection around the get-token method so we can push the "{" token
back onto the input.
These changes, along with the pseudo-code shown earlier may provide as good
a solution as we can get. (Phrased that way because some errors will cause two
records to be discarded, as explained earlier.) Combine that with the options
and error reporting from the original pull request and we are probably pretty
close.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---