Hi Subbu,
Just made a comment on this very topic in the pull request for DRILL-4653.
Looking at our JSON examples, it seems that we often include newlines inside a
JSON record. That seems fine; it is legal JSON. I’d also guess that the JSON
tokenizer we use may silently ignore whitespace.
On the other hand, our “sequence of JSON objects” format for JSON files is NOT
JSON. (There is quite a bit of discussion on this on Slashdot.) If it were a
single valid JSON document, it would appear like this:
[ { … }, { … } ]
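
To make the distinction concrete, here is a minimal sketch using Python's
standard json module (an illustration only, not Drill's reader; the sample
strings and the is_valid_json helper are made up for this example). A strict
parser rejects the concatenated form but accepts the array form:

```python
import json

# Hypothetical sample inputs for illustration.
concatenated = '{"a": 1}\n{"b": 2}'    # "sequence of JSON objects" style
as_array = '[{"a": 1}, {"b": 2}]'      # the same records as one JSON document
multi_line = '{"a":\n  1}'             # newline inside a single record


def is_valid_json(text):
    """Return True if text parses as exactly one JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(concatenated))  # False: extra data after the first object
print(is_valid_json(as_array))      # True
print(is_valid_json(multi_line))    # True: whitespace between tokens is legal
```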
So, I think you are on the right track with the } { question. Assume that } {
can appear on the same line (or on different lines). Outside of a quoted
string, } { is not valid JSON, so it can only appear at the boundary between
two JSON records. So, you can use } { (or, more generally, }\s*{) as a landmark
to know where one (perhaps badly formed) JSON record ends and another begins.
The only trick is that, when looking for }\s*{, we must push the { back onto
the input so it can be read again when processing the next (good) record.
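
A rough sketch of that recovery idea, in Python rather than Drill's own code,
and working on an in-memory string instead of a stream. The read_records name
and the sample input are hypothetical. On a parse error it scans forward for
the } { landmark and resumes at the {, which plays the role of the push-back.
Note the limitation: a } { inside a quoted string would fool the scan.

```python
import json
import re

BOUNDARY = re.compile(r"\}\s*\{")   # the "} {" landmark between records
WHITESPACE = re.compile(r"\s*")


def read_records(text):
    """Yield each well-formed record in text, skipping malformed ones."""
    decoder = json.JSONDecoder()
    pos = WHITESPACE.match(text, 0).end()
    while pos < len(text):
        try:
            # raw_decode parses one value and reports where it stopped.
            record, pos = decoder.raw_decode(text, pos)
            yield record
        except json.JSONDecodeError:
            # Bad record: look for the next "} {" landmark.
            boundary = BOUNDARY.search(text, pos)
            if boundary is None:
                return  # no later record to resynchronize on
            # "Push back" the "{": resume AT it, not after it.
            pos = boundary.end() - 1
        # Skip whitespace (including newlines) between records.
        pos = WHITESPACE.match(text, pos).end()


# The second record is malformed; the others should survive.
sample = '{"a": 1} {"b": } {"c": 3}\n{"d": 4}'
print(list(read_records(sample)))
```

Here the reader drops the bad {"b": } record and picks up again at {"c": 3}.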
See the pull request comment for details.
- Paul
> On Sep 8, 2016, at 10:50 AM, Subbu Srinivasan <[email protected]> wrote:
>
> Folks,
> What is the general thoughts of the team on how DRILL parses input data?
> Is the philosophy that the input is typically delimited (Eg: new line in
> case of JSON data)
> or should we remain agnostic and let the underlying parser implementation
> interpret start of
> the next input record ?
>
> Is it valid to have two json records in a single line?
>
> {"json"}{"json"}
>
> I am asking this from DRILL-4653 perspective?
>
>
> Thanks
> Subbu S