Re: Question on input file parsing

Paul Rogers Mon, 12 Sep 2016 23:26:29 -0700

Hi Subbu,

Just made a comment on this very topic in the pull request for DRILL-4653.

Looking at our JSON examples, it seems that we often include newlines inside a 
JSON record. That seems fine; it is legal JSON. I’d also guess that the JSON 
tokenizer we use silently ignores whitespace between tokens.

On the other hand, our “sequence of JSON objects” format for JSON files is NOT 
JSON. (There is quite a bit of discussion on this on Slashdot.) If it were 
valid JSON, it would look like this:

[ { … }, { … } ]

So, I think you are on the right track with the } { question. Assume that } { 
can appear on the same line or on different lines. Since } { is not valid JSON 
(outside of a string value), it can only appear at the boundary between two 
JSON records. So, you can use } { (or, more generally, the pattern }\s*{) as a 
landmark to know where one (perhaps badly formed) JSON record ends and another 
begins.

The only trick is that, when looking for }\s*{, we must push the { back onto 
the input so it can be read again when processing the next (good) record.
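
To make the push-back idea concrete, here is a minimal sketch in Python. It is 
not Drill's code: the PushbackReader class and skip_to_next_record function 
are hypothetical names made up for illustration, and the scan assumes that 
} { never occurs inside a string value.

```python
import io


class PushbackReader:
    """Hypothetical character reader that supports pushing characters back."""

    def __init__(self, text):
        self._src = io.StringIO(text)
        self._pushed = []

    def read(self):
        """Return the next character, or '' at end of input."""
        if self._pushed:
            return self._pushed.pop()
        return self._src.read(1)

    def unread(self, ch):
        """Push a character back so the next read() returns it again."""
        self._pushed.append(ch)


def skip_to_next_record(reader):
    """Discard input up to a '}', optional whitespace, '{' landmark.

    Returns True with the '{' pushed back onto the reader, or False if the
    input ends before a landmark is found.
    Assumption: the landmark never appears inside a JSON string value.
    """
    seen_close = False
    while True:
        ch = reader.read()
        if ch == "":
            return False
        if ch == "}":
            seen_close = True
        elif ch == "{" and seen_close:
            # Leave the '{' for whoever parses the next record.
            reader.unread(ch)
            return True
        elif not ch.isspace():
            # Any other non-whitespace character breaks the landmark.
            seen_close = False
```

After a parse error in one record, a caller would invoke 
skip_to_next_record(reader) and, if it returns True, resume normal parsing 
from the reader, which now starts at the { of the next record.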

See the pull request comment for details.

- Paul

> On Sep 8, 2016, at 10:50 AM, Subbu Srinivasan <[email protected]> wrote:
> 
> Folks,
> What is the general thoughts of the team on how DRILL parses input data?
> Is the philosophy that the input is typically delimited (Eg: new line in
> case of JSON data)
> or should we remain agnostic and let the underlying parser implementation
> interpret start of
> the next input record ?
> 
> Is it valid to have two json records in a single line?
> 
> {"json"}{"json"}
> 
> I am asking this from DRILL-4653 perspective?
> 
> 
> Thanks
> Subbu S
