Re: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Michael Segel
On Nov 2, 2016, at 2:22 PM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote: Yes, it needs to be on a single line. Spark (or Hadoop really) treat

RE: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Mendelson, Assaf
On Nov 2, 2016, at 2:22 PM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote: Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a re

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
On Nov 2, 2016, at 2:22 PM, Daniel Siegmann wrote: Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a record separator by default. While it is possible to use a different string as a

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Daniel Siegmann
Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a record separator by default. While it is possible to use a different string as a record separator, what would you use in the case of JSON? If you do some Googling I suspect you'll find some possible solutions.
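To illustrate the record-separator behavior Daniel describes, here is a minimal stdlib-only sketch of the "JSON Lines" layout Spark's line-based JSON reader expects: one complete JSON object per line, because the underlying Hadoop input format splits records on newlines by default. (The sample records are invented for illustration.)

```python
import json

# One complete JSON object per line ("JSON Lines"). A record that spanned
# multiple lines would be split mid-object and fail to parse.
jsonl = '{"id": 1, "type": "THEFT"}\n{"id": 2, "type": "BATTERY"}'

# Each line parses independently, exactly as a line-oriented reader sees it.
records = [json.loads(line) for line in jsonl.splitlines()]
```

In Spark itself, `sqlContext.read.json(path)` applies the same per-line assumption to each record in the file.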

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
ARGH!! Looks like a formatting issue. Spark doesn’t like ‘pretty’ output. So then the entire record which defines the schema has to be a single line? Really? On Nov 2, 2016, at 1:50 PM, Michael Segel wrote: This may be a silly
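A common workaround for the "pretty output" problem above is simply to re-serialize each record without indentation before handing the file to Spark. A minimal sketch using only the standard library (the sample record is hypothetical):

```python
import json

# A pretty-printed record: the object spans several lines, which a
# newline-delimited reader will split into unparseable fragments.
pretty = """{
  "id": 1,
  "type": "THEFT"
}"""

# Parse and re-serialize compactly so the whole record sits on one line.
flat = json.dumps(json.loads(pretty))
```

The flattened string contains no newlines, so it can be written out one record per line in the form the line-based reader expects.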

Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
This may be a silly mistake on my part… Doing an example using Chicago’s Crime data… (There’s a lot of it going around. ;-) The goal is to read a file containing a JSON record that describes the crime data .csv for ingestion into a data frame, then output to a Parquet file. (Pretty
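For the ingestion step described above, a hedged sketch of preparing a newline-delimited JSON file that Spark could then load and persist as Parquet. The records, file name, and the `spark` session in the comments are assumptions for illustration, not the poster's actual data:

```python
import json
import os
import tempfile

# Hypothetical crime records standing in for the Chicago data.
crimes = [
    {"id": 1, "primary_type": "THEFT"},
    {"id": 2, "primary_type": "BATTERY"},
]

# Write newline-delimited JSON: one compact record per line, no pretty-printing.
path = os.path.join(tempfile.mkdtemp(), "crimes.json")
with open(path, "w") as f:
    for rec in crimes:
        f.write(json.dumps(rec) + "\n")

# In Spark (assuming a session named `spark`), such a file could then be
# read into a DataFrame and written out as Parquet:
#   df = spark.read.json(path)
#   df.write.parquet("crimes.parquet")
```

Keeping each record on its own line is what makes the subsequent `read.json` call work without any custom record separator.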