Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Totally agree that specifying the schema manually should be the
baseline.  LGTM, thanks for working on it.  Seems like it looks good
to others too, judging by the comment on the PR that it's getting
merged to master :)

On Thu, Sep 29, 2016 at 2:13 PM, Michael Armbrust
 wrote:
>> Will this be able to handle projection pushdown if a given job doesn't
>> utilize all the columns in the schema?  Or should people have a
>> per-job schema?
>
>
> As currently written, we will do a little bit of extra work to pull out
> fields that aren't needed.  I think it would be pretty straightforward to
> add a rule to the optimizer that prunes the schema passed to the
> JsonToStruct expression when there is another Project operator present.
>
>> I’m not a spark guru, but I would have hoped that DataSets and DataFrames
>> were more dynamic.
>
>
> We are dynamic in that all of these decisions can be made at runtime, and
> you can even look at the data when making them.  We do however need to know
> the schema before any single query begins executing so that we can give good
> analysis error messages and so that we can generate efficient byte code in
> our code generation.
>
>>
>> You should be doing schema inference. JSON includes the schema with each
>> record and you should take advantage of it. I guess the only issue is that
>> DataSets / DataFrames have static schemas and structures. Then if your first
>> record doesn’t include all of the columns you will have a problem.
>
>
> I agree that for ad-hoc use cases we should make it easy to infer the
> schema.  I would also argue that for a production pipeline you need the
> ability to specify it manually to avoid surprises.
>
> There are several tricky cases here.  You bring up the fact that the first
> record might be missing fields, but in many data sets there are fields that
> are only present in one out of hundreds of thousands of records.  Even if
> all fields are present, sometimes it can be very expensive to get even the
> first record (say you are reading from an expensive query coming from the
> JDBC data source).
>
> Another issue is that inference means you need to read some data before the
> user explicitly starts the query.  Historically, cases where we do this have
> been pretty confusing to users of Spark (think: the surprise job that finds
> partition boundaries for RDD.sort).
>
> So, I think we should add inference, but that it should be in addition to
> the API proposed in this PR.




Re: Spark SQL JSON Column Support

2016-09-29 Thread Michael Armbrust
> Will this be able to handle projection pushdown if a given job doesn't
> utilize all the columns in the schema?  Or should people have a
> per-job schema?

As currently written, we will do a little bit of extra work to pull out
fields that aren't needed.  I think it would be pretty straightforward to
add a rule to the optimizer that prunes the schema passed to the
JsonToStruct expression when there is another Project operator present.
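
(For anyone skimming the archive: a rough sketch of the kind of query such a
pruning rule would help with. The schema, column names, and df below are made
up, and this assumes the from_json(column, schema) API proposed in the PR.)

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    import spark.implicits._  // assumes a SparkSession named `spark` is in scope

    // df is assumed to have a string column "json".
    val schema = new StructType()
      .add("user", new StructType().add("id", LongType).add("name", StringType))
      .add("payload", StringType)

    // Only user.id is referenced, so an optimizer rule could prune the schema
    // passed to JsonToStruct down to just that one nested field.
    val ids = df.select(
      from_json($"json", schema).getField("user").getField("id").as("user_id"))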

> I’m not a spark guru, but I would have hoped that DataSets and DataFrames
> were more dynamic.


We are dynamic in that all of these decisions can be made at runtime, and
you can even look at the data when making them.  We do however need to know
the schema before any single query begins executing so that we can give
good analysis error messages and so that we can generate efficient byte
code in our code generation.


> You should be doing schema inference. JSON includes the schema with each
> record and you should take advantage of it. I guess the only issue is
> that DataSets / DataFrames have static schemas and structures. Then if your
> first record doesn’t include all of the columns you will have a problem.


I agree that for ad-hoc use cases we should make it easy to infer the
schema.  I would also argue that for a production pipeline you need the
ability to specify it manually to avoid surprises.

There are several tricky cases here.  You bring up the fact that the first
record might be missing fields, but in many data sets there are fields that
are only present in one out of hundreds of thousands of records.  Even if all
fields are present, sometimes it can be very expensive to get even the first
record (say you are reading from an expensive query coming from the JDBC data
source).

Another issue is that inference means you need to read some data before
the user explicitly starts the query.  Historically, cases where we do this
have been pretty confusing to users of Spark (think: the surprise job that
finds partition boundaries for RDD.sort).

So, I think we should add inference, but that it should be in addition to
the API proposed in this PR.
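
(To make the "in addition to" point concrete: a sketch of the infer-once-then-pin
pattern this implies for production pipelines. The column names and the
pinnedSchemaJson value are made up, and the from_json(column, schema) signature
is the one proposed in the PR.)

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{DataType, StructType}
    import spark.implicits._  // assumes a SparkSession named `spark` is in scope

    // Ad-hoc, one-off step: infer a schema from a sample of the raw JSON strings.
    val inferred = spark.read.json(df.select($"json").as[String].rdd).schema
    println(inferred.prettyJson)  // save this alongside the production job

    // Production job: use the pinned schema so nothing is inferred at runtime.
    val schema = DataType.fromJson(pinnedSchemaJson).asInstanceOf[StructType]
    val parsed = df.select(from_json($"json", schema).as("parsed"))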


Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Will this be able to handle projection pushdown if a given job doesn't
utilize all the columns in the schema?  Or should people have a
per-job schema?

On Wed, Sep 28, 2016 at 2:17 PM, Michael Armbrust
 wrote:
> Burak, you can configure what happens with corrupt records for the
> datasource using the parse mode.  The parse will still fail, so we can't get
> any data out of it, but we do leave the JSON in another column for you to
> inspect.
>
> In the case of this function, we'll just return null if it's unparseable.  You
> could filter for rows where the function returns null and inspect the input
> if you want to see what's going wrong.
>
>> When you talk about ‘user specified schema’ do you mean for the user to
>> supply an additional schema, or that you’re using the schema that’s
>> described by the JSON string?
>
>
> I mean we don't do schema inference (which we might consider adding, but
> that would be a much larger change than this PR).  You need to construct a
> StructType that says what columns you want to extract from the JSON column
> and pass that in.  I imagine in many cases the user will run schema
> inference ahead of time and then encode the inferred schema into their
> program.
>
>
> On Wed, Sep 28, 2016 at 11:04 AM, Burak Yavuz  wrote:
>>
>> I would really love something like this! It would be great if it doesn't
>> throw away corrupt_records like the Data Source.
>>
>> On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande 
>> wrote:
>>>
>>> We are currently pulling out the JSON columns, passing them through
>>> read.json, and then joining them back onto the initial DF so something like
>>> from_json would be a nice quality of life improvement for us.
>>>
>>> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust
>>>  wrote:

 Spark SQL has great support for reading text files that contain JSON
 data. However, in many cases the JSON data is just one column amongst
 others. This is particularly true when reading from sources such as Kafka.
 This PR adds a new function from_json that converts a string column into a
 nested StructType with a user-specified schema, using the same internal
 logic as the json Data Source.

 Would love to hear any comments / suggestions.

 Michael
>>>
>>>
>>
>




Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Burak, you can configure what happens with corrupt records for the
datasource using the parse mode.  The parse will still fail, so we can't
get any data out of it, but we do leave the JSON in another column for you
to inspect.

In the case of this function, we'll just return null if it's unparseable.  You
could filter for rows where the function returns null and inspect the input
if you want to see what's going wrong.
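
(For instance, roughly like this; column names are made up and this assumes the
from_json(column, schema) API from the PR, with `schema` being the StructType
you expect.)

    import org.apache.spark.sql.functions.from_json
    import spark.implicits._  // assumes a SparkSession named `spark` is in scope

    // df is assumed to have a string column "json".
    val withParsed = df.withColumn("parsed", from_json($"json", schema))

    // Rows where parsing failed: from_json returned null for a non-null input.
    val corrupt = withParsed.filter($"parsed".isNull && $"json".isNotNull)
    corrupt.select($"json").show(false)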

> When you talk about ‘user specified schema’ do you mean for the user to
> supply an additional schema, or that you’re using the schema that’s
> described by the JSON string?


I mean we don't do schema inference (which we might consider adding, but
that would be a much larger change than this PR).  You need to construct a
StructType that says what columns you want to extract from the JSON column
and pass that in.  I imagine in many cases the user will run schema
inference ahead of time and then encode the inferred schema into their
program.
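
(Concretely, something along these lines; the field names and the `events`
DataFrame are made up, and the from_json(column, schema) signature is the one
proposed in the PR.)

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    import spark.implicits._  // assumes a SparkSession named `spark` is in scope

    // Say exactly which fields you want pulled out of the JSON string column.
    val schema = new StructType()
      .add("device", StringType)
      .add("ts", TimestampType)
      .add("reading", DoubleType)

    // events is assumed to have an "id" column and a string column "json".
    val parsed = events.select($"id", from_json($"json", schema).as("data"))
    parsed.select($"id", $"data.device", $"data.reading")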


On Wed, Sep 28, 2016 at 11:04 AM, Burak Yavuz  wrote:

> I would really love something like this! It would be great if it doesn't
> throw away corrupt_records like the Data Source.
>
> On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande 
> wrote:
>
>> We are currently pulling out the JSON columns, passing them through
>> read.json, and then joining them back onto the initial DF so something like
>> from_json would be a nice quality of life improvement for us.
>>
>> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Spark SQL has great support for reading text files that contain JSON
>>> data. However, in many cases the JSON data is just one column amongst
>>> others. This is particularly true when reading from sources such as Kafka.
>>> This PR adds a new function from_json that converts a string column into
>>> a nested StructType with a user-specified schema, using the same internal
>>> logic as the json Data Source.
>>>
>>> Would love to hear any comments / suggestions.
>>>
>>> Michael
>>>
>>
>>
>


Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Segel
Silly question?
When you talk about ‘user specified schema’ do you mean for the user to supply 
an additional schema, or that you’re using the schema that’s described by the 
JSON string?
(or both? [either/or] )

Thx

On Sep 28, 2016, at 12:52 PM, Michael Armbrust wrote:

Spark SQL has great support for reading text files that contain JSON data.
However, in many cases the JSON data is just one column amongst others. This is
particularly true when reading from sources such as Kafka. This PR adds a new
function from_json that converts a string column into a nested StructType with
a user-specified schema, using the same internal logic as the json Data Source.

Would love to hear any comments / suggestions.

Michael



Re: Spark SQL JSON Column Support

2016-09-28 Thread Burak Yavuz
I would really love something like this! It would be great if it doesn't
throw away corrupt_records like the Data Source.

On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande 
wrote:

> We are currently pulling out the JSON columns, passing them through
> read.json, and then joining them back onto the initial DF so something like
> from_json would be a nice quality of life improvement for us.
>
> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust wrote:
>
>> Spark SQL has great support for reading text files that contain JSON
>> data. However, in many cases the JSON data is just one column amongst
>> others. This is particularly true when reading from sources such as Kafka.
>> This PR adds a new function from_json that converts a string column into
>> a nested StructType with a user-specified schema, using the same internal
>> logic as the json Data Source.
>>
>> Would love to hear any comments / suggestions.
>>
>> Michael
>>
>
>


Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Spark SQL has great support for reading text files that contain JSON data.
However, in many cases the JSON data is just one column amongst others.
This is particularly true when reading from sources such as Kafka. This PR
adds a new function from_json that converts a string column into a nested
StructType with a user-specified schema, using the same internal logic as
the json Data Source.
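
(For example, something like the sketch below; the schema is made up and
kafkaDF stands in for a DataFrame whose binary value column carries JSON from
Kafka, assuming from_json keeps the (column, schema) signature from the PR.)

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    import spark.implicits._  // assumes a SparkSession named `spark` is in scope

    val schema = new StructType()
      .add("user", StringType)
      .add("action", StringType)

    // Cast the Kafka value to a string and parse it into a nested struct column.
    val events = kafkaDF
      .select(from_json($"value".cast("string"), schema).as("event"))
      .select($"event.user", $"event.action")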

Would love to hear any comments / suggestions.

Michael