Will this be able to handle projection pushdown if a given job doesn't utilize all the columns in the schema? Or should people have a per-job schema?
On Wed, Sep 28, 2016 at 2:17 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Burak, you can configure what happens with corrupt records for the
> datasource using the parse mode. The parse will still fail, so we can't
> get any data out of it, but we do leave the JSON in another column for
> you to inspect.
>
> In the case of this function, we'll just return null if it's
> unparseable. You could filter for rows where the function returns null
> and inspect the input if you want to see what's going wrong.
>
>> When you talk about 'user specified schema' do you mean for the user to
>> supply an additional schema, or that you're using the schema that's
>> described by the JSON string?
>
> I mean we don't do schema inference (which we might consider adding, but
> that would be a much larger change than this PR). You need to construct
> a StructType that says what columns you want to extract from the JSON
> column and pass that in. I imagine in many cases the user will run
> schema inference ahead of time and then encode the inferred schema into
> their program.
>
> On Wed, Sep 28, 2016 at 11:04 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>> I would really love something like this! It would be great if it
>> doesn't throw away corrupt_records like the Data Source.
>>
>> On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande <nathanla...@gmail.com> wrote:
>>>
>>> We are currently pulling out the JSON columns, passing them through
>>> read.json, and then joining them back onto the initial DF, so something
>>> like from_json would be a nice quality-of-life improvement for us.
>>>
>>> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>> Spark SQL has great support for reading text files that contain JSON
>>>> data. However, in many cases the JSON data is just one column amongst
>>>> others. This is particularly true when reading from sources such as
>>>> Kafka.
>>>>
>>>> This PR adds a new function from_json that converts a string column
>>>> into a nested StructType with a user-specified schema, using the same
>>>> internal logic as the json Data Source.
>>>>
>>>> Would love to hear any comments / suggestions.
>>>>
>>>> Michael

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
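The semantics described in the thread, parsing a JSON string against a user-specified schema and returning null rather than failing on unparseable input, can be sketched in plain Python. This is only an illustration of the behavior being discussed, not the actual from_json API; the helper name and schema representation here are invented for the example:

```python
import json

def from_json_sketch(json_str, schema):
    """Mimic the described from_json behavior: extract only the fields
    named in the (user-specified) schema from a JSON string, returning
    None when the input cannot be parsed, instead of raising."""
    try:
        parsed = json.loads(json_str)
    except (ValueError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    # Keep only the columns named in the schema; missing fields become None.
    return {field: parsed.get(field) for field in schema}

rows = ['{"id": 1, "name": "a", "extra": true}', '{broken json']
schema = ["id", "name"]
results = [from_json_sketch(r, schema) for r in rows]
# results[0] -> {'id': 1, 'name': 'a'}; results[1] -> None
```

Filtering for rows where the result is None then corresponds to Michael's suggestion for inspecting inputs that failed to parse.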