Hi Kant,

Ah, I thought you wanted to find a workaround for it.
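To make the workaround concrete, here is a minimal end-to-end sketch. It assumes the JSON strings live in a string column named "body" (as in my snippet below) and that a small sample is representative of your data; the sample size of 1000 is arbitrary.

import org.apache.spark.sql.functions._
import spark.implicits._

// 1. Infer the schema once, from a small sample of the JSON strings only.
val sample = df.select(col("body").cast("string")).limit(1000).as[String]
val schema = spark.read.json(sample.rdd).schema

// 2. Reuse the inferred schema with from_json across the full dataset,
//    so there is no second inference pass over billions of rows.
val parsed = df.select(from_json(col("body"), schema).as("json"))
parsed.printSchema()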
Then wouldn't this workaround easily reach the same goal, without such a new API?

Thanks.

On 6 Dec 2016 4:11 a.m., "kant kodali" <kanth...@gmail.com> wrote:

> Hi Kwon,
>
> Thanks for this, but isn't this what Michael suggested?
>
> Thanks,
> kant
>
> On Mon, Dec 5, 2016 at 4:45 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Kant,
>>
>> How about doing something like this?
>>
>> import org.apache.spark.sql.functions._
>>
>> // val df2 = df.select(df("body").cast(StringType).as("body"))
>> val df2 = Seq("""{"a": 1}""").toDF("body")
>> val schema = spark.read.json(df2.as[String].rdd).schema
>> df2.select(from_json(col("body"), schema)).show()
>>
>> 2016-12-05 19:51 GMT+09:00 kant kodali <kanth...@gmail.com>:
>>
>>> Hi Michael,
>>>
>>> "Personally, I usually take a small sample of data and use schema
>>> inference on that. I then hardcode that schema into my program. This
>>> makes your Spark jobs much faster and removes the possibility of the
>>> schema changing underneath the covers."
>>>
>>> This may or may not work for us. Not all rows have the same schema. The
>>> number of distinct schemas we have now may be small, but going forward it
>>> could grow arbitrarily; moreover, a distinct call can lead to a table
>>> scan, which for us can mean billions of rows.
>>>
>>> I would also agree with keeping the API consistent rather than making an
>>> exception; however, I wonder if it makes sense to provide an action that
>>> infers the schema and returns a new DataFrame once it finishes (after
>>> schema inference)? For example, something like the following:
>>>
>>> val inferredDF = df.inferSchema(col1)
>>>
>>> Thanks,
>>>
>>> On Mon, Nov 28, 2016 at 6:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> You could open up a JIRA to add a version of from_json that supports
>>>> schema inference, but unfortunately that would not be super easy to
>>>> implement. In particular, it would introduce a weird case where only
>>>> this specific function would block for a long time while we infer the
>>>> schema (instead of waiting for an action). This blocking would be kind
>>>> of odd for a call like df.select(...). If there is enough interest,
>>>> though, we should still do it.
>>>>
>>>> To give a little more detail, your version of the code is actually
>>>> doing two passes over the data: one to infer the schema and a second
>>>> for whatever processing you are asking it to do. We have to know the
>>>> schema at each step of DataFrame construction, so we'd have to do this
>>>> even before you called an action.
>>>>
>>>> Personally, I usually take a small sample of data and use schema
>>>> inference on that. I then hardcode that schema into my program. This
>>>> makes your Spark jobs much faster and removes the possibility of the
>>>> schema changing underneath the covers.
>>>>
>>>> Here's some code I use to build the static schema code automatically
>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1128172975083446/2840265927289860/latest.html>.
>>>>
>>>> Would that work for you? If not, why not?
>>>>
>>>> On Wed, Nov 23, 2016 at 2:48 AM, kant kodali <kanth...@gmail.com> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Looks like all from_json functions will require me to pass a schema,
>>>>> and that can be a little tricky for us, but the code below doesn't
>>>>> require me to pass a schema at all.
>>>>>
>>>>> import org.apache.spark.sql._
>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>> spark.read.json(rdd).show()
>>>>>
>>>>> On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>
>>>>>> The first release candidate should be coming out this week. You can
>>>>>> subscribe to the dev list if you want to follow the release schedule.
>>>>>>
>>>>>> On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> I only see Spark 2.0.2, which is what I am using currently. Any idea
>>>>>>> when 2.1 will be released?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> kant
>>>>>>>
>>>>>>> On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>>
>>>>>>>> In Spark 2.1 we've added a from_json
>>>>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2902>
>>>>>>>> function that I think will do what you want.
>>>>>>>>
>>>>>>>> On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This seems to work:
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql._
>>>>>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>>>>>> spark.read.json(rdd).show()
>>>>>>>>>
>>>>>>>>> However, I wonder if there is any inefficiency here, since I have
>>>>>>>>> to apply this function to billions of rows.
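P.S. For the hardcoded-schema approach Michael describes, a minimal sketch of what the resulting code looks like is below. The StructType fields are purely illustrative (they match the {"a": 1} sample above, not your real data); the actual definition would come from the schema-generation notebook he links.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Schema inferred once from a sample, then pasted into the program, so
// production jobs never pay for an inference pass over the full data.
val schema = StructType(Seq(
  StructField("a", LongType, nullable = true)  // illustrative field only
))

val parsed = df2.select(from_json(col("body"), schema).as("json"))
parsed.show()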