Hi Kant,

Ah, I thought you wanted to find a workaround for it.
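To make the workaround concrete, here is a minimal end-to-end sketch. It assumes the JSON strings live in a string column named "body" (as in my snippet below) and that a small sample is representative of your data; the sample size of 1000 is arbitrary.

import org.apache.spark.sql.functions._
import spark.implicits._

// 1. Infer the schema once, from a small sample of the JSON strings only.
val sample = df.select(col("body").cast("string")).limit(1000).as[String]
val schema = spark.read.json(sample.rdd).schema

// 2. Reuse the inferred schema with from_json across the full dataset,
//    so there is no second inference pass over billions of rows.
val parsed = df.select(from_json(col("body"), schema).as("json"))
parsed.printSchema()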
Then wouldn't this workaround easily reach the same goal, without such a new API?

Thanks.

On 6 Dec 2016 4:11 a.m., "kant kodali" <kanth...@gmail.com> wrote:

> Hi Kwon,
>
> Thanks for this, but isn't this what Michael suggested?
>
> Thanks,
> kant
>
> On Mon, Dec 5, 2016 at 4:45 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Kant,
>>
>> How about doing something like this?
>>
>> import org.apache.spark.sql.functions._
>>
>> // val df2 = df.select(df("body").cast(StringType).as("body"))
>> val df2 = Seq("""{"a": 1}""").toDF("body")
>> val schema = spark.read.json(df2.as[String].rdd).schema
>> df2.select(from_json(col("body"), schema)).show()
>>
>> 2016-12-05 19:51 GMT+09:00 kant kodali <kanth...@gmail.com>:
>>
>>> Hi Michael,
>>>
>>> "Personally, I usually take a small sample of data and use schema
>>> inference on that. I then hardcode that schema into my program. This
>>> makes your Spark jobs much faster and removes the possibility of the
>>> schema changing underneath the covers."
>>>
>>> This may or may not work for us. Not all rows have the same schema. The
>>> number of distinct schemas we have now may be small, but going forward it
>>> could grow arbitrarily; moreover, a distinct call can lead to a table
>>> scan, which for us can mean billions of rows.
>>>
>>> I would also agree with keeping the API consistent rather than making an
>>> exception; however, I wonder if it makes sense to provide an action that
>>> infers the schema and returns a new DataFrame once it finishes (after
>>> schema inference)? For example, something like the following:
>>>
>>> val inferredDF = df.inferSchema(col1)
>>>
>>> Thanks,
>>>
>>> On Mon, Nov 28, 2016 at 6:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> You could open up a JIRA to add a version of from_json that supports
>>>> schema inference, but unfortunately that would not be super easy to
>>>> implement. In particular, it would introduce a weird case where only
>>>> this specific function would block for a long time while we infer the
>>>> schema (instead of waiting for an action). This blocking would be kind
>>>> of odd for a call like df.select(...). If there is enough interest,
>>>> though, we should still do it.
>>>>
>>>> To give a little more detail, your version of the code is actually
>>>> doing two passes over the data: one to infer the schema and a second
>>>> for whatever processing you are asking it to do. We have to know the
>>>> schema at each step of DataFrame construction, so we'd have to do this
>>>> even before you called an action.
>>>>
>>>> Personally, I usually take a small sample of data and use schema
>>>> inference on that. I then hardcode that schema into my program. This
>>>> makes your Spark jobs much faster and removes the possibility of the
>>>> schema changing underneath the covers.
>>>>
>>>> Here's some code I use to build the static schema code automatically
>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1128172975083446/2840265927289860/latest.html>.
>>>>
>>>> Would that work for you? If not, why not?
>>>>
>>>> On Wed, Nov 23, 2016 at 2:48 AM, kant kodali <kanth...@gmail.com> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Looks like all from_json functions will require me to pass a schema,
>>>>> and that can be a little tricky for us, but the code below doesn't
>>>>> require me to pass a schema at all.
>>>>>
>>>>> import org.apache.spark.sql._
>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>> spark.read.json(rdd).show()
>>>>>
>>>>> On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>
>>>>>> The first release candidate should be coming out this week. You can
>>>>>> subscribe to the dev list if you want to follow the release schedule.
>>>>>>
>>>>>> On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> I only see Spark 2.0.2, which is what I am using currently. Any idea
>>>>>>> when 2.1 will be released?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> kant
>>>>>>>
>>>>>>> On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>>
>>>>>>>> In Spark 2.1 we've added a from_json
>>>>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2902>
>>>>>>>> function that I think will do what you want.
>>>>>>>>
>>>>>>>> On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This seems to work:
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql._
>>>>>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>>>>>> spark.read.json(rdd).show()
>>>>>>>>>
>>>>>>>>> However, I wonder if there is any inefficiency here, since I have
>>>>>>>>> to apply this function to billions of rows.
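P.S. For the hardcoded-schema approach Michael describes, a minimal sketch of what the resulting code looks like is below. The StructType fields are purely illustrative (they match the {"a": 1} sample above, not your real data); the actual definition would come from the schema-generation notebook he links.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Schema inferred once from a sample, then pasted into the program, so
// production jobs never pay for an inference pass over the full data.
val schema = StructType(Seq(
  StructField("a", LongType, nullable = true)  // illustrative field only
))

val parsed = df2.select(from_json(col("body"), schema).as("json"))
parsed.show()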