Will this be able to handle projection pushdown if a given job doesn't utilize all the columns in the schema? Or should people have a per-job schema?
On Wed, Sep 28, 2016 at 2:17 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Burak, you can configure what happens with corrupt records for the
> datasource using the parse mode. The parse will still fail, so we can't
> get any data out of it, but we do leave the JSON in another column for
> you to inspect.
>
> In the case of this function, we'll just return null if it's
> unparseable. You could filter for rows where the function returns null
> and inspect the input if you want to see what's going wrong.
>
>> When you talk about 'user specified schema' do you mean for the user to
>> supply an additional schema, or that you're using the schema that's
>> described by the JSON string?
>
> I mean we don't do schema inference (which we might consider adding, but
> that would be a much larger change than this PR). You need to construct
> a StructType that says what columns you want to extract from the JSON
> column and pass that in. I imagine in many cases the user will run
> schema inference ahead of time and then encode the inferred schema into
> their program.
>
> On Wed, Sep 28, 2016 at 11:04 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>> I would really love something like this! It would be great if it
>> doesn't throw away corrupt_records like the Data Source.
>>
>> On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande <nathanla...@gmail.com> wrote:
>>>
>>> We are currently pulling out the JSON columns, passing them through
>>> read.json, and then joining them back onto the initial DF, so something
>>> like from_json would be a nice quality-of-life improvement for us.
>>>
>>> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>> Spark SQL has great support for reading text files that contain JSON
>>>> data. However, in many cases the JSON data is just one column amongst
>>>> others. This is particularly true when reading from sources such as
>>>> Kafka.
>>>>
>>>> This PR adds a new function from_json that converts a string column
>>>> into a nested StructType with a user-specified schema, using the same
>>>> internal logic as the json Data Source.
>>>>
>>>> Would love to hear any comments / suggestions.
>>>>
>>>> Michael

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
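The semantics described in the thread, parsing a JSON string against a user-specified schema and returning null rather than failing on unparseable input, can be sketched in plain Python. This is only an illustration of the behavior being discussed, not the actual from_json API; the helper name and schema representation here are invented for the example:

```python
import json

def from_json_sketch(json_str, schema):
    """Mimic the described from_json behavior: extract only the fields
    named in the (user-specified) schema from a JSON string, returning
    None when the input cannot be parsed, instead of raising."""
    try:
        parsed = json.loads(json_str)
    except (ValueError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    # Keep only the columns named in the schema; missing fields become None.
    return {field: parsed.get(field) for field in schema}

rows = ['{"id": 1, "name": "a", "extra": true}', '{broken json']
schema = ["id", "name"]
results = [from_json_sketch(r, schema) for r in rows]
# results[0] -> {'id': 1, 'name': 'a'}; results[1] -> None
```

Filtering for rows where the result is None then corresponds to Michael's suggestion for inspecting inputs that failed to parse.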