[https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980]
Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:53 PM:
--------------------------------------------------------------------
Hah, I actually missed that, so my bad, thanks!
What I'm doing now is along the lines of the following (example adapted from
[here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html)
-- I only just realized that post was yours):
{code}
import org.apache.spark.sql.functions._  // for explode()
import sqlContext.implicits._            // for the $"..." column syntax

// A single JSON record containing an array of structs.
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))

// One row per array element; drop the original array column afterwards.
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
{code}
I must say I've been slightly puzzled by the convention of having to use
`explode` inside an enclosing `select` statement, though; as an unwitting
user, I'd expect `df.explode($"col")` to do something functionally equivalent
to the current `df.select($"*", explode($"col"))` without my having to type
that out. In my naivety, I'd reason: 'if I wanted to select just a subset of
columns, I could manually add a `select` to do so myself.'
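For illustration, a minimal sketch of the shorthand I mean, written as a user-side extension (`explodeCol` is a hypothetical name, not Spark API; `df.explode` itself is already taken by the existing lambda-based method, hence the different name):
{code}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical shorthand: keep all existing columns and append one
// exploded column, i.e. df.select($"*", explode(c).as(name)).
implicit class ExplodeOps(df: DataFrame) {
  def explodeCol(c: Column, name: String): DataFrame =
    df.select(col("*"), explode(c).as(name))
}

peopleDf.explodeCol($"schools", "school").show
{code}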
Obviously, changing user APIs is bad, and not everyone will have identical
expectations, but I'm just kind of curious. Was this an artifact of performance
considerations, or a deliberate part of a larger philosophy of having the
syntax be as explicit as possible?
Then again, aside from keeping the existing columns, a `drop` of the
pre-`explode` column would often seem a sensible default to me as well; so,
point taken that expectations may differ, in which case defaulting to whatever
takes the least processing definitely seems a sane choice.
Something else I was thinking about would be an `explodeZipped` kind of
function, to explode multiple equally-sized array columns together, as opposed
to chaining separate explodes, which yields a Cartesian product. I was still
looking into that, but at this point I wonder whether I've overlooked existing
functionality for that as well. :)
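For what it's worth, a rough sketch of that idea built from existing pieces: zip the arrays with a UDF so a single `explode` keeps the rows aligned (`zipArrays` is my own helper, not Spark API):
{code}
import org.apache.spark.sql.functions.{explode, udf}

// Zip two equally-sized array columns into one array of pairs, then
// explode once -- rows stay aligned instead of forming a Cartesian.
val zipArrays = udf { (a: Seq[String], b: Seq[Long]) => a.zip(b) }

val df = sqlContext.createDataFrame(Seq(
  ("Michael", Seq("stanford", "berkeley"), Seq(2010L, 2012L))
)).toDF("name", "snames", "years")

df.select($"name", explode(zipArrays($"snames", $"years")).as("school"))
  .select($"name", $"school._1".as("sname"), $"school._2".as("year"))
  .show
{code}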
> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Tycho Grouwstra
> Labels: features
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to
> explode an array of structs (as is common in JSON) into one row per struct,
> so I can start analyzing the data using GraphX. I believe many others might
> have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to
> DataFrames by [~marmbrus], which currently errors when you call it on such
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor
> function to infer column types -- this approach is insufficient in the case
> of Rows, since their type does not contain the required info. The alternative
> here would be to instead grab the schema info from the existing schema for
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay
> tuned until I've figured that out. I'm new here though so I'll probably have
> use for some feedback...
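For context, a minimal reproduction sketch of the failure described above, using the lambda-based `DataFrame.explode` and the `peopleDf` from the comment (assuming the Spark 1.5 API; exact error text may differ):
{code}
import org.apache.spark.sql.Row

// The existing method infers its output schema from the closure's
// result type via schemaFor; with B = Row, that type carries no field
// information, so this throws something like
// "Schema for type org.apache.spark.sql.Row is not supported".
peopleDf.explode[Seq[Row], Row]("schools", "school")(identity)
{code}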