[
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606328#comment-14606328
]
Nicholas Chammas edited comment on SPARK-8670 at 6/29/15 9:01 PM:
------------------------------------------------------------------
After a discussion with [~davies], it appears that the new way to access or
reference a nested field in 1.4 using the \_\_getitem\_\_ syntax is as follows:
{code}
# corrected example
df['stats']['age'] # 1.4 works, 1.3 doesn't
# original example
df['stats.age'] # 1.3 works, 1.4 doesn't
{code}
So it looks like something changed between 1.3 and 1.4, and the new way is the
way of the future.
Thankfully, the corrected example is clearer than the original, and I
understand from [~yhuai] that 1.4 now supports column names with dots in them,
so `df\['stats.age'\]` in 1.4 would reference a column literally named
`stats.age`, which does not exist here.
Marking this as not an issue, even though technically something that worked in
1.3 no longer works in 1.4.
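To make the ambiguity concrete, here is a plain-Python sketch (hypothetical helper functions, not Spark APIs) of the two ways a dotted string can be interpreted: 1.3 split it into a nested-field path, while the 1.4 {{\_\_getitem\_\_}} treats the whole string as a single column name:
{code}
# Hypothetical helpers (not Spark APIs) illustrating the two readings
# of a dotted name: nested-field path vs. literal column name.

nested_row = {'name': 'Nick', 'stats': {'age': 28}}  # nested field, as in the ticket
dotted_row = {'name': 'Nick', 'stats.age': 28}       # column name containing a dot

def lookup_as_path(row, name):
    # 1.3-style reading: split on dots and walk into nested fields.
    value = row
    for part in name.split('.'):
        value = value[part]
    return value

def lookup_as_name(row, name):
    # 1.4-style reading: the whole string is one column name.
    if name not in row:
        raise IndexError("no such column: %s" % name)
    return row[name]

print(lookup_as_path(nested_row, 'stats.age'))  # 28: the dot means nesting
print(lookup_as_name(dotted_row, 'stats.age'))  # 28: the dot is part of the name
# lookup_as_name(nested_row, 'stats.age') raises IndexError,
# mirroring the 1.4 traceback in this ticket.
{code}
Once the whole string is one name, a column that genuinely contains a dot is no longer shadowed by the nested-field interpretation, which is what makes dots-in-names supportable in 1.4.

```python
# Hypothetical helpers (not Spark APIs) illustrating the two readings
# of a dotted name: nested-field path vs. literal column name.

nested_row = {'name': 'Nick', 'stats': {'age': 28}}  # nested field, as in the ticket
dotted_row = {'name': 'Nick', 'stats.age': 28}       # column name containing a dot

def lookup_as_path(row, name):
    # 1.3-style reading: split on dots and walk into nested fields.
    value = row
    for part in name.split('.'):
        value = value[part]
    return value

def lookup_as_name(row, name):
    # 1.4-style reading: the whole string is one column name.
    if name not in row:
        raise IndexError("no such column: %s" % name)
    return row[name]

print(lookup_as_path(nested_row, 'stats.age'))  # 28: the dot means nesting
print(lookup_as_name(dotted_row, 'stats.age'))  # 28: the dot is part of the name
```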
was (Author: nchammas):
After a discussion with [~davies], it appears that the way to access or
reference a nested field in both 1.3 and 1.4 is as follows:
{code}
# corrected example
df['stats']['age'] # works on both 1.3 and 1.4
# original example
df['stats.age'] # 1.3 works, 1.4 doesn't
{code}
So I'm not sure this is a bug so much as it is just a misunderstanding of how
to access nested fields, combined with a change in how expressions are parsed.
Thankfully, the corrected example is clearer than the original, and I
understand from [~yhuai] that 1.4 now supports column names with dots in them,
so `df\['stats.age'\]` in 1.4 would reference a column literally named
`stats.age`, which does not exist here.
Marking this as not an issue.
> Nested columns can't be referenced (but they can be selected)
> -------------------------------------------------------------
>
> Key: SPARK-8670
> URL: https://issues.apache.org/jira/browse/SPARK-8670
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.4.0
> Reporter: Nicholas Chammas
>
> This is strange and looks like a regression from 1.3.
> {code}
> import json
> daterz = [
>     {
>         'name': 'Nick',
>         'stats': {
>             'age': 28
>         }
>     },
>     {
>         'name': 'George',
>         'stats': {
>             'age': 31
>         }
>     }
> ]
> df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
> df.select('stats.age').show()
> df['stats.age'] # 1.4 fails on this line
> {code}
> On 1.3 this works and yields:
> {code}
> age
> 28
> 31
> Out[1]: Column<stats.age AS age#2958L>
> {code}
> On 1.4, however, this gives an error on the last line:
> {code}
> +---+
> |age|
> +---+
> | 28|
> | 31|
> +---+
> ---------------------------------------------------------------------------
> IndexError Traceback (most recent call last)
> <ipython-input-1-04bd990e94c6> in <module>()
>      19
>      20 df.select('stats.age').show()
> ---> 21 df['stats.age']
> /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
>     678         if isinstance(item, basestring):
>     679             if item not in self.columns:
> --> 680                 raise IndexError("no such column: %s" % item)
>     681             jc = self._jdf.apply(item)
>     682             return Column(jc)
> IndexError: no such column: stats.age
> {code}
> This means, among other things, that you can't join DataFrames on nested
> columns.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)