[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas updated SPARK-8670:
------------------------------------
    Description: 
This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
    {
        'name': 'Nick',
        'stats': {
            'age': 28
        }
    },
    {
        'name': 'George',
        'stats': {
            'age': 31
        }
    }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681         jc = self._jdf.apply(item)
    682         return Column(jc)

IndexError: no such column: stats.age
{code}

This means, among other things, that you can't join DataFrames on nested columns.

  was:
This is strange and looks like a regression from 1.3.
{code}
import json

daterz = [
    {
        'name': 'Nick',
        'stats': {
            'age': 28
        }
    },
    {
        'name': 'George',
        'stats': {
            'age': 31
        }
    }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681         jc = self._jdf.apply(item)
    682         return Column(jc)

IndexError: no such column: stats.age
{code}


> Nested columns can't be referenced (but they can be selected)
> -------------------------------------------------------------
>
>                 Key: SPARK-8670
>                 URL: https://issues.apache.org/jira/browse/SPARK-8670
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0
>            Reporter: Nicholas Chammas
>
> This is strange and looks like a regression from 1.3.
> {code}
> import json
> daterz = [
>     {
>         'name': 'Nick',
>         'stats': {
>             'age': 28
>         }
>     },
>     {
>         'name': 'George',
>         'stats': {
>             'age': 31
>         }
>     }
> ]
> df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
> df.select('stats.age').show()
> df['stats.age']  # 1.4 fails on this line
> {code}
> On 1.3 this works and yields:
> {code}
> age
> 28
> 31
> Out[1]: Column<stats.age AS age#2958L>
> {code}
> On 1.4, however, this gives an error on the last line:
> {code}
> +---+
> |age|
> +---+
> | 28|
> | 31|
> +---+
> ---------------------------------------------------------------------------
> IndexError                                Traceback (most recent call last)
> <ipython-input-1-04bd990e94c6> in <module>()
>      19 
>      20 df.select('stats.age').show()
> ---> 21 df['stats.age']
> /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
>     678         if isinstance(item, basestring):
>     679             if item not in self.columns:
> --> 680                 raise IndexError("no such column: %s" % item)
>     681         jc = self._jdf.apply(item)
>     682         return Column(jc)
> IndexError: no such column: stats.age
> {code}
> This means, among other things, that you can't join DataFrames on nested
> columns.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
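The traceback in the report points at the membership check added to {{DataFrame.\_\_getitem\_\_}} in 1.4. The mechanism can be sketched in plain Python with no Spark installed; note that {{columns}} and {{getitem}} below are illustrative stand-ins, not the real PySpark code, and the sketch only mirrors the guard shown in the traceback:

```python
# Stand-in for DataFrame.columns: it lists only top-level field names,
# so a dotted path to a nested field ('stats.age') is never a member.
columns = ['name', 'stats']

def getitem(item):
    # Mirrors the 1.4 guard from the traceback: a plain-string key must
    # be an exact member of `columns`, which rejects nested paths that
    # select() can nevertheless resolve through the analyzer.
    if isinstance(item, str):
        if item not in columns:
            raise IndexError("no such column: %s" % item)
    return item

print(getitem('stats'))       # top-level column: passes the check
try:
    getitem('stats.age')      # nested path: fails, as in the report
except IndexError as e:
    print(e)                  # no such column: stats.age
```

Because the check compares against top-level names only, any syntax that routes a raw string through {{\_\_getitem\_\_}} (including join conditions written as strings) hits the same IndexError; reaching the field through the parent Column (e.g. {{df['stats']['age']}}) may sidestep it, though that workaround is not part of the report above.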