[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas updated SPARK-8670:
------------------------------------
    Description: 
This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
    {
        'name': 'Nick',
        'stats': {
            'age': 28
        }
    },
    {
        'name': 'George',
        'stats': {
            'age': 31
        }
    }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681         jc = self._jdf.apply(item)
    682         return Column(jc)

IndexError: no such column: stats.age
{code}

This means, among other things, that you can't join DataFrames on nested columns.

  was:
This is strange and looks like a regression from 1.3.
{code}
import json

daterz = [
    {
        'name': 'Nick',
        'stats': {
            'age': 28
        }
    },
    {
        'name': 'George',
        'stats': {
            'age': 31
        }
    }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681         jc = self._jdf.apply(item)
    682         return Column(jc)

IndexError: no such column: stats.age
{code}


> Nested columns can't be referenced (but they can be selected)
> -------------------------------------------------------------
>
>                 Key: SPARK-8670
>                 URL: https://issues.apache.org/jira/browse/SPARK-8670
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0
>            Reporter: Nicholas Chammas
>
> This is strange and looks like a regression from 1.3.
> {code}
> import json
> daterz = [
>     {
>         'name': 'Nick',
>         'stats': {
>             'age': 28
>         }
>     },
>     {
>         'name': 'George',
>         'stats': {
>             'age': 31
>         }
>     }
> ]
> df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
> df.select('stats.age').show()
> df['stats.age']  # 1.4 fails on this line
> {code}
> On 1.3 this works and yields:
> {code}
> age
> 28
> 31
> Out[1]: Column<stats.age AS age#2958L>
> {code}
> On 1.4, however, this gives an error on the last line:
> {code}
> +---+
> |age|
> +---+
> | 28|
> | 31|
> +---+
> ---------------------------------------------------------------------------
> IndexError                                Traceback (most recent call last)
> <ipython-input-1-04bd990e94c6> in <module>()
>      19 
>      20 df.select('stats.age').show()
> ---> 21 df['stats.age']
> /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
>     678         if isinstance(item, basestring):
>     679             if item not in self.columns:
> --> 680                 raise IndexError("no such column: %s" % item)
>     681         jc = self._jdf.apply(item)
>     682         return Column(jc)
> IndexError: no such column: stats.age
> {code}
> This means, among other things, that you can't join DataFrames on nested
> columns.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
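The traceback in the report points at the membership check added to {{DataFrame.\_\_getitem\_\_}} in 1.4. The mechanism can be sketched in plain Python with no Spark installed; note that {{columns}} and {{getitem}} below are illustrative stand-ins, not the real PySpark code, and the sketch only mirrors the guard shown in the traceback:

```python
# Stand-in for DataFrame.columns: it lists only top-level field names,
# so a dotted path to a nested field ('stats.age') is never a member.
columns = ['name', 'stats']

def getitem(item):
    # Mirrors the 1.4 guard from the traceback: a plain-string key must
    # be an exact member of `columns`, which rejects nested paths that
    # select() can nevertheless resolve through the analyzer.
    if isinstance(item, str):
        if item not in columns:
            raise IndexError("no such column: %s" % item)
    return item

print(getitem('stats'))       # top-level column: passes the check
try:
    getitem('stats.age')      # nested path: fails, as in the report
except IndexError as e:
    print(e)                  # no such column: stats.age
```

Because the check compares against top-level names only, any syntax that routes a raw string through {{\_\_getitem\_\_}} (including join conditions written as strings) hits the same IndexError; reaching the field through the parent Column (e.g. {{df['stats']['age']}}) may sidestep it, though that workaround is not part of the report above.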