[
https://issues.apache.org/jira/browse/SPARK-21537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric O. LEBIGOT (EOL) updated SPARK-21537:
------------------------------------------
Description:
The conversion of a *PySpark dataframe with nested columns* to Pandas (with
`toPandas()`) does not convert nested columns into their Pandas equivalent,
i.e. *columns indexed by a
[MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.
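For context, columns indexed by a MultiIndex are addressable with tuples in plain Pandas; a minimal sketch (the column names below mirror the schema further down, but the data is made up):

```python
import pandas as pd

# Build a frame whose columns form a 3-level MultiIndex, i.e. the shape
# that a nested struct column would naturally map to.
columns = pd.MultiIndex.from_tuples([
    ("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "delay_s"),
    ("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"),
])
df = pd.DataFrame([[0.0, 6]], columns=columns)

# Tuple indexing selects the nested field directly.
df["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]
```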
For example, a dataframe with the following structure:
{code:java}
>>> df.printSchema()
root
|-- device_ID: string (nullable = true)
|-- time_origin_UTC: timestamp (nullable = true)
|-- duration_s: integer (nullable = true)
|-- session_time_UTC: timestamp (nullable = true)
|-- probes_by_AP: struct (nullable = true)
| |-- aa:bb:cc:dd:ee:ff: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- delay_s: float (nullable = true)
| | | |-- RSSI: short (nullable = true)
|-- max_RSSI_info_by_AP: struct (nullable = true)
| |-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
| | |-- delay_s: float (nullable = true)
| | |-- RSSI: short (nullable = true)
{code}
yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_
nested inside Pandas (through a MultiIndex):
{code}
>>> df_pandas_version = df.toPandas()
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
>>> df_pandas_version["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
>>> type(_) # PySpark type, instead of Pandas!
pyspark.sql.types.Row
{code}
It would be much more convenient if `toPandas()` performed the conversion
thoroughly, mapping nested struct columns to Pandas columns indexed by a
MultiIndex.
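In the meantime, the nested values can be flattened by hand after `toPandas()`; a minimal workaround sketch in plain Pandas (the helper name and sample data are hypothetical, standing in for the `Row` objects that `toPandas()` currently returns):

```python
import pandas as pd

# Hypothetical stand-in for a toPandas() result whose struct column
# still holds nested, dict-like values.
df_pandas = pd.DataFrame({
    "max_RSSI_info_by_AP": [
        {"aa:bb:cc:dd:ee:ff": {"delay_s": 0.0, "RSSI": 6}},
        {"aa:bb:cc:dd:ee:ff": {"delay_s": 1.5, "RSSI": -40}},
    ],
})

def struct_to_multiindex(series, name):
    """Expand a Series of two-level mappings into MultiIndex columns."""
    records = [
        {(name, ap, field): value
         for ap, inner in row.items()
         for field, value in inner.items()}
        for row in series
    ]
    out = pd.DataFrame(records, index=series.index)
    out.columns = pd.MultiIndex.from_tuples(out.columns)
    return out

nested = struct_to_multiindex(df_pandas["max_RSSI_info_by_AP"],
                              "max_RSSI_info_by_AP")
# The tuple lookup from the report now works:
nested["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]
```

With real PySpark `Row` objects, one would first convert them to plain dicts with `row.asDict(recursive=True)` before applying such a helper.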
was:
The conversion of a *PySpark dataframe with nested columns* to Pandas (with
`toPandas()`) does not convert nested columns into their Pandas equivalent,
i.e. *columns indexed by a
[MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.
For example, a dataframe with the following structure:
{code:java}
>>> df.printSchema()
root
|-- device_ID: string (nullable = true)
|-- time_origin_UTC: timestamp (nullable = true)
|-- duration_s: integer (nullable = true)
|-- session_time_UTC: timestamp (nullable = true)
|-- probes_by_AP: struct (nullable = true)
| |-- aa:bb:cc:dd:ee:ff: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- delay_s: float (nullable = true)
| | | |-- RSSI: short (nullable = true)
|-- max_RSSI_info_by_AP: struct (nullable = true)
| |-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
| | |-- delay_s: float (nullable = true)
| | |-- RSSI: short (nullable = true)
{code}
yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_
nested inside Pandas (through a MultiIndex):
{code}
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
>>> sessions_in_period["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
>>> type(_) # PySpark type, instead of Pandas!
pyspark.sql.types.Row
{code}
It would be much more convenient if the Spark dataframe did the conversion to
Pandas more thoroughly.
> toPandas() should handle nested columns (as a Pandas MultiIndex)
> ----------------------------------------------------------------
>
> Key: SPARK-21537
> URL: https://issues.apache.org/jira/browse/SPARK-21537
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Eric O. LEBIGOT (EOL)
> Labels: pandas
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)