Eric O. LEBIGOT (EOL) created SPARK-21537:
---------------------------------------------
Summary: toPandas() should handle nested columns (as a Pandas
MultiIndex)
Key: SPARK-21537
URL: https://issues.apache.org/jira/browse/SPARK-21537
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 2.2.0
Reporter: Eric O. LEBIGOT (EOL)
Converting a *PySpark dataframe with nested columns* to Pandas (with
{{toPandas()}}) does not map nested columns to their Pandas equivalent,
i.e. *columns indexed by a
[MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.
For example, a dataframe with the following structure:
{code:java}
>>> df.printSchema()
root
|-- device_ID: string (nullable = true)
|-- time_origin_UTC: timestamp (nullable = true)
|-- duration_s: integer (nullable = true)
|-- session_time_UTC: timestamp (nullable = true)
|-- probes_by_AP: struct (nullable = true)
| |-- aa:bb:cc:dd:ee:ff: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- delay_s: float (nullable = true)
| | | |-- RSSI: short (nullable = true)
|-- max_RSSI_info_by_AP: struct (nullable = true)
| |-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
| | |-- delay_s: float (nullable = true)
| | |-- RSSI: short (nullable = true)
{code}
yields a Pandas dataframe where the {{max_RSSI_info_by_AP}} column is _not_
nested inside Pandas (i.e. not indexed by a MultiIndex):
{code}
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
>>> sessions_in_period["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
>>> type(_) # PySpark type, instead of Pandas!
pyspark.sql.types.Row
{code}
It would be much more convenient if {{toPandas()}} performed the conversion
thoroughly, mapping nested struct columns to a Pandas MultiIndex.
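Until such support exists, a client-side workaround is possible: collect the rows, recursively flatten each nested {{Row}} (e.g. via {{Row.asDict(recursive=True)}}) into tuple-keyed columns, and rebuild the column index as a Pandas MultiIndex. A minimal sketch, using plain nested dicts (and made-up values) to stand in for the collected Rows of the example schema:

{code:python}
import pandas as pd

def flatten(record, prefix=()):
    """Recursively flatten a nested dict into {column_path_tuple: value}."""
    flat = {}
    for key, value in record.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# Hypothetical data mimicking what Row.asDict(recursive=True) would return
# for one row of the example dataframe (values are illustrative).
rows = [
    {"device_ID": "d1",
     "max_RSSI_info_by_AP": {
         "aa:bb:cc:dd:ee:ff": {"delay_s": 0.0, "RSSI": 6}}},
]

df = pd.DataFrame([flatten(r) for r in rows])
# Pad shorter paths so every column tuple has the same depth.
depth = max(len(c) for c in df.columns)
df.columns = pd.MultiIndex.from_tuples(
    [c + ("",) * (depth - len(c)) for c in df.columns])

# Nested lookup now works the way the report expects:
print(df["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"])
{code}

This flattens everything eagerly, so array-of-struct columns (like {{probes_by_AP}} above) would still need separate handling; a built-in conversion in {{toPandas()}} would avoid that.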
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)