Xinrong Meng created SPARK-39550:
------------------------------------
Summary: Fix `MultiIndex.value_counts()` when Arrow Execution is
enabled
Key: SPARK-39550
URL: https://issues.apache.org/jira/browse/SPARK-39550
Project: Spark
Issue Type: Bug
Components: Pandas API on Spark, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng
When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'} 1
{'__index_level_0__': 2, '__index_level_1__': 'b'} 1
dtype: int64
{code}
When Arrow Execution is disabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a) 1
(2, b) 1
dtype: int64
{code}
Notice that the indexes of the two results differ.
In particular, `value_counts` returns an Index (rather than a MultiIndex), which under
the hood is backed by a single Spark column of StructType (rather than multiple Spark
columns). When Arrow Execution is enabled, Arrow converts each StructType value to a
dictionary, whereas a tuple is expected.
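The normalization the fix needs can be illustrated with a plain-Python sketch (no Spark required); the helper name `struct_to_tuple` is hypothetical, not part of the actual patch:

{code:python}
def struct_to_tuple(index_value):
    """Convert a dict such as
    {'__index_level_0__': 1, '__index_level_1__': 'a'}
    (what Arrow produces for a StructType index column) into the
    tuple (1, 'a') that value_counts() on a MultiIndex should use."""
    if isinstance(index_value, dict):
        # Struct fields keep their declared order, so values()
        # preserves the index level order.
        return tuple(index_value.values())
    return index_value

# Index values as materialized under Arrow Execution:
arrow_index = [
    {'__index_level_0__': 1, '__index_level_1__': 'a'},
    {'__index_level_0__': 2, '__index_level_1__': 'b'},
]
print([struct_to_tuple(v) for v in arrow_index])  # [(1, 'a'), (2, 'b')]
{code}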
--
This message was sent by Atlassian Jira
(v8.20.7#820007)