[
https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557145#comment-17557145
]
Xinrong Meng commented on SPARK-39550:
--------------------------------------
I am working on that.
> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
> ---------------------------------------------------------------
>
> Key: SPARK-39550
> URL: https://issues.apache.org/jira/browse/SPARK-39550
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark, PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Priority: Major
>
>
> When Arrow Execution is enabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'true'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> {'__index_level_0__': 1, '__index_level_1__': 'a'} 1
> {'__index_level_0__': 2, '__index_level_1__': 'b'} 1
> dtype: int64
> {code}
> When Arrow Execution is disabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'false'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> (1, a) 1
> (2, b) 1
> dtype: int64
> {code}
> Notice that the indexes of the two results differ.
> Specifically, `value_counts` returns an Index (rather than a MultiIndex), which is
> backed under the hood by a single Spark column of StructType (rather than multiple
> Spark columns). So when Arrow Execution is enabled, Arrow converts each row of the
> StructType column to a dictionary, whereas a tuple is expected.
>
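The mismatch described above can be sketched in plain Python. This is a hypothetical illustration, not the actual fix: the helper name `restore_struct_index` and the hard-coded sample data are made up; it only shows how the dict-shaped rows Arrow produces for a StructType column could be mapped back to the tuples the non-Arrow path returns.

```python
# Hypothetical sketch: Arrow represents each StructType row as a dict keyed by
# field name; the expected index values are plain tuples. Converting the dict
# values back to a tuple (field order is preserved in the struct) recovers the
# non-Arrow behavior.

def restore_struct_index(values):
    """Convert dict-shaped StructType rows back into plain tuples."""
    return [tuple(v.values()) if isinstance(v, dict) else v for v in values]

# Sample data mimicking the Arrow-enabled output shown in the description.
arrow_rows = [
    {'__index_level_0__': 1, '__index_level_1__': 'a'},
    {'__index_level_0__': 2, '__index_level_1__': 'b'},
]
print(restore_struct_index(arrow_rows))  # [(1, 'a'), (2, 'b')]
```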
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]