[
https://issues.apache.org/jira/browse/SPARK-31186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-31186.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 28025
[https://github.com/apache/spark/pull/28025]
> toPandas fails on simple query (collect() works)
> ------------------------------------------------
>
> Key: SPARK-31186
> URL: https://issues.apache.org/jira/browse/SPARK-31186
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.4
> Reporter: Michael Chirico
> Assignee: L. C. Hsieh
> Priority: Minor
> Fix For: 3.0.0
>
>
> My pandas version is 0.25.1.
> I ran the following simple code (cross joins are enabled):
> {code:python}
> spark.sql('''
> select t1.*, t2.* from (
> select explode(sequence(1, 3)) v
> ) t1 left join (
> select explode(sequence(1, 3)) v
> ) t2
> ''').toPandas()
> {code}
> and got a ValueError from pandas:
> > ValueError: The truth value of a Series is ambiguous. Use a.empty,
> > a.bool(), a.item(), a.any() or a.all().
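> For context, here is a minimal pandas-only sketch (an editorial illustration, not the actual Spark code path) of how duplicate column labels can produce exactly this error:
> {code:python}
> import pandas as pd
>
> # Two columns sharing the label 'v', as in the failing query
> pdf = pd.DataFrame([[1, 1], [1, 2]], columns=['v', 'v'])
>
> # Selecting a duplicated label yields a 2-column DataFrame, not a Series,
> # so the null check collapses to a 2-element Series instead of one bool
> check = pdf['v'].isnull().any()
>
> bool(check)
> # ValueError: The truth value of a Series is ambiguous. Use a.empty,
> # a.bool(), a.item(), a.any() or a.all().
> {code}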
> Collect works fine:
> {code:python}
> spark.sql('''
> select * from (
> select explode(sequence(1, 3)) v
> ) t1 left join (
> select explode(sequence(1, 3)) v
> ) t2
> ''').collect()
> # [Row(v=1, v=1),
> # Row(v=1, v=2),
> # Row(v=1, v=3),
> # Row(v=2, v=1),
> # Row(v=2, v=2),
> # Row(v=2, v=3),
> # Row(v=3, v=1),
> # Row(v=3, v=2),
> # Row(v=3, v=3)]
> {code}
> I imagine it's related to the duplicate column names, but this doesn't fail:
> {code:python}
> spark.sql("select 1 v, 1 v").toPandas()
> # v v
> # 0 1 1
> {code}
> Also no issue for multiple rows:
> {code:python}
> spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
> {code}
> It also works when not using a cross join but a janky
> programmatically-generated union all query:
> {code:python}
> cond = []
> for ii in range(3):
>     for jj in range(3):
>         cond.append(f'select {ii+1} v, {jj+1} v')
> spark.sql(' union all '.join(cond)).toPandas()
> {code}
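> A possible workaround (an editorial sketch, not from the original report) is to give the columns unique names before converting, e.g. via DataFrame.toDF, so pandas never sees duplicate labels:
> {code:python}
> df = spark.sql('''
> select t1.v, t2.v from (
> select explode(sequence(1, 3)) v
> ) t1 left join (
> select explode(sequence(1, 3)) v
> ) t2
> ''')
> # Renaming to unique labels ('v1', 'v2' are arbitrary) sidesteps the ambiguity
> df.toDF('v1', 'v2').toPandas()
> {code}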
> As near as I can tell, the union all output is identical to the explode
> output, which makes this issue all the more peculiar. I thought toPandas()
> was applied to the output of collect(), so if collect() gives the same
> output, how can toPandas() fail in one case and not the other? Further, the
> lazy DataFrame is the same in both cases: DataFrame[v: int, v: int]. I must
> be missing something.
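> A hedged sketch of why toPandas() can diverge from collect() (a simplified illustration of the non-Arrow conversion path, not the actual Spark source): after collecting the rows, toPandas() builds a pandas frame and then inspects each column by name to correct dtypes; with duplicate names, that per-column lookup returns a multi-column object whose truth value is ambiguous:
> {code:python}
> import pandas as pd
>
> rows = [(1, 1), (1, 2), (1, 3)]   # what collect() hands back, as plain tuples
> columns = ['v', 'v']              # duplicate names from the query schema
>
> pdf = pd.DataFrame.from_records(rows, columns=columns)
> for name in columns:
>     # pdf[name] is a 2-column DataFrame here, so the null check yields a
>     # 2-element Series and truth-testing it raises the reported ValueError.
>     if pdf[name].isnull().any():
>         pass
> {code}
> collect() never touches pandas, which is why the same query succeeds there.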