Michael Chirico created SPARK-31186:
---------------------------------------

             Summary: toPandas fails on simple query (collect() works)
                 Key: SPARK-31186
                 URL: https://issues.apache.org/jira/browse/SPARK-31186
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.4
            Reporter: Michael Chirico


My pandas version is 0.25.1.

I ran the following simple code (cross joins are enabled):

{code:python}
spark.sql('''
select t1.*, t2.* from (
  select explode(sequence(1, 3)) v
) t1 left join (
  select explode(sequence(1, 3)) v
) t2
''').toPandas()
{code}

and got a ValueError from pandas:

> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), 
> a.item(), a.any() or a.all().
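
For what it's worth, the error looks like it comes from the pandas side of the conversion rather than from collecting the data. A minimal pandas-only sketch (no Spark involved) that produces the exact same ValueError, assuming the duplicate column name is what matters:

{code:python}
import pandas as pd

# two columns that share the label 'v', as in the query above
pdf = pd.DataFrame([[1, None]], columns=['v', 'v'])

sub = pdf['v']              # with a duplicated label this is a DataFrame, not a Series
flags = sub.isnull().any()  # DataFrame.any() gives one boolean per column -> a Series
if flags:                   # ValueError: The truth value of a Series is ambiguous...
    pass
{code}

My guess (unverified) is that the non-Arrow toPandas() path does a per-column null/dtype check by column name and trips over exactly this when the name is duplicated.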

collect() works fine:

{code:python}
spark.sql('''
select * from (
  select explode(sequence(1, 3)) v
) t1 left join (
  select explode(sequence(1, 3)) v
) t2
''').collect()
# [Row(v=1, v=1),
#  Row(v=1, v=2),
#  Row(v=1, v=3),
#  Row(v=2, v=1),
#  Row(v=2, v=2),
#  Row(v=2, v=3),
#  Row(v=3, v=1),
#  Row(v=3, v=2),
#  Row(v=3, v=3)]
{code}
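
If I read the non-Arrow conversion path correctly, toPandas() starts from roughly the step below, and that step alone copes with the duplicate names just fine, so the failure is presumably in a later dtype-correction pass. A sketch, assuming `spark` is the same session as above:

{code:python}
import pandas as pd

df = spark.sql('''
select t1.*, t2.* from (
  select explode(sequence(1, 3)) v
) t1 left join (
  select explode(sequence(1, 3)) v
) t2
''')

# building the frame straight from collect() works, duplicate column names and all
pdf = pd.DataFrame.from_records(df.collect(), columns=df.columns)
pdf.shape  # (9, 2)
{code}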

I imagine it's related to the duplicate column names, but this doesn't fail:

{code:python}
spark.sql("select 1 v, 1 v").toPandas()
#    v  v
# 0  1  1
{code}

There is also no issue for multiple rows:

{code:python}
spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
{code}

It also works when using a janky, programmatically generated union all query instead of the cross join:

{code:python}
cond = []
for ii in range(3):
    for jj in range(3):
        cond.append(f'select {ii+1} v, {jj+1} v')
spark.sql(' union all '.join(cond)).toPandas()
{code}

As near as I can tell, the output is identical to the explode output, which makes this all the more peculiar. I thought toPandas() was applied to the output of collect(), so if collect() gives the same output in both cases, how can toPandas() fail in one and not the other? Further, the lazy DataFrame is the same in both cases: DataFrame[v: int, v: int]. I must be missing something.
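
In case it helps anyone hitting the same thing: a workaround that seems plausible (I have not tested it against every case) is to give the columns distinct names before converting, e.g. with toDF():

{code:python}
df = spark.sql('''
select t1.*, t2.* from (
  select explode(sequence(1, 3)) v
) t1 left join (
  select explode(sequence(1, 3)) v
) t2
''')

# rename to unique labels before the pandas conversion
df.toDF('v1', 'v2').toPandas()
{code}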


