[ 
https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759571#comment-16759571
 ] 

Hyukjin Kwon commented on SPARK-26810:
--------------------------------------

I'm leaving this as a duplicate of SPARK-23299. Thanks for reporting this with 
detailed info.

> Fixing SPARK-25072 broke existing code and fails to show error message
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26810
>                 URL: https://issues.apache.org/jira/browse/SPARK-26810
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Arttu Voutilainen
>            Priority: Minor
>
> Hey,
> We upgraded Spark recently, and 
> https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail 
> after the upgrade. Annoyingly, the error message formatting also threw an 
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
>     "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
>     return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), 
> T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The 
> unexpected error came from __repr__, which assumes that the arguments given to 
> the Row constructor are strings. That sounds like a reasonable assumption, but 
> maybe the Row constructor should validate that it holds? (I guess that might be 
> another potentially breaking change, though, if someone else has code as weird 
> as ours...)
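
The failure mode described above can be reproduced without Spark. The following is only a minimal sketch: MiniRow and StrictMiniRow are hypothetical, simplified stand-ins and not the real pyspark.sql.Row. It shows how building the intended error message calls __repr__, which raises its own TypeError when a field is not a string, and how an up-front check in the constructor (one possible reading of the validation suggested above) would surface the problem earlier.

```python
class MiniRow(tuple):
    """Hypothetical, simplified stand-in for pyspark.sql.Row (not the real class)."""

    def __new__(cls, *fields):
        return tuple.__new__(cls, fields)

    def __call__(self, *args):
        if len(args) > len(self):
            # Formatting self with %s goes through __repr__, which may itself raise
            # and so replace this ValueError with an unrelated TypeError.
            raise ValueError("Can not create Row with fields %s, expected %d values "
                             "but got %s" % (self, len(self), args))
        return tuple(args)

    def __repr__(self):
        # Assumes every field is a str; str.join rejects anything else.
        return "<Row(%s)>" % ", ".join(self)


class StrictMiniRow(MiniRow):
    """Hypothetical variant that validates field names in the constructor."""

    def __new__(cls, *fields):
        bad = [f for f in fields if not isinstance(f, str)]
        if bad:
            # Use repr() of the plain values, so this message cannot fail to format.
            raise TypeError("field names must be str, got %r" % (bad,))
        return MiniRow.__new__(cls, *fields)


def error_name(fn):
    """Call fn and return the name of the exception it raises, or None."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__
    return None


# One list-valued field, two values: the intended ValueError is masked.
masked = error_name(lambda: MiniRow(['a', 'b'])('1', '2'))
# String fields: the intended ValueError is raised with a readable message.
intended = error_name(lambda: MiniRow('a', 'b')('1', '2', '3'))
# Validating constructor: the bad field is rejected immediately.
early = error_name(lambda: StrictMiniRow(['a', 'b']))
print(masked, intended, early)
```

As noted in the report, a validating constructor would itself be a behavior change for any caller that passes non-string fields, so this is an illustration of the trade-off rather than a proposed patch.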


