[ https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759571#comment-16759571 ]
Hyukjin Kwon commented on SPARK-26810:
--------------------------------------
I'm leaving this as a duplicate of SPARK-23299. Thanks for reporting this with
detailed info.
> Fixing SPARK-25072 broke existing code and fails to show error message
> ----------------------------------------------------------------------
>
> Key: SPARK-26810
> URL: https://issues.apache.org/jira/browse/SPARK-26810
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Arttu Voutilainen
> Priority: Minor
>
> Hey,
> We upgraded Spark recently, and the change made for
> https://issues.apache.org/jira/browse/SPARK-25072 broke our pipeline.
> Annoyingly, formatting the error message itself raised an exception,
> hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
>     "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
>     return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1, and also showing how this was used:
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()),
>     T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The
> unexpected error came from __repr__, which assumes that the arguments given
> to the Row constructor are strings. That sounds like a reasonable assumption;
> should the Row constructor perhaps validate that it holds? (I guess that
> might be another potentially breaking change, though, if someone has code as
> weird as this...)
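
The mechanism described above can be reproduced without a Spark install. The
following is only a minimal sketch: MiniRow, validated_row and repr_error are
hypothetical names made up for illustration, not PySpark's API, and the
validation shown is just one way the suggested constructor check might look.
The only part taken from the report is the join expression in __repr__, copied
from the 2.4.0 traceback.

```python
class MiniRow(tuple):
    """Hypothetical stand-in for pyspark.sql.Row; NOT the real class."""

    def __new__(cls, *fields):
        # Store the positional arguments as the tuple's items, unchecked.
        return super().__new__(cls, fields)

    def __repr__(self):
        # Same expression as in the 2.4.0 traceback: str.join only accepts
        # str items, so a list passed as a field makes repr() itself fail.
        return "<Row(%s)>" % ", ".join(self)


def validated_row(*fields):
    """Hypothetical constructor check: reject non-string fields up front."""
    bad = [f for f in fields if not isinstance(f, str)]
    if bad:
        # %r is safe here because it never calls MiniRow.__repr__.
        raise TypeError("Row field names must be str, got %r" % (bad,))
    return MiniRow(*fields)


def repr_error(row):
    """Return the message raised by repr(row), or None if repr succeeds."""
    try:
        repr(row)
    except TypeError as exc:
        return str(exc)
    return None


flat = MiniRow('a', 'b')       # two string fields: repr works
nested = MiniRow(['a', 'b'])   # one list field, as in the repro
print(repr(flat))
print(repr_error(nested))
```

With a check like this, Row(['a','b']) would fail at construction time with a
readable message instead of failing later inside the error formatting. As
noted in the report, that would itself be a behaviour change for code that
currently relies on passing a list.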