srowen commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala. URL: https://github.com/apache/spark/pull/23534#discussion_r247525774
########## File path: python/pyspark/sql/dataframe.py ##########

@@ -109,15 +109,18 @@ def stat(self):
     @ignore_unicode_prefix
     @since(1.3)
     def toJSON(self, use_unicode=True):
-        """Converts a :class:`DataFrame` into a :class:`RDD` of string.
+        """Converts a :class:`DataFrame` into a :class:`DataFrame` of JSON string.
 
-        Each row is turned into a JSON document as one element in the returned RDD.
+        Each row is turned into a JSON document as one element in the returned DataFrame.
 
         >>> df.toJSON().first()
-        u'{"age":2,"name":"Alice"}'
+        Row(value=u'{"age":2,"name":"Alice"}')
         """
-        rdd = self._jdf.toJSON()
-        return RDD(rdd.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode))
+        jdf = self._jdf.toJSON()
+        if self.sql_ctx._conf.pysparkDataFrameToJSONShouldReturnDataFrame():
+            return DataFrame(jdf, self.sql_ctx)
+        else:
+            return RDD(jdf.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode))

Review comment:

I agree it's not clear which one is more consistent. I do not think we should add a config -- config options are, as someone once said, often a failure of design that punts unresolved questions to the user. I think I'd slightly prefer keeping an RDD here if there's no compelling reason to significantly change the PySpark API.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
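To make the API question concrete: today PySpark's `toJSON()` returns an RDD of plain JSON strings, while Scala's `Dataset.toJSON` returns a `Dataset[String]`, which in Python would surface as single-column `Row(value=...)` records. The sketch below illustrates the two return shapes using only the standard library (no Spark); the `Row` namedtuple is a hypothetical stand-in for `pyspark.sql.Row`, not the real class.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row with a single "value" column,
# mimicking what the DataFrame-returning variant would yield.
Row = namedtuple("Row", ["value"])

rows = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]

# Current PySpark behavior: toJSON() yields plain JSON strings (RDD of str).
as_strings = [json.dumps(r, separators=(",", ":"), sort_keys=True) for r in rows]

# Behavior proposed in the diff: each JSON string wrapped in a one-column row,
# mirroring Scala's Dataset[String] where the column is named "value".
as_rows = [Row(value=s) for s in as_strings]

print(as_strings[0])   # {"age":2,"name":"Alice"}
print(as_rows[0])      # Row(value='{"age":2,"name":"Alice"}')
```

Callers written against the RDD shape (`.first()` giving a `str`) would break under the DataFrame shape (`.first()` giving a row), which is the compatibility concern behind keeping the RDD return type.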