HyukjinKwon commented on a change in pull request #23534: [SPARK-26610][PYTHON]
Fix inconsistency between toJSON Method in Python and Scala.
URL: https://github.com/apache/spark/pull/23534#discussion_r247455277
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -109,15 +109,18 @@ def stat(self):
@ignore_unicode_prefix
@since(1.3)
def toJSON(self, use_unicode=True):
- """Converts a :class:`DataFrame` into a :class:`RDD` of string.
+ """Converts a :class:`DataFrame` into a :class:`DataFrame` of JSON
string.
- Each row is turned into a JSON document as one element in the returned
RDD.
+ Each row is turned into a JSON document as one element in the returned
DataFrame.
>>> df.toJSON().first()
- u'{"age":2,"name":"Alice"}'
+ Row(value=u'{"age":2,"name":"Alice"}')
"""
- rdd = self._jdf.toJSON()
- return RDD(rdd.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode))
+ jdf = self._jdf.toJSON()
+ if self.sql_ctx._conf.pysparkDataFrameToJSONShouldReturnDataFrame():
+ return DataFrame(jdf, self.sql_ctx)
+ else:
+ return RDD(jdf.toJavaRDD(), self._sc,
UTF8Deserializer(use_unicode))
Review comment:
The problem is that the API usage is different. Currently you can, for
instance, call `map` directly on what `toJSON` returns:
```python
spark.range(1).toJSON().map(lambda value: value + "abc")
```
With this change, it would have to be something like:
```python
spark.range(1).toJSON().selectExpr("concat(value, 'abc')")
```
This is pretty inconsistent with the Scala side. I don't think we can call this
consistent, since a Dataset can itself be a replacement for an RDD but a
DataFrame cannot.
For instance, `DataFrameReader.csv` and `DataFrameReader.json` in Python don't
accept a `DataFrame` as input: where the Scala side takes `Dataset[String]`,
the Python methods take an RDD, not a `DataFrame`.
```python
>>> spark.read.json(spark.range(1).toJSON()).show()
```
```
+---+
| id|
+---+
| 0|
+---+
```
```python
>>> spark.read.json(spark.range(1)).show()
```
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/readwriter.py", line 293, in json
raise TypeError("path can be only string, list or RDD")
TypeError: path can be only string, list or RDD
```
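To make the failure mode concrete without a Spark session, here is a minimal pure-Python sketch of the kind of type dispatch the traceback points at. `StubRDD`, `StubDataFrame`, `read_json_sketch`, and `accepts` are hypothetical stand-ins invented for illustration, not the actual PySpark classes or the real `readwriter.py` code; only the error message is taken from the traceback above.

```python
class StubRDD:
    """Hypothetical stand-in for an RDD of strings: supports map()."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Apply fn to every element and return a new StubRDD.
        return StubRDD(fn(x) for x in self.items)


class StubDataFrame:
    """Hypothetical stand-in for a DataFrame: deliberately has no map()."""

    def __init__(self, rows):
        self.rows = list(rows)


def read_json_sketch(path):
    """Simplified model of a reader that accepts a string, a list, or an RDD."""
    if isinstance(path, str):
        path = [path]
    if isinstance(path, list):
        return "paths"
    if isinstance(path, StubRDD):
        return "rdd"
    # Same message as in the traceback above.
    raise TypeError("path can be only string, list or RDD")


def accepts(value):
    """Return True if read_json_sketch takes the value, False on TypeError."""
    try:
        read_json_sketch(value)
        return True
    except TypeError:
        return False


json_rdd = StubRDD(['{"id":0}'])
json_df = StubDataFrame([('{"id":0}',)])

print(accepts(json_rdd))        # the RDD-shaped result is accepted
print(accepts(json_df))         # the DataFrame-shaped result is rejected
print(hasattr(json_df, "map"))  # existing .map(...) callers would also break
```

Under these assumptions, changing the return type of `toJSON` from the RDD-like object to the DataFrame-like one breaks both callers that chain `map` and callers that feed the result back into a reader.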