HyukjinKwon opened a new pull request #25749: [SPARK-27995][PYTHON] Allows 
createDataFrame to accept bytes as binary type
URL: https://github.com/apache/spark/pull/25749
 
 
   ### What changes were proposed in this pull request?
   
   This PR proposes to allow `bytes` as an acceptable input for binary type (`BinaryType`) in `createDataFrame`.
   
   ### Why are the changes needed?
   
   `bytes` is the standard type for binary data in Python. This should be respected on the PySpark side.
   
   ### Does this PR introduce any user-facing change?
   
   Yes. _When the specified type is binary_, `bytes` is now accepted as a binary value. Previously this was not allowed in either Python 2 or Python 3, as shown below.
   
   In Python 3:
   
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
        rdd, schema = self._createFromLocal(map(prepare, data), schema)
      File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
        data = list(data)
      File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
        verify_func(obj)
      File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
        verifier(v)
      File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
        verify_acceptable_types(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types
        % (dataType, obj, type(obj))))
    TypeError: field col: BinaryType can not accept object b'abcd' in type <class 'bytes'>
    ```
   
   In Python 2:
   
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
        rdd, schema = self._createFromLocal(map(prepare, data), schema)
      File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
        data = list(data)
      File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
        verify_func(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
        verifier(v)
      File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
        verify_value(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
        verify_acceptable_types(obj)
      File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types
        % (dataType, obj, type(obj))))
    TypeError: field col: BinaryType can not accept object 'abcd' in type <type 'str'>
    ```
   
   Since these cases previously failed with a `TypeError`, this change won't break any existing code.
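The gist of the change can be sketched in plain Python. Note this is an illustrative sketch, assuming Python 3; the names `_acceptable_types` and `verify_binary` are hypothetical and are not Spark's actual internals (the real check lives in `pyspark/sql/types.py`).

```python
# Illustrative sketch of the acceptance check; `_acceptable_types` and
# `verify_binary` are hypothetical names, not Spark's actual internals.

# Before this PR, only bytearray was accepted for binary type;
# the fix adds bytes to the set of acceptable types.
_acceptable_types = {
    "binary": (bytearray, bytes),  # `bytes` is newly accepted
}

def verify_binary(obj, name="col"):
    """Raise TypeError if obj is not an acceptable binary value."""
    if not isinstance(obj, _acceptable_types["binary"]):
        raise TypeError(
            "field %s: BinaryType can not accept object %r in type %s"
            % (name, obj, type(obj)))

verify_binary(b"abcd")             # accepted after this PR
verify_binary(bytearray(b"abcd"))  # accepted before and after
```

With this in place, a value like `b'abcd'` passes verification instead of raising the `TypeError` shown in the tracebacks above, while non-binary values such as `str` are still rejected.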
   
   ### How was this patch tested?
   
   Unit tests were added, and the change was also verified manually as below:
   
    ```bash
    ./run-tests --python-executables=python2,python3 --testnames "pyspark.sql.tests.test_serde"
    ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
