sven-weber-db commented on code in PR #55716:
URL: https://github.com/apache/spark/pull/55716#discussion_r3201466724


##########
python/pyspark/serializers.py:
##########
@@ -539,7 +541,7 @@ def loads(self, stream):
         elif length == SpecialLengths.NULL:
             return None
         s = stream.read(length)
-        return s.decode("utf-8") if self.use_unicode else s
+        return codecs.decode(s, "utf-8") if self.use_unicode else s

Review Comment:
   This change is required because we use the `ZeroCopyByteStream` class now to 
read the initial message. `ZeroCopyByteStream` returns `memoryview` objects 
instead of `bytes` or `bytearray`. `s.decode` is only implemented for `s` ∈ 
`{bytes, bytearray}`. 
   
   According to the Python documentation, for 
[`bytes.decode`](https://docs.python.org/3/library/stdtypes.html#bytes.decode) 
and 
[`codecs.decode`](https://docs.python.org/3/library/codecs.html#codecs.decode) 
the later implementation may behave different in case of errors.
   
   
[`bytes.decode`](https://docs.python.org/3/library/stdtypes.html#bytes.decode) 
states that
   
   > errors controls how decoding errors are handled. If 'strict' (the 
default), a 
[UnicodeError](https://docs.python.org/3/library/exceptions.html#UnicodeError) 
exception is raised.
   
   
[`codecs.decode`](https://docs.python.org/3/library/codecs.html#codecs.decode) 
states that
   
   > The default error handler is 'strict' meaning that decoding errors raise 
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) (or 
a more codec specific subclass, such as 
[UnicodeDecodeError](https://docs.python.org/3/library/exceptions.html#UnicodeDecodeError))
   
   As decoding errors are unexpected and the specific Exception type that is 
thrown should not matter, I believe this change is safe. However, if there are 
concerns we can change this implementation to the following:
   
   ```python
   return bytes(s).decode("utf-8") if self.use_unicode else s
   ```
   
   This alternative implementation will invoke a memory copy to copy the 
`memoryview` contents into a `bytes` object before decoding. Given that this 
call is only invoked on deserialization of the initial message, this memory 
copy is probably acceptable as it is a one time cost. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to