zero323 commented on a change in pull request #30322:
URL: https://github.com/apache/spark/pull/30322#discussion_r521355023



##########
File path: python/pyspark/sql/column.py
##########
@@ -906,7 +906,7 @@ def __nonzero__(self):
     __bool__ = __nonzero__
 
     def __repr__(self):
-        return 'Column<%s>' % self._jc.toString().encode('utf8')
+        return "Column<'%s'>" % self._jc.toString()

Review comment:
       > Do we have any more instances of decode()?
   
   We do a bit of encoding / decoding when we communicate with the JVM, but the 
purpose there is clear.
   
   The only other place where we encode strings intended for user consumption is 
`RDD.toDebugString`. It is also something that could be fixed, as it messes with 
the output a bit (`print` won't respect line breaks).
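
   To illustrate the line-break problem, here is a minimal sketch in plain 
Python 3, with no Spark involved. The `debug_text` value is a made-up stand-in 
for a multi-line debug string, not real `RDD.toDebugString` output.

```python
# Hypothetical stand-in for a multi-line debug string (not real Spark output).
debug_text = "(2) PythonRDD[1]\n |  ParallelCollectionRDD[0]"

# Encoding turns the text into a bytes object in Python 3.
encoded = debug_text.encode("utf8")

# print() shows a bytes object via its repr, so the newline is displayed
# as the two characters backslash + n instead of an actual line break.
as_bytes = repr(encoded)

# Decoding back to str restores the real line break.
as_text = encoded.decode("utf8")

print(as_bytes)
print(as_text)
```

   The first `print` stays on one line with a `b'...'` wrapper; the second one 
breaks across two lines as intended.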
   
   > Seems fine. Is it originally for non printable characters in unicode?
   
   I believe the point was to have a `str` object as the output instead of 
`unicode`. If I recall correctly, a `unicode` value returned from `__repr__` 
wasn't handled correctly (py4j returns JVM `String`s as `unicode` in Python 2, 
and as a result the whole expression would evaluate to `unicode`).
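
   For comparison, a small sketch of how the two format strings from the diff 
behave in Python 3. The `jvm_string` value is a hypothetical stand-in for 
whatever `self._jc.toString()` returns; no JVM or py4j is involved here.

```python
# Hypothetical stand-in for the value returned by self._jc.toString().
jvm_string = "a + 1"

# Old form: in Python 3, encode() yields bytes, and %s formats bytes
# through their repr, so a b'...' wrapper leaks into the output.
old_repr = 'Column<%s>' % jvm_string.encode('utf8')

# New form: the str is formatted directly, with explicit quotes.
new_repr = "Column<'%s'>" % jvm_string

print(old_repr)  # Column<b'a + 1'>
print(new_repr)  # Column<'a + 1'>
```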




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


