Hyukjin Kwon created SPARK-27995:
------------------------------------

             Summary: Note the difference between str of Python 2 and 3 at Arrow optimized toPandas
                 Key: SPARK-27995
                 URL: https://issues.apache.org/jira/browse/SPARK-27995
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Hyukjin Kwon

When Arrow optimization is enabled in Python 2.7,

{code}
import pandas

pdf = pandas.DataFrame(["test1", "test2"])
df = spark.createDataFrame(pdf)
df.show()
{code}

I got the following output:

{code}
+----------------+
|               0|
+----------------+
|[74 65 73 74 31]|
|[74 65 73 74 32]|
+----------------+
{code}

This appears to happen because {{str}} and {{bytes}} are the same type in Python 2, so the values are treated as binary rather than as strings. From that point of view the output does look right:

{code}
>>> str == bytes
True
>>> isinstance("a", bytes)
True
{code}

1. Python 2 treats {{str}} as {{bytes}}.
2. PySpark added some special-case code and hacks to recognize {{str}} as a string type.
3. PyArrow / Pandas follow the Python 2 behaviour, so {{str}} is handled as bytes there.

We might have to match this behaviour to PySpark's, but Python 2 is deprecated anyway. I think it's better to just note the difference.
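To make the output above easier to read, here is a minimal sketch in plain Python (no Spark or Arrow required). It decodes the hex cells printed by {{df.show()}} back into text and shows the {{str}}/{{bytes}} identity check in a way that runs on both Python 2 and 3. The names {{decode_binary_cell}} and {{to_text}} are hypothetical helpers introduced only for illustration; they are not part of PySpark, PyArrow, or pandas.

```python
# Text type that works on both interpreters:
# `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")


def decode_binary_cell(cell):
    """Turn a df.show() binary cell such as '[74 65 73 74 31]' back into text."""
    hex_pairs = cell.strip("[]").split()
    raw = bytes(bytearray(int(pair, 16) for pair in hex_pairs))
    return raw.decode("utf-8")


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, leave other values unchanged."""
    if isinstance(value, bytes) and not isinstance(value, text_type):
        return value.decode(encoding)
    return value


# True on Python 2 (str and bytes are the same type), False on Python 3.
print(str is bytes)

# The cells in the output above are just the UTF-8 bytes of the input strings.
print(decode_binary_cell("[74 65 73 74 31]"))
print(decode_binary_cell("[74 65 73 74 32]"))

# Byte strings become text; other values pass through unchanged.
print(to_text(b"test2") == u"test2")
```

One could apply something like {{to_text}} to each value before calling {{createDataFrame}} on Python 2. Whether that yields a string column under Arrow is an assumption here, not something verified in this issue.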