Bryan Cutler created SPARK-23009:
------------------------------------

             Summary: PySpark should not assume Pandas cols are a basestring type
                 Key: SPARK-23009
                 URL: https://issues.apache.org/jira/browse/SPARK-23009
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.0
            Reporter: Bryan Cutler

When calling {{SparkSession.createDataFrame}} with a Pandas DataFrame as input, Spark assumes that every column label is either a {{str}} or a {{unicode}}. In fact, a label can be any hashable object (anything a dict can key off of), such as the integers Pandas assigns by default. If a label is not a {{basestring}} type, a confusing AttributeError is raised:

{{code}}
In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))

In [17]: pdf
Out[17]:
          0         1
0  0.145171  0.482940
1  0.151336  0.299861
2  0.220338  0.830133
3  0.001659  0.513787

In [18]: pdf.columns
Out[18]: RangeIndex(start=0, stop=2, step=1)

In [19]: df = spark.createDataFrame(pdf)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-11bcb07e0e39> in <module>()
----> 1 df = spark.createDataFrame(pdf)

/home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    646             # If no schema supplied by user then get the names of columns only
    647             if schema is None:
--> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
    649
    650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \

AttributeError: 'int' object has no attribute 'encode'
{{code}}
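
One possible way to avoid the error is to convert each non-string label with {{str()}} instead of calling {{.encode()}} on it. This is a sketch under stated assumptions, not necessarily the fix that will be applied. The helper name {{column_names_as_str}} is hypothetical and not part of PySpark, and the sketch targets Python 3 only, so the Python 2 {{unicode}} case that the current code encodes is not handled here.

```python
def column_names_as_str(columns):
    """Hypothetical helper (not a PySpark API): return column labels as strings.

    Labels that are already str pass through unchanged; any other label
    (int, float, tuple, ...) is converted with str() instead of .encode().
    """
    return [col if isinstance(col, str) else str(col) for col in columns]


# Integer labels, like those a default RangeIndex yields when iterated
print(column_names_as_str([0, 1]))

# Mixed labels: strings are left alone, the rest are stringified
print(column_names_as_str(["a", 1, 2.5]))
```

Until the issue is resolved, a user-side workaround along the same lines may be to make the labels strings before the call, for example {{pdf.columns = pdf.columns.astype(str)}}.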