[jira] [Created] (SPARK-23009) PySpark should not assume Pandas cols are a basestring type

Bryan Cutler (JIRA) Tue, 09 Jan 2018 11:20:25 -0800

Bryan Cutler created SPARK-23009:
------------------------------------

             Summary: PySpark should not assume Pandas cols are a basestring type
                 Key: SPARK-23009
                 URL: https://issues.apache.org/jira/browse/SPARK-23009
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.0
            Reporter: Bryan Cutler



When calling {{SparkSession.createDataFrame}} with a Pandas DataFrame as 
input, Spark assumes that every column label is either a {{str}} or a 
{{unicode}}.  Labels can actually be any hashable type, that is, anything a 
dict can use as a key.  If a label is not a {{basestring}} type, a confusing 
AttributeError is thrown:

{{code}}
In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))

In [17]: pdf
Out[17]: 
          0         1
0  0.145171  0.482940
1  0.151336  0.299861
2  0.220338  0.830133
3  0.001659  0.513787

In [18]: pdf.columns
Out[18]: RangeIndex(start=0, stop=2, step=1)

In [19]: df = spark.createDataFrame(pdf)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-11bcb07e0e39> in <module>()
----> 1 df = spark.createDataFrame(pdf)

/home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    646             # If no schema supplied by user then get the names of columns only
    647             if schema is None:
--> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
    649 
    650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \

AttributeError: 'int' object has no attribute 'encode'
{{code}}
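
The traceback shows the failure comes from calling {{.encode}} on a label that is not a string. As a possible workaround until this is fixed, the labels could be converted to text before the DataFrame is handed to Spark. The sketch below is only an illustration and not the change made in Spark: the helper name {{column_names_as_strings}} is hypothetical, and it assumes Python 3, where {{str}} is the text type and {{bytes}} must be decoded.

```python
def column_names_as_strings(columns):
    """Return every column label as text, whatever hashable type it is.

    Hypothetical helper for illustration; not part of PySpark or Pandas.
    """
    names = []
    for label in columns:
        if isinstance(label, bytes):
            # Byte strings are decoded rather than passed through str(),
            # which would yield "b'...'" on Python 3.
            names.append(label.decode("utf-8"))
        elif isinstance(label, str):
            names.append(label)
        else:
            # Integers, floats, tuples and other hashable labels fall back
            # to their string representation.
            names.append(str(label))
    return names


if __name__ == "__main__":
    # A default RangeIndex of width 2 iterates like range(2).
    print(column_names_as_strings(range(2)))
```

With a Pandas DataFrame, the result could be assigned back with {{pdf.columns = column_names_as_strings(pdf.columns)}} before calling {{spark.createDataFrame(pdf)}}, so that Spark only ever sees string labels.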


