Leandro Ferrado created SPARK-11758:
---------------------------------------

             Summary: Missing Index column while creating a DataFrame from 
Pandas 
                 Key: SPARK-11758
                 URL: https://issues.apache.org/jira/browse/SPARK-11758
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.5.1
         Environment: Linux Debian, PySpark, in local testing.
            Reporter: Leandro Ferrado
            Priority: Minor


In PySpark's SQLContext, when createDataFrame() is called with a 
pandas.DataFrame and a 'schema' built from StructFields, the function 
_createFromLocal() converts the pandas.DataFrame but ignores two things:
- The index column, because to_records() is called with index=False.
- Timestamp records, because a date column cannot be kept as the index and 
Pandas does not convert its records to a Timestamp type.
As a result, converting a DataFrame from Pandas to Spark SQL works poorly for 
data with temporal records.

Doc: 
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html

Affected code:

def _createFromLocal(self, data, schema):
    """
    Create an RDD for DataFrame from an list or pandas.DataFrame, returns
    the RDD and schema.
    """
    if has_pandas and isinstance(data, pandas.DataFrame):
        if schema is None:
            schema = [str(x) for x in data.columns]
        data = [r.tolist() for r in data.to_records(index=False)]  # HERE
    # ...
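
To illustrate the behaviour outside Spark, here is a minimal pandas-only 
sketch. The sample frame, the index name "ts" and the variable names are 
made up for illustration; the reset_index() step is only one possible 
workaround on the caller's side, not a proposed patch to _createFromLocal().

```python
import pandas as pd

# Hypothetical sample data: a small frame indexed by timestamps.
pdf = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.to_datetime(["2015-11-01", "2015-11-02"]),
)
pdf.index.name = "ts"

# Same conversion as in _createFromLocal(): the index is dropped,
# so each row only carries the "value" column.
rows_without_index = [r.tolist() for r in pdf.to_records(index=False)]

# Possible workaround (assumption, not the Spark implementation):
# promote the index to an ordinary column before converting.
flat = pdf.reset_index()
columns = [str(c) for c in flat.columns]
# itertuples() yields pandas.Timestamp objects for the datetime column.
rows_with_index = [tuple(row) for row in flat.itertuples(index=False)]
```

With this approach the caller would pass 'flat' (or the rows plus 'columns') 
to createDataFrame(), so the former index travels as a regular column.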




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
