[
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Leandro Ferrado updated SPARK-11758:
------------------------------------
Comment: was deleted
(was: Hi Holden. First, I would add a single line to avoid the bad conversion
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a
Date column into a LongInt column). The idea is to first convert all columns
to string types, so that DataFrame.to_records(index=False) doesn't make bad
conversions of datetime.datetime objects. However, that can be done only if we
define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we
don't define a schema (in that case, the function creates a schema of
strings). So the modification applies only under the condition
'schema is None', and the snippet would be:
-------
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        # begin if clause #
        schema = [str(x) for x in data.columns]
        # Convert all fields to strings because we don't have a defined schema
        data = data.astype(str)
        # end if clause #
    data = [r.tolist() for r in data.to_records(index=False)]
-------
If the schema contains temporal types (e.g. TimestampType() or DateType()),
a prior conversion is needed from Python datetime.datetime objects to a format
suitable for pyspark DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in
which a per-row index is needed on a DataFrame. So that argument may be fine
in the function; I'm not sure.)
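As a rough illustration of the prior conversion mentioned in the comment, here
is a minimal, stdlib-only Python sketch that turns the long integers back into
datetime.datetime objects. It assumes the integers are nanoseconds since the
Unix epoch (which is how NumPy datetime64[ns] values come out of tolist()) and
that the timestamps are naive UTC. The names ns_to_datetime and
restore_timestamps are hypothetical helpers, not part of PySpark or pandas.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def ns_to_datetime(value):
    """Convert epoch nanoseconds to a naive datetime; pass other values through."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        # datetime only has microsecond resolution, so sub-microsecond
        # digits are dropped here.
        return EPOCH + timedelta(microseconds=value // 1000)
    return value


def restore_timestamps(rows, timestamp_positions):
    """Return new rows with the given column positions converted to datetime."""
    return [
        tuple(
            ns_to_datetime(v) if i in timestamp_positions else v
            for i, v in enumerate(row)
        )
        for row in rows
    ]


# Rows shaped like the output of [r.tolist() for r in data.to_records(index=False)],
# where column 1 was originally a timestamp column (assumed, for illustration).
rows = [("a", 1447632000000000000), ("b", 1447718400000000000)]
fixed = restore_timestamps(rows, {1})
```

The column positions to convert would have to come from the schema (the
fields declared as TimestampType() or DateType()); that lookup is left out of
this sketch.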
> Missing Index column while creating a DataFrame from Pandas
> ------------------------------------------------------------
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
> Reporter: Leandro Ferrado
> Priority: Minor
> Original Estimate: 5h
> Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked on a
> pandas.DataFrame with a 'schema' of StructFields, the function
> _createFromLocal() converts the pandas.DataFrame but ignores two things:
> - The index column, because of the flag index=False.
> - Timestamp records, because a Date column can't be the index and Pandas
> doesn't convert its records to a Timestamp type.
> So, converting a DataFrame from Pandas to SQL works poorly in scenarios with
> temporal records.
> Doc:
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from an list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>     # ...