[
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105728#comment-16105728
]
Phillip Cloud commented on ARROW-1291:
--------------------------------------
That could work, but then the round-trip conversion is no longer exact.
It seems like the choice is "where should the surprise be?" or maybe "what's
least surprising to users?", and there are three options:
# Leave the behavior as is, and users of arrow need to handle their own input
columns before sending dataframes to arrow. This is the current behavior.
# Add casting to strings in one direction, when the input is a dataframe with
numeric columns. This gives IMO behavior that is more surprising than an error:
when you call {{.to_pandas()}} you get back something different than what you
put in. It's also not easy to tell that it's different by looking at the
dataframe, because the dataframe repr shows the integer 0 and the string "0"
the same way.
# Add enough metadata to preserve the current round-trip behavior.
I favor #1 the most, and #3 if we decide it really is necessary to allow numeric
columns. With #3 we still lose some compatibility with other systems that want
to read and write data that came from dataframes, unless those systems are
willing to handle integer column names.
I think #2 isn't a great option because it results in behavior in the public
API that isn't obvious unless you know something about how both arrow and
pandas work.
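To make the difference between #2 and #3 concrete, here is a minimal sketch in
plain Python, with no arrow or pandas involved. The helpers {{encode_names}} and
{{decode_names}} and the JSON layout are hypothetical, made up for illustration;
they are not arrow's actual metadata format, and only {{int}} and {{str}} names
are handled.

```python
import json

def encode_names(names):
    """Cast names to strings and record each original type (sketch of #3)."""
    encoded = [str(n) for n in names]
    # Hypothetical metadata: one type name per column, stored as JSON.
    meta = json.dumps([type(n).__name__ for n in names])
    return encoded, meta

def decode_names(encoded, meta):
    """Use the recorded type names to undo the string cast."""
    casts = {"int": int, "str": str}  # assumption: only these two types occur
    return [casts[t](n) for n, t in zip(encoded, json.loads(meta))]

original = [0, 1, "a"]

# Option #2: cast in one direction only. The integer names are gone for good.
plain = [str(n) for n in original]
print(plain == original)     # False

# Option #3: carry the type information alongside, so the round trip is exact.
encoded, meta = encode_names(original)
restored = decode_names(encoded, meta)
print(restored == original)  # True
```

With #2 the caller gets back {{['0', '1', 'a']}} and has no way to know that
the first two names used to be integers. With #3 the names come back unchanged,
but any other system reading the data has to understand the extra metadata.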
Additionally, we can't just call {{str}} on every column name and be done. We
have to make further decisions, such as whether to allow mixed string and
integer column names. Maybe that's a red herring and we can just say
"{{Int64Index}} only", but that is still a decision we have to make.
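One reason the mixed case needs a decision: distinct names can collapse to the
same string. A small sketch in plain Python, where {{find_collisions}} is a
hypothetical helper written for this example and not part of arrow or pandas:

```python
from collections import defaultdict

def find_collisions(names):
    """Group column names by their str() form and return the groups that clash."""
    groups = defaultdict(list)
    for name in names:
        groups[str(name)].append(name)
    # Keep only string forms shared by more than one original name.
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

mixed = [0, "0", 1, "a"]
print(find_collisions(mixed))      # {'0': [0, '0']}
print(find_collisions([0, 1, 2]))  # {}
```

A frame whose columns are all integers has no such clash, which is why
restricting the cast to integer-only column indexes sidesteps this particular
problem.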
> [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric
> column names
> --------------------------------------------------------------------------------------
>
> Key: ARROW-1291
> URL: https://issues.apache.org/jira/browse/ARROW-1291
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.5.0
> Reporter: Li Jin
> Priority: Minor
> Fix For: 0.6.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame([1])
> pa.RecordBatch.from_pandas(df)
> {code}
> Exception:
> {code}
> TypeError Traceback (most recent call last)
> <ipython-input-5-670ba4a2ddb2> in <module>()
> 3
> 4 df = pd.DataFrame([1])
> ----> 5 pa.RecordBatch.from_pandas(df)
> table.pxi in pyarrow.lib.RecordBatch.from_pandas()
> table.pxi in pyarrow.lib._dataframe_to_arrays()
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
> in construct_metadata(df, index_levels, preserve_index, types)
> 187 arrow_type=arrow_type
> 188 )
> --> 189 for name, arrow_type in zip(df.columns, df_types)
> 190 ] + (
> 191 [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
> in <listcomp>(.0)
> 187 arrow_type=arrow_type
> 188 )
> --> 189 for name, arrow_type in zip(df.columns, df_types)
> 190 ] + (
> 191 [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
> in get_column_metadata(column, name, arrow_type)
> 125 raise TypeError(
> 126 'Column name must be a string. Got column {} of type
> {}'.format(
> --> 127 name, type(name).__name__
> 128 )
> 129 )
> TypeError: Column name must be a string. Got column 0 of type int64
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)