[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

Phillip Cloud (JIRA) Fri, 28 Jul 2017 14:38:30 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105728#comment-16105728
 ]


Phillip Cloud commented on ARROW-1291:
--------------------------------------

That could work, but then the round trip conversion is no longer exact.

The choice seems to come down to "where should the surprise be?" or maybe 
"what's least surprising to users?", and there are three options.

# Leave the behavior as is: users of arrow need to convert their own column 
names before sending dataframes to arrow. This is the current behavior.
# Add casting to strings in one direction, when the input is a dataframe with 
numeric column names. IMO this gives behavior that is more surprising than an 
error: when you call {{.to_pandas()}} you get back something different from 
what you put in. It's also not easy to tell that it's different by looking at 
the dataframe, because the repr prints integer and string labels identically.
# Add enough metadata to preserve the current round-trip behavior.
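
As a rough illustration of why the second option is surprising, here is a 
minimal sketch on a plain Python list of labels rather than a real dataframe. 
The helper name {{cast_labels_to_str}} is hypothetical and is not part of 
pyarrow or pandas.

```python
def cast_labels_to_str(labels):
    """Hypothetical option-2 behavior: cast every column label to a string."""
    return [str(label) for label in labels]


# Integer labels, like the default column labels of pd.DataFrame([[1, 2, 3]]).
original = [0, 1, 2]
stored = cast_labels_to_str(original)

# The labels that come back are strings, so the round trip is not exact.
print(stored == original)  # False

# Printed one by one, the two sets of labels look the same, which is why the
# difference is hard to spot.
print([str(x) for x in stored] == [str(x) for x in original])  # True
```

Under the first option the same conversion happens, but in user code before 
the dataframe is handed to arrow, for example with 
{{df.columns = df.columns.astype(str)}}, so the change is explicit.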

I favor #1 the most, and #3 if we decide it really is necessary to allow 
numeric columns. With #3 we still lose some compatibility with other systems 
that want to read and write data that came from dataframes, unless those 
systems are willing to handle integer column names.
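
One way to picture the third option is sketched below. It is an assumption 
made for illustration, not the metadata layout pyarrow actually uses: the 
{{label_kinds}} key and both function names are hypothetical, and it works on 
plain Python lists of labels.

```python
import json


def encode_labels(labels):
    """Return string labels plus JSON metadata recording each original kind."""
    kinds = []
    for label in labels:
        # Only plain int and str labels are handled in this sketch.
        if isinstance(label, bool) or not isinstance(label, (int, str)):
            raise TypeError('unsupported label: {!r}'.format(label))
        kinds.append('int' if isinstance(label, int) else 'str')
    names = [str(label) for label in labels]
    return names, json.dumps({'label_kinds': kinds})


def decode_labels(names, metadata):
    """Restore the original labels from string names and the metadata."""
    kinds = json.loads(metadata)['label_kinds']
    return [int(name) if kind == 'int' else name
            for name, kind in zip(names, kinds)]


names, metadata = encode_labels([0, 'a', 2])
print(names)                           # ['0', 'a', '2']
print(decode_labels(names, metadata))  # [0, 'a', 2]
```

A reader that ignores the metadata only sees the string names, which is the 
compatibility cost mentioned above.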

I think #2 isn't a great option because it results in behavior in the public 
API that isn't obvious unless you know something about how both arrow and 
pandas work.

Additionally, we can't just call {{str}} on every column name and be done; we 
have to make additional decisions, such as whether to allow mixed string and 
integer column names. Maybe that's a red herring and we can just say 
"{{Int64Index}}es only", but we still have to make that decision as well.
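
The kind of decision involved can be sketched as a small check over the column 
labels. This is a hypothetical policy written against plain Python lists, not 
existing pyarrow behavior, and {{classify_labels}} is an invented name.

```python
def classify_labels(labels):
    """Classify labels as all-string or all-integer; reject anything else."""
    kinds = {type(label) for label in labels}
    if kinds <= {str}:
        return 'string'
    if kinds <= {int}:
        return 'integer'
    # Mixed labels (or other types) are rejected instead of silently cast.
    raise TypeError('mixed or unsupported column label types: {}'.format(
        sorted(kind.__name__ for kind in kinds)))


print(classify_labels(['a', 'b']))  # string
print(classify_labels([0, 1]))      # integer
```

With this policy, a call such as {{classify_labels([0, 'a'])}} raises a 
{{TypeError}}, so the caller has to resolve the mix explicitly.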

> [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric 
> column names
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-1291
>                 URL: https://issues.apache.org/jira/browse/ARROW-1291
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.5.0
>            Reporter: Li Jin
>            Priority: Minor
>             Fix For: 0.6.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame([1])
> pa.RecordBatch.from_pandas(df)
> {code}
> Exception:
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-5-670ba4a2ddb2> in <module>()
>       3 
>       4 df = pd.DataFrame([1])
> ----> 5 pa.RecordBatch.from_pandas(df)
> table.pxi in pyarrow.lib.RecordBatch.from_pandas()
> table.pxi in pyarrow.lib._dataframe_to_arrays()
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
>  in construct_metadata(df, index_levels, preserve_index, types)
>     187                         arrow_type=arrow_type
>     188                     )
> --> 189                     for name, arrow_type in zip(df.columns, df_types)
>     190                 ] + (
>     191                     [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
>  in <listcomp>(.0)
>     187                         arrow_type=arrow_type
>     188                     )
> --> 189                     for name, arrow_type in zip(df.columns, df_types)
>     190                 ] + (
>     191                     [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
>  in get_column_metadata(column, name, arrow_type)
>     125         raise TypeError(
>     126             'Column name must be a string. Got column {} of type {}'.format(
> --> 127                 name, type(name).__name__
>     128             )
>     129         )
> TypeError: Column name must be a string. Got column 0 of type int64
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
