[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

Phillip Cloud (JIRA) Fri, 28 Jul 2017 10:04:29 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105308#comment-16105308
 ]


Phillip Cloud commented on ARROW-1291:
--------------------------------------

I'm -1 on allowing numeric column names, since that adds what is IMO an 
unnecessary coupling to pandas semantics. With such a change, any tool that 
wants to read data out of an arrow array must consider the origin of the 
data's column names, and can no longer assume that the columns in the schema 
are always a simple list of strings. I don't think it's easy to make this 
behavior transparent to tools that use arrow, while OTOH a list of strings is 
easy to deal with in pretty much any system that arrow is, or will be, a part 
of.

Since this is really only useful when doing pandas -> arrow -> pandas, and 
users of pandas can already refer to columns by positional index with 
{{.iloc}}, I'm not convinced we should allow this.

I think adding metadata for indexes has less far-reaching effects because it's 
an optional feature of pandas that isn't a core part of arrow, while column 
names are non-negotiable.

I don't think it's too much to ask people to explicitly write out their column 
names as strings.
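
For illustration, here is a minimal sketch of what writing the names out as 
strings could look like on the user side. The helper name 
{{stringify_column_names}} is made up for this example and is not part of 
pyarrow or pandas; it only assumes that each column label has a sensible 
{{str()}} form.

```python
def stringify_column_names(names):
    """Return the column labels as a list of str (hypothetical helper).

    Labels that are already strings are kept as-is; anything else
    (e.g. the default integer labels pandas assigns) goes through str().
    """
    return [name if isinstance(name, str) else str(name) for name in names]


# Intended use with the reproducer from the issue (not executed here):
#   df = pd.DataFrame([1])
#   df.columns = stringify_column_names(df.columns)
#   pa.RecordBatch.from_pandas(df)
print(stringify_column_names([0, 1, "price"]))
```

In pandas itself, {{df.columns = df.columns.astype(str)}} should have the same 
effect for simple labels, so the helper is mostly there to make the intent 
explicit.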

I *am* willing to be convinced though :)

> [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric 
> column names
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-1291
>                 URL: https://issues.apache.org/jira/browse/ARROW-1291
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.5.0
>            Reporter: Li Jin
>            Priority: Minor
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame([1])
> pa.RecordBatch.from_pandas(df)
> {code}
> Exception:
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-5-670ba4a2ddb2> in <module>()
>       3 
>       4 df = pd.DataFrame([1])
> ----> 5 pa.RecordBatch.from_pandas(df)
> table.pxi in pyarrow.lib.RecordBatch.from_pandas()
> table.pxi in pyarrow.lib._dataframe_to_arrays()
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in construct_metadata(df, index_levels, preserve_index, types)
>     187                         arrow_type=arrow_type
>     188                     )
> --> 189                     for name, arrow_type in zip(df.columns, df_types)
>     190                 ] + (
>     191                     [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
>     187                         arrow_type=arrow_type
>     188                     )
> --> 189                     for name, arrow_type in zip(df.columns, df_types)
>     190                 ] + (
>     191                     [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in get_column_metadata(column, name, arrow_type)
>     125         raise TypeError(
>     126             'Column name must be a string. Got column {} of type {}'.format(
> --> 127                 name, type(name).__name__
>     128             )
>     129         )
> TypeError: Column name must be a string. Got column 0 of type int64
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
