[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

Li Jin (JIRA) Fri, 28 Jul 2017 11:51:31 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105491#comment-16105491
 ]


Li Jin commented on ARROW-1291:
-------------------------------

My use case is passing a user-provided pandas DataFrame to Spark using Arrow. 
In my particular case, I don't care about the column names in the pandas 
DataFrame because the column names are defined in the Spark schema, so it's 
weird to ask people to write out their column names in pandas just to throw 
them away later.

I think casting numeric column names to strings would be friendlier behavior 
than throwing this exception. My use case is a bit special in that I don't care 
about the column names, so I could do the casting in my code. But I think other 
users might also find the current behavior surprising. 

I agree it's probably not worth it for arrow to preserve the numeric column 
names.

> [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric 
> column names
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-1291
>                 URL: https://issues.apache.org/jira/browse/ARROW-1291
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.5.0
>            Reporter: Li Jin
>            Priority: Minor
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame([1])
> pa.RecordBatch.from_pandas(df)
> {code}
> Exception:
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-5-670ba4a2ddb2> in <module>()
>       3 
>       4 df = pd.DataFrame([1])
> ----> 5 pa.RecordBatch.from_pandas(df)
> table.pxi in pyarrow.lib.RecordBatch.from_pandas()
> table.pxi in pyarrow.lib._dataframe_to_arrays()
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in construct_metadata(df, index_levels, preserve_index, types)
>     187                         arrow_type=arrow_type
>     188                     )
> --> 189                     for name, arrow_type in zip(df.columns, df_types)
>     190                 ] + (
>     191                     [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
>     187                         arrow_type=arrow_type
>     188                     )
> --> 189                     for name, arrow_type in zip(df.columns, df_types)
>     190                 ] + (
>     191                     [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in get_column_metadata(column, name, arrow_type)
>     125         raise TypeError(
>     126             'Column name must be a string. Got column {} of type {}'.format(
> --> 127                 name, type(name).__name__
>     128             )
>     129         )
> TypeError: Column name must be a string. Got column 0 of type int64
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
