[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863260#comment-16863260
 ] 

Benjamin Kietzman edited comment on ARROW-5220 at 6/13/19 4:49 PM:
-------------------------------------------------------------------

I agree that the schema should be the single point of truth and it seems most 
reasonable to raise an error when a field in the schema does not correspond to 
a column in the DataFrame.

Would it be an acceptable solution to require {{preserve_index=True}} when 
specifying the type of the index with a schema?

I.e., ensure the following works as you intend (it currently fails):
{code}
table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)
{code}

This would disambiguate when a name should be removed from the schema because 
it refers to the dataframe's index.
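As a rough illustration of that disambiguation rule, here is a minimal sketch in plain Python. It is not pyarrow's implementation: {{split_schema_fields}} and its parameters are hypothetical names, and it only works on lists of names rather than on real schemas or DataFrames. A schema name may match the index only when {{preserve_index=True}}; any other unmatched name raises an error.

```python
def split_schema_fields(schema_names, column_names, index_names, preserve_index):
    """Hypothetical helper: partition schema field names into columns and index levels.

    Raises KeyError for any schema name that matches neither a column nor
    (when preserve_index is True) an index level.
    """
    columns = set(column_names)
    # Index names are only considered when the caller explicitly keeps the index.
    index = set(index_names) if preserve_index else set()

    column_fields, index_fields = [], []
    for name in schema_names:
        if name in columns:
            # A name that is both a column and an index level is treated as a column.
            column_fields.append(name)
        elif name in index:
            index_fields.append(name)
        else:
            raise KeyError(
                "schema field {!r} does not correspond to a DataFrame column".format(name)
            )
    return column_fields, index_fields


# Names taken from the example in the issue description.
column_fields, index_fields = split_schema_fields(
    ["index", "a", "b"],
    column_names=["a", "b"],
    index_names=["index"],
    preserve_index=True,
)
```

With {{preserve_index=False}} the same call raises {{KeyError}} for 'index', which matches the "schema is the single point of truth" behaviour described above.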



> [Python] index / unknown columns in specified schema in Table.from_pandas
> -------------------------------------------------------------------------
>
>                 Key: ARROW-5220
>                 URL: https://issues.apache.org/jira/browse/ARROW-5220
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Minor
>
> The {{Table.from_pandas}} method allows specifying a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>                        ('a', pa.int64()),
>                        ('b', pa.float64()),
>                       ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> This gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find a column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: unknown columns in general also give this error (columns specified 
> in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (e.g. I noticed this from 
> the example code in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?
> So before, specifying the index in the schema also "worked" in the sense that 
> it didn't raise an error, but it was ignored, so it didn't actually do what 
> you would expect.
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}}, so that the example above works (although this might be 
> complicated with a RangeIndex, which is no longer serialized).
> - But what should we do in general with additional columns in the schema that 
> are not in the DataFrame? Are we fine to keep raising an error as is done now 
> (the error message could then be improved)? Or do we want to ignore them 
> again? (Or it could also add them to the table as all-null columns.)
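The three options raised in the last question (raise, ignore, or fill with nulls) can be compared with a small sketch in plain Python. This is only an illustration under simplifying assumptions: {{align_to_schema}} and its {{on_missing}} parameter are hypothetical names that do not exist in pyarrow, and the data is a dict of lists rather than a DataFrame.

```python
def align_to_schema(data, schema_names, on_missing="raise"):
    """Hypothetical helper: select columns of `data` in schema order.

    data         -- dict mapping column name to a list of values
    schema_names -- field names requested by the schema
    on_missing   -- "raise", "ignore" or "null" for names absent from `data`
    """
    if on_missing not in ("raise", "ignore", "null"):
        raise ValueError("on_missing must be 'raise', 'ignore' or 'null'")

    # All columns are assumed to have the same length.
    n_rows = len(next(iter(data.values()))) if data else 0

    result = {}
    for name in schema_names:
        if name in data:
            result[name] = data[name]
        elif on_missing == "raise":
            raise KeyError("schema field {!r} is not in the data".format(name))
        elif on_missing == "null":
            # Add the unknown column as all nulls.
            result[name] = [None] * n_rows
        # on_missing == "ignore": silently skip the unknown name.
    return result


data = {"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]}
schema_names = ["a", "b", "extra"]

ignored = align_to_schema(data, schema_names, on_missing="ignore")
filled = align_to_schema(data, schema_names, on_missing="null")
```

Here "ignore" corresponds to the pyarrow 0.11 behaviour described in the issue and "raise" to the current one; "null" is the third alternative mentioned in the question.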



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
