[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929514#comment-16929514
 ] 

Wes McKinney commented on ARROW-5220:
-------------------------------------

Yikes, this is a "grab bag" of potential rough edges. 

I agree with your comment on the PR "I think it would be good to have the rule: 
if a schema is specified, it is the single source of truth about the schema, 
and you can be 100% sure that the resulting table will have this exact schema 
(otherwise an error is raised)".

For your questions in particular:

> are we OK with erroring if the index is not in the schema but would be 
> written as a column? And only if preserve_index=True, or also with 
> preserve_index=None in case the index is not a RangeIndex ?
This will break some current usage (but can probably do it with a deprecation 
first)

Yes, I think this is okay. In a sense, {{preserve_index}} serves to inform the 
final schema. I think if {{preserve_index}} is None and the index is a 
RangeIndex, then we can respect the schema and write the index as metadata.

> We should follow the order of the columns in the schema, also for the index? 
> (currently the index is always appended to the other columns)

It does make the implementation more complex, but I think we should respect the 
schema.
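
To make the ordering point concrete, here is a minimal sketch in plain Python. 
The helper name {{order_fields}} and its arguments are hypothetical and not 
part of pyarrow; it only illustrates the index taking the position the schema 
gives it rather than being appended after the columns.

```python
def order_fields(schema_names, column_names, index_names):
    """Return (name, source) pairs in the order given by the schema.

    Hypothetical helper, not pyarrow code. `source` is "column" or "index"
    and tells the caller where to fetch the data from. If a name is both a
    column and an index level, the column wins (an arbitrary choice here).
    """
    columns = set(column_names)
    index_levels = set(index_names)
    ordered = []
    for name in schema_names:
        if name in columns:
            ordered.append((name, "column"))
        elif name in index_levels:
            ordered.append((name, "index"))
        else:
            # Field requested by the schema but absent from the DataFrame.
            raise KeyError(f"field {name!r} from the schema is not in the DataFrame")
    return ordered


# Schema from the issue description: the index comes first.
print(order_fields(["index", "a", "b"], ["a", "b"], ["index"]))
# [('index', 'index'), ('a', 'column'), ('b', 'column')]
```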

> What if an index is specified in the schema but preserve_index=False ?

In this case we should probably raise an exception. Thoughts?

> What if there are multiple index levels (a MultiIndex), but only one of them 
> is specified in the schema? (in the case of columns, a column that is not 
> in the schema is ignored)

I would say raise an exception in this case. "In the face of ambiguity, refuse 
the temptation to guess".

> What if the index is specified in the schema, but is actually a RangeIndex 
> which would otherwise be serialized as metadata?

I think in that case it should be serialized. Detecting this case adds yet 
more complexity, though =/
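
Putting the answers above together, the proposed checks could be sketched 
roughly as follows. This is plain Python under stated assumptions, not the 
pyarrow implementation: the function name {{index_levels_to_write}} and its 
arguments are hypothetical, and index level names are assumed to be already 
resolved to strings.

```python
def index_levels_to_write(schema_names, index_names, preserve_index, is_range_index):
    """Return the index levels to write as columns, or raise ValueError.

    Hypothetical helper mirroring the rules proposed in this comment.
    An empty list means the index is not written as a column (it is
    either dropped or, for a RangeIndex, stored as metadata only).
    """
    in_schema = [name for name in index_names if name in schema_names]

    if preserve_index is False:
        # Schema asks for the index, but the caller asked to drop it.
        if in_schema:
            raise ValueError(
                f"index level(s) {in_schema} are in the schema but preserve_index=False"
            )
        return []

    # preserve_index=None and a RangeIndex that the schema does not mention:
    # respect the schema and keep the index as metadata only.
    if preserve_index is None and is_range_index and not in_schema:
        return []

    # From here on the index would be written as column(s), so every level
    # must appear in the schema. This also covers a partly listed MultiIndex.
    missing = [name for name in index_names if name not in schema_names]
    if missing:
        raise ValueError(f"index level(s) {missing} are missing from the schema")

    # Includes a RangeIndex that the schema names explicitly: serialize it.
    return in_schema
```

For example, {{index_levels_to_write(["index", "a"], ["index"], None, True)}} 
returns {{["index"]}}, while passing {{preserve_index=False}} with the same 
schema raises.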

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -------------------------------------------------------------------------
>
>                 Key: ARROW-5220
>                 URL: https://issues.apache.org/jira/browse/ARROW-5220
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{Table.from_pandas}} method allows specifying a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>                        ('a', pa.int64()),
>                        ('b', pa.float64()),
>                       ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: unknown columns in general also give this error (columns 
> specified in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (e.g. I noticed this 
> from the code example in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?  
> So before, specifying the index in the schema also "worked" in the sense that 
> it didn't raise an error, but it was ignored, so it didn't actually do what 
> you would expect.
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}} ? So that the example above works (although this might be 
> complicated with RangeIndex that is not serialized any more)
> - But what should we do in general with additional columns in the schema that 
> are not in the DataFrame? Are we fine with continuing to raise an error as it 
> does now (the error message could then be improved)? Or do we want to ignore 
> them again? (Or they could actually be added to the table as all-null columns.)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
