[
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929514#comment-16929514
]
Wes McKinney commented on ARROW-5220:
-------------------------------------
Yikes, this is a "grab bag" of potential rough edges.
I agree with your comment on the PR: "I think it would be good to have the
rule: if a schema is specified, it is the single source of truth about the
schema, and you can be 100% sure that the resulting table will have this exact
schema (otherwise an error is raised)"
For your questions in particular
> are we OK with erroring if the index is not in the schema but would be
> written as a column? And only if preserve_index=True, or also with
> preserve_index=None in case the index is not a RangeIndex ?
This will break some current usage (but we can probably do it with a
deprecation first).
Yes, I think this is okay. In a sense {{preserve_index}} has the function of
informing the final schema. I think if {{preserve_index}} is None and the index
is a RangeIndex, then we can respect the schema and write the index as metadata.
> We should follow the order of the columns in the schema, also for the index?
> (currently the index is always appended to the other columns)
It does make the implementation more complex, but I think we should respect the
schema
> What if an index is specified in the schema but preserve_index=False ?
Probably in this case we should raise an exception. Thoughts?
> What if there are multiple index levels (a MultiIndex), but only one of them
> is specified in the schema? (in the case of columns, a column that is not in
> the schema is ignored)
I would say raise an exception in this case. "In the face of ambiguity, refuse
the temptation to guess"
> What if the index is specified in the schema, but is actually a RangeIndex
> which would otherwise be serialized as metadata?
I think in that case it should be serialized. Detecting this case is yet more
complexity though =/
> [Python] index / unknown columns in specified schema in Table.from_pandas
> -------------------------------------------------------------------------
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The {{Table.from_pandas}} method allows you to specify a schema ("This can
> be used to indicate the type of columns if we cannot infer it
> automatically."). But if you also want to specify the type of the index, you
> get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
> ('a', pa.int64()),
> ('b', pa.float64()),
> ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names"
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work:
> {{table1 = pa.Table.from_pandas(df1); table2 = pa.Table.from_pandas(df2,
> schema=table1.schema)}}
> Extra note: unknown columns in general also give this error (columns
> specified in the schema that are not in the DataFrame).
> At least in pyarrow 0.11 this did not give an error (e.g. I noticed this
> from the example code in ARROW-3861). So before, unknown columns in the
> specified schema were ignored, while now they raise an error. Was this a
> conscious change?
> So specifying the index in the schema also "worked" before, in the sense
> that it didn't raise an error, but it was ignored, so it didn't actually do
> what you would expect.
> Questions:
> - I think we should support specifying the index in the passed {{schema}},
> so that the example above works (although this might be complicated by a
> RangeIndex that is no longer serialized as a column).
> - But what to do in general with additional columns in the schema that are
> not in the DataFrame? Are we fine with keeping the current behavior of
> raising an error (the error message could be improved, then)? Or do we want
> to ignore them again? (Or they could actually be added to the table as
> all-null columns.)
--
This message was sent by Atlassian Jira
(v8.3.2#803003)