[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498183#comment-16498183 ]
Aldrin commented on ARROW-2659:
-------------------------------
It seems to me that the following locations contain the C++ code underlying
the Cython code:
* [arrow::Table in
cpp/src/arrow/table.cc|https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/cpp/src/arrow/table.cc#L436]
* [arrow::Schema in
cpp/src/arrow/type.cc|https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/cpp/src/arrow/type.cc#L291]
but since I don't actually know Cython, I can't tell whether this is the code
that is eventually invoked when calling
[pyarrow.table.pxi:concat_tables()|https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/table.pxi#L1384].
> [Python] More graceful reading of empty String columns in ParquetDataset
> ------------------------------------------------------------------------
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: beginner
> Fix For: 0.11.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt,
> read_parquet_dataset.error.read_table.txt
>
>
> Currently, when saving a {{ParquetDataset}} from Pandas, we don't get
> consistent schemas, even if the source was a single DataFrame. This is because
> object columns such as strings can become empty in some partitions, and the
> resulting Arrow schema then differs. In the central metadata, we store this
> column as {{pa.string}}, whereas in the partition file with the empty column,
> it is stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution and we
> should respect that in
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
> Instead of doing a {{pa.Schema.equals}} in
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
> we should introduce a new method {{pa.Schema.can_evolve_to}} that is more
> graceful and returns {{True}} if a dataset piece has a null column where the
> main metadata states a nullable column of any type.
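
The check proposed in the description above could look roughly like the following. This is only a minimal sketch of the rule as worded there, not an existing pyarrow API: {{Field}} is a hypothetical stand-in for an Arrow field (type names are plain strings), and {{can_evolve_to}} is written as a free function rather than a method on {{pa.Schema}}.

```python
from typing import NamedTuple, Sequence


class Field(NamedTuple):
    """Hypothetical stand-in for an Arrow field; not a pyarrow class."""
    name: str
    type: str            # type name, e.g. "string", "int64" or "null"
    nullable: bool = True


def can_evolve_to(piece: Sequence[Field], target: Sequence[Field]) -> bool:
    """Return True if `piece` matches `target`, treating a null-typed column
    in `piece` as compatible with a nullable `target` column of any type."""
    if len(piece) != len(target):
        return False
    for p, t in zip(piece, target):
        if p.name != t.name:
            return False
        if p.type == t.type and p.nullable == t.nullable:
            continue  # identical field, as a strict equality check would accept
        if p.type == "null" and t.nullable:
            continue  # empty (all-null) column in the piece, nullable in metadata
        return False
    return True


# Central metadata vs. a partition whose string column was entirely empty.
metadata = [Field("id", "int64"), Field("name", "string")]
empty_piece = [Field("id", "int64"), Field("name", "null")]

print(can_evolve_to(empty_piece, metadata))
print(can_evolve_to(metadata, empty_piece))
```

The relation is deliberately one-directional: a piece with a null column can evolve to the metadata schema, but a typed column is not accepted where the target declares a null column.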
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)