[ https://issues.apache.org/jira/browse/ARROW-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman reassigned ARROW-8087:
-----------------------------------
Assignee: Ben Kietzman
> [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema
> ------------------------------------------------------------------------------
>
> Key: ARROW-8087
> URL: https://issues.apache.org/jira/browse/ARROW-8087
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++ - Dataset
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 0.17.0
>
>
> Currently, when reading a partitioned dataset with hive partitioning, the
> partition columns appear to be sorted alphabetically when they are appended
> to the schema, whereas the old ParquetDataset implementation keeps the
> order in which they appear in the paths.
> For a regular partitioning, this path order is consistent across all fragments.
> For example, with the typical NYC Taxi data, the datasets API produces a
> schema ending in the columns "month, year", while ParquetDataset appends them
> as "year, month".
> Python example:
> {code}
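> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
>     'foo': np.array(foo_keys, dtype='i4').repeat(15),
>     'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
>     'values': np.random.randn(N)
> })
> # Write a hive-partitioned dataset, partitioned by 'foo' first, then 'bar'
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])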
> {code}
> {code}
> >>> pq.read_table("test_order").schema
> values: double
> foo: dictionary<values=int64, indices=int32, ordered=0>
> bar: dictionary<values=string, indices=int32, ordered=0>
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
> values: double
> bar: string
> foo: int32
> {code}
> So the order is "foo, bar" versus "bar, foo" (the fact that they are
> dictionaries in the first case is a separate matter).
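
The difference described above can be illustrated without Arrow at all. The following is a minimal sketch, not the Arrow implementation: `hive_keys_in_path_order` is a hypothetical helper, and the file name `part-0.parquet` is an assumed placeholder. It collects the `key=value` segments of a hive-style path in the order they occur, and compares that with alphabetical sorting.

```python
def hive_keys_in_path_order(path):
    """Return hive-style partition keys in the order they appear in `path`.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    keys = []
    for segment in path.split("/"):
        key, sep, _value = segment.partition("=")
        # Only segments of the form key=value count as partition fields.
        if sep and key not in keys:
            keys.append(key)
    return keys


# Assumed example paths, mirroring partition_cols=['foo', 'bar'] above.
paths = [
    "test_order/foo=0/bar=a/part-0.parquet",
    "test_order/foo=1/bar=c/part-0.parquet",
]

path_order = hive_keys_in_path_order(paths[0])
print(path_order)          # ['foo', 'bar']  (order present in the paths)
print(sorted(path_order))  # ['bar', 'foo']  (alphabetical order)
```

For a regular partitioning, every path yields the same key order, so the first path is enough to recover it.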
--
This message was sent by Atlassian Jira
(v8.3.4#803005)