[jira] [Assigned] (ARROW-8087) [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema

Ben Kietzman (Jira) Thu, 12 Mar 2020 05:17:11 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ben Kietzman reassigned ARROW-8087:
-----------------------------------

    Assignee: Ben Kietzman

> [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-8087
>                 URL: https://issues.apache.org/jira/browse/ARROW-8087
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>             Fix For: 0.17.0
>
>
> Currently, when reading a partitioned dataset with hive partitioning, the 
> partition columns appear to be sorted alphabetically when they are appended 
> to the schema (whereas the old ParquetDataset implementation keeps the 
> order in which they appear in the paths).  
> For a regular partitioning, the order in the paths is the same for all fragments.
> So, for example, for the typical NYC Taxi data, the schema produced by the 
> datasets API ends with columns "month, year", while ParquetDataset appends 
> them as "year, month".
> Python example:
> {code}
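> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> # 30 rows: 'foo' takes each key 15 times, 'bar' cycles through its 3 keys
> df = pd.DataFrame({
>     'foo': np.array(foo_keys, dtype='i4').repeat(15),
>     'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
>     'values': np.random.randn(N)
> })
> # partition first by 'foo', then by 'bar'
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])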
> {code}
> {code}
> >>> pq.read_table("test_order").schema
> values: double
> foo: dictionary<values=int64, indices=int32, ordered=0>
> bar: dictionary<values=string, indices=int32, ordered=0>
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
> values: double
> bar: string
> foo: int32
> {code}
> so "foo, bar" vs "bar, foo" (the fact that the ParquetDataset columns are 
> dictionaries is a separate matter)
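
The ordering difference described above can be illustrated without Arrow at all. The following is a minimal plain-Python sketch, not the C++ HivePartitioning implementation: the helper name `hive_keys_in_path_order` and the file name `part-0.parquet` are hypothetical, chosen only for illustration. It collects the keys of a hive-style path in the order the segments appear and compares that with the alphabetically sorted order.

```python
from pathlib import PurePosixPath


def hive_keys_in_path_order(path):
    """Return the partition keys of a hive-style path in path order.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    keys = []
    for segment in PurePosixPath(path).parts:
        # A hive-style segment looks like "key=value"; other segments are skipped.
        key, sep, _value = segment.partition("=")
        if sep and key and key not in keys:
            keys.append(key)
    return keys


# Hypothetical file path inside the "test_order" dataset from the example above.
path = "test_order/foo=0/bar=a/part-0.parquet"

path_order = hive_keys_in_path_order(path)
sorted_order = sorted(path_order)

print(path_order)    # ['foo', 'bar']  (order as present in the path)
print(sorted_order)  # ['bar', 'foo']  (alphabetical order, as in the reported schema)
```

The first list matches the "foo, bar" order that ParquetDataset reports, the second the "bar, foo" order seen with the datasets API.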



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
