Joris Van den Bossche created ARROW-7702:
--------------------------------------------

             Summary: [C++][Dataset] Provide (optional) deterministic order of batches
                 Key: ARROW-7702
                 URL: https://issues.apache.org/jira/browse/ARROW-7702
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++ - Dataset, Python
            Reporter: Joris Van den Bossche


Example with Python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)}) 
pq.write_table(table, "test_chunks.parquet", chunk_size=3) 

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}

This gives a non-deterministic result (the order of the row groups in the Parquet file changes between calls):

```
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]: 
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]: 
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7

```
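
To illustrate what an (optional) deterministic order could look like, here is a minimal standard-library sketch. It is not the Arrow implementation. It assumes, without confirmation from this report, that the reordering comes from row groups being read concurrently and collected in completion order. `ROW_GROUPS`, `read_row_group` and `scan_ordered` are hypothetical names standing in for the 12-row file above; the idea is to tag each result with its row-group index and reassemble by that index.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the file above: 12 rows, 3 rows per row group.
ROW_GROUPS = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]


def read_row_group(index):
    """Hypothetical reader: return the rows of a single row group."""
    return ROW_GROUPS[index]


def scan_ordered(num_row_groups, max_workers=4):
    """Read row groups in parallel, then reassemble them in file order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which row group each future belongs to.
        futures = {
            pool.submit(read_row_group, i): i for i in range(num_row_groups)
        }
        results = {}
        # Completion order is arbitrary, so store results by index.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    # Concatenate by row-group index rather than by completion order.
    return [row for i in range(num_row_groups) for row in results[i]]


ordered = scan_ordered(len(ROW_GROUPS))
```

Under these assumptions the result no longer depends on which read finishes first. The trade-off is that all results are buffered before any are emitted, which is one reason such ordering might be offered as an option rather than as the default.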



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
