[ https://issues.apache.org/jira/browse/ARROW-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Angus Hollands updated ARROW-13153:
-----------------------------------
    Description: 
Hi all, thanks for the useful library!

I noticed when calling {{pyarrow.dataset.parquet_dataset}} that the order of 
the files ({{dataset.files}}) does not match the order stored in {{_metadata}}, 
as given by {{metadata.row_group(i).column(0).file_path}}. I'm not an Arrow 
expert by any means, but is this intentional?

I think the unordered map is the culprit, but I have not recompiled to test 
this theory. 
[https://github.com/apache/arrow/blob/133b1a904bf7fc1d24343c306a2279e27d4ebe6d/cpp/src/arrow/dataset/file_parquet.cc#L870]
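
A minimal sketch of how the mismatch can be checked from Python. The helper names {{paths_in_metadata_order}} and {{orders_match}} are hypothetical, and the comparison assumes that {{dataset.files}} holds the {{_metadata}} paths resolved against the dataset's base directory, so entries are compared by suffix:

```python
def paths_in_metadata_order(metadata):
    """Return unique file paths in the order their row groups first appear.

    `metadata` is expected to behave like pyarrow.parquet.FileMetaData:
    it exposes `num_row_groups` and `row_group(i).column(0).file_path`.
    """
    seen = {}  # dicts preserve insertion order, so this de-duplicates stably
    for i in range(metadata.num_row_groups):
        path = metadata.row_group(i).column(0).file_path
        seen.setdefault(path, None)
    return list(seen)


def orders_match(metadata_path):
    """Compare the order in `_metadata` with the order of `dataset.files`."""
    # Imported lazily so the helper above can be used without pyarrow.
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    expected = paths_in_metadata_order(pq.read_metadata(metadata_path))
    actual = ds.parquet_dataset(metadata_path).files
    # Assumption: each entry of dataset.files ends with the relative path
    # recorded in _metadata.
    return len(expected) == len(actual) and all(
        a.endswith(e) for e, a in zip(expected, actual)
    )
```

Calling {{orders_match("path/to/_metadata")}} (a placeholder path) should return {{False}} whenever the behaviour described above occurs.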


> `parquet_dataset` loses ordering of files in `_metadata`
> --------------------------------------------------------
>
>                 Key: ARROW-13153
>                 URL: https://issues.apache.org/jira/browse/ARROW-13153
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>            Reporter: Angus Hollands
>            Priority: Major
>
> Hi all, thanks for the useful library!
> I noticed when calling {{pyarrow.dataset.parquet_dataset}} that the order of 
> the files ({{dataset.files}}) does not match the order stored in 
> {{_metadata}}, as given by {{metadata.row_group(i).column(0).file_path}}. I'm 
> not an Arrow expert by any means, but is this intentional?
> I think the unordered map is the culprit, but I have not recompiled to test 
> this theory. 
> [https://github.com/apache/arrow/blob/133b1a904bf7fc1d24343c306a2279e27d4ebe6d/cpp/src/arrow/dataset/file_parquet.cc#L870]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
