westonpace commented on issue #33972: URL: https://github.com/apache/arrow/issues/33972#issuecomment-1413788554
The datasets feature went through considerable change a while back when it moved from a parquet-only feature to format-agnostic. Looks like this connection came loose in the conversion.

If you just want to read one file the approach is normally something more like:

```
import pyarrow.parquet as pq
pq.read_table(path)
```

If you're looking to read a collection of files you would normally use:

```
import pyarrow.dataset as ds
ds.dataset([paths]).to_table()
```

I suspect (though am not entirely certain) both of the above paths will only read the metadata once.

However, your usage is legitimate, and it even affects the normal datasets path when you scan the dataset multiple times (because we should be caching the metadata on the first scan and reusing on the second). So I would consider this a bug.

I don't know for sure but my guess is the problem is [here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L364). The fragment is opening a reader and should pass the metadata to the reader, if already populated.
