[GitHub] [arrow] westonpace commented on issue #33972: [Python] Remove redundant S3 call

via GitHub Thu, 02 Feb 2023 05:57:54 -0800


westonpace commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1413788554


   The datasets feature changed considerably a while back when it moved from 
being Parquet-only to format-agnostic.  It looks like this connection came 
loose in the conversion.  If you just want to read one file, the usual approach 
is something like:
   
   ```
   import pyarrow.parquet as pq
   # `path` is the location of a single Parquet file
   table = pq.read_table(path)
   ```
   
   If you're looking to read a collection of files you would normally use:
   
   ```
   import pyarrow.dataset as ds
   # `paths` is a list of Parquet file paths
   table = ds.dataset(paths).to_table()
   ```
   
   I suspect (though am not entirely certain) both of the above paths will only 
read the metadata once.
   
   However, your usage is legitimate, and the problem even affects the normal 
datasets path when you scan the dataset multiple times (because we should be 
caching the metadata on the first scan and reusing it on the second).  So I 
would consider this a bug.
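
   To make the expected behaviour concrete, here is a minimal pure-Python 
sketch of "cache on the first scan, reuse on the second".  `CountingStore` and 
`CachingFragment` are hypothetical names invented for this illustration; they 
are not Arrow classes and do not reflect how the C++ fragment is implemented.

```python
class CountingStore:
    """Hypothetical stand-in for remote storage; counts metadata reads."""

    def __init__(self):
        self.metadata_reads = 0

    def read_metadata(self, path):
        # Each call models one round trip to the storage backend.
        self.metadata_reads += 1
        return {"path": path, "num_rows": 3}


class CachingFragment:
    """Hypothetical fragment that keeps metadata after the first scan."""

    def __init__(self, store, path):
        self.store = store
        self.path = path
        self._metadata = None  # populated lazily on the first scan

    def scan(self):
        # Only fetch metadata when it has not been populated yet.
        if self._metadata is None:
            self._metadata = self.store.read_metadata(self.path)
        return self._metadata["num_rows"]


store = CountingStore()
fragment = CachingFragment(store, "bucket/data.parquet")
first = fragment.scan()
second = fragment.scan()
```

   In this toy model the second `scan()` reuses the cached metadata, so 
`store.metadata_reads` stays at one however many times the fragment is scanned.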
   
   I don't know for sure, but my guess is that the problem is 
[here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L364).
  The fragment opens a reader and should pass its metadata to the reader if 
that metadata is already populated.

