theogaraj opened a new issue, #40758:
URL: https://github.com/apache/arrow/issues/40758

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm using `pyarrow.dataset.dataset` and `pyarrow.dataset.write_dataset` to convert a newline-delimited JSON (jsonl) file to parquet, and I'm seeing very different end-to-end processing times for the following three approaches (sketched below):
   
   1. Let the `dataset` API handle all the filesystem details (223s)
   2. Pass `dataset` an `s3fs.S3FileSystem` object (70s)
   3. Use `smart_open` to handle the download/upload from/to S3 and run `dataset` against the local filesystem (30s)
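   
   Roughly, the three approaches look like the sketch below (not my exact code; the bucket names, paths, and the JSON format flag are placeholders, and the real snippets are in the StackOverflow question linked below):
   
   ```python
   import pyarrow.dataset as ds
   
   SRC_URI = "s3://my-bucket/input/data.jsonl"   # placeholder locations
   DST_URI = "s3://my-bucket/output/"
   
   # 1. Let the dataset API resolve the s3:// URIs itself (~223s)
   src = ds.dataset(SRC_URI, format="json")
   ds.write_dataset(src, DST_URI, format="parquet")
   
   # 2. Pass an s3fs.S3FileSystem explicitly (~70s)
   import s3fs
   fs = s3fs.S3FileSystem()
   src = ds.dataset("my-bucket/input/data.jsonl", filesystem=fs, format="json")
   ds.write_dataset(src, "my-bucket/output/", filesystem=fs, format="parquet")
   
   # 3. Download with smart_open, run the dataset API against local files (~30s)
   from smart_open import open as s_open
   with s_open(SRC_URI, "rb") as remote, open("/tmp/data.jsonl", "wb") as local:
       local.write(remote.read())
   src = ds.dataset("/tmp/data.jsonl", format="json")
   ds.write_dataset(src, "/tmp/output/", format="parquet")
   # ...and upload /tmp/output/ back to S3 (elided)
   ```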
   
   More detail, with code snippets, is documented in [this StackOverflow question](https://stackoverflow.com/questions/78207687/pyarrow-dataset-s3-performance-different-with-pyarrow-filesystem-s3fs-indirect).
   
   From previous use of `pyarrow.parquet.ParquetFile` I know that options like `buffer_size` and `pre_buffer` can impact performance, and I thought there might be similar options in the `dataset` API, but I couldn't find anything in the documentation. I would greatly appreciate some insight into this.
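   
   For context, these are the `ParquetFile`-level options I mean (a minimal sketch; the path and values are illustrative):
   
   ```python
   import pyarrow.parquet as pq
   
   pf = pq.ParquetFile(
       "some-file.parquet",          # illustrative path
       buffer_size=8 * 1024 * 1024,  # buffer reads of individual column chunks
       pre_buffer=True,              # coalesce reads up front, helps on high-latency filesystems like S3
   )
   table = pf.read()
   ```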
   
   ### Component(s)
   
   Python

