theogaraj opened a new issue, #40758: URL: https://github.com/apache/arrow/issues/40758
### Describe the usage question you have. Please include as many useful details as possible.

I'm using `pyarrow.dataset.dataset` and `pyarrow.dataset.write_dataset` to convert a newline-delimited JSON (jsonl) file to Parquet, and I'm seeing very different end-to-end processing times for the following three approaches (sketched below):

1. Let the `dataset` API handle all the filesystem details (223s)
2. Pass `dataset` an `s3fs.S3FileSystem` object (70s)
3. Use `smart_open` to handle the download/upload from/to S3 and run `dataset` on the local filesystem (30s)

More detail, with code snippets, is documented in [this StackOverflow question](https://stackoverflow.com/questions/78207687/pyarrow-dataset-s3-performance-different-with-pyarrow-filesystem-s3fs-indirect).

From previous use of `pyarrow.parquet.ParquetFile` I know that options like `buffer_size` and `pre_buffer` can impact performance, and I thought there might be similar options in the `dataset` API, but I couldn't find anything in the documentation. I would greatly appreciate some insight into this.

### Component(s)

Python
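For quick reference, the three approaches look roughly like this. This is a condensed sketch of the snippets in the StackOverflow post, assuming a pyarrow version with `format="json"` dataset support; bucket and key names are placeholders:

```python
import shutil
import pyarrow.dataset as ds

SRC = "s3://my-bucket/input/data.jsonl"   # hypothetical locations
DST = "s3://my-bucket/output/"

# 1. Let pyarrow resolve the S3 filesystem from the URI itself (~223s)
data = ds.dataset(SRC, format="json")
ds.write_dataset(data, DST, format="parquet")

# 2. Hand the dataset API an s3fs filesystem explicitly (~70s)
import s3fs
fs = s3fs.S3FileSystem()
data = ds.dataset("my-bucket/input/data.jsonl", format="json", filesystem=fs)
ds.write_dataset(data, "my-bucket/output/", format="parquet", filesystem=fs)

# 3. Copy to local disk with smart_open, then run dataset locally (~30s)
from smart_open import open as s3_open
with s3_open(SRC, "rb") as remote, open("/tmp/data.jsonl", "wb") as local:
    shutil.copyfileobj(remote, local)
data = ds.dataset("/tmp/data.jsonl", format="json")
ds.write_dataset(data, "/tmp/output/", format="parquet")
# (the local Parquet output is then uploaded back to S3, e.g. with smart_open)
```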

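For comparison, this is the kind of tuning I mean on the `ParquetFile` side, where `buffer_size` and `pre_buffer` noticeably change S3 read behaviour; I'm looking for something analogous when reading jsonl through the `dataset` API. The path and values here are just illustrative:

```python
import pyarrow.fs as pafs
import pyarrow.parquet as pq

s3 = pafs.S3FileSystem()  # region/credentials omitted

# Open a Parquet file on S3 and tune how the reader fetches bytes
with s3.open_input_file("my-bucket/some-file.parquet") as f:   # hypothetical path
    pf = pq.ParquetFile(
        f,
        buffer_size=64 * 1024,  # buffered stream reads
        pre_buffer=True,        # coalesce column-chunk reads, helps on high-latency stores
    )
    table = pf.read()
```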