lazear commented on issue #5490: URL: https://github.com/apache/arrow-rs/issues/5490#issuecomment-1986985334
Thanks for the quick response. After doing some more benchmarking locally, I think the slowdown is actually restricted to streaming over the network. Reading from a local file and dumping the output to stdout, I get approximately the same time for both CSV and Parquet with both the Rust code and python-polars (20-60 ms). Do you have any recommendations for row group size for this kind of use case? I am manually writing the parquet files using the low-level ColumnWriter API (first sketch below).

- In terms of streaming, this is for an API integration/frontend datatable with pagination. It's a good idea; I can try to see if it's possible to use streaming only instead.
- I am already using the object-store integration, but I haven't tried enabling the page index (second sketch below).
- The filters I'm applying should result in consecutive runs of matching rows most of the time, but I can also apply them in application code after receiving the rows.
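For context, here is a minimal sketch of how I'm controlling row group size with the low-level writer. With `SerializedFileWriter`, a row group ends wherever you close it, so the size is whatever chunking you apply yourself. The single-`INT32` schema, output path, and the 128K-row chunk size are all illustrative stand-ins, not my real workload:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::data_type::Int32Type;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;
use parquet::schema::parser::parse_message_type;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical single-column schema standing in for the real one.
    let schema = Arc::new(parse_message_type(
        "message schema { REQUIRED INT32 id; }",
    )?);
    let props = Arc::new(WriterProperties::builder().build());
    let file = File::create("out.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    let values: Vec<i32> = (0..1_000_000).collect();
    // With the low-level API the writer does not split row groups for you:
    // each next_row_group()/close() pair delimits one group, so row group
    // size is controlled entirely by how the data is chunked here.
    for chunk in values.chunks(128 * 1024) {
        let mut row_group = writer.next_row_group()?;
        if let Some(mut col) = row_group.next_column()? {
            col.typed::<Int32Type>().write_batch(chunk, None, None)?;
            col.close()?;
        }
        row_group.close()?;
    }
    writer.close()?;
    Ok(())
}
```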
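And this is roughly how I'd expect enabling the page index to look on the read side, assuming a recent arrow-rs with the `object_store` integration enabled. The function name and the in-loop handling are hypothetical; the point is `ArrowReaderOptions::with_page_index(true)` on the async stream builder:

```rust
use std::sync::Arc;

use futures::TryStreamExt;
use object_store::path::Path;
use object_store::ObjectStore;
use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::arrow::async_reader::{ParquetObjectReader, ParquetRecordBatchStreamBuilder};

// Hypothetical helper: stream record batches from any ObjectStore-backed
// Parquet file with the page index loaded.
async fn stream_with_page_index(
    store: Arc<dyn ObjectStore>,
    path: &Path,
) -> Result<(), Box<dyn std::error::Error>> {
    let meta = store.head(path).await?;
    let reader = ParquetObjectReader::new(store, meta);

    // Load the page index from the footer so row selections / predicates
    // can prune individual data pages instead of whole row groups.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder =
        ParquetRecordBatchStreamBuilder::new_with_options(reader, options).await?;

    let mut stream = builder.build()?;
    while let Some(batch) = stream.try_next().await? {
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```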
