lazear commented on issue #5490: URL: https://github.com/apache/arrow-rs/issues/5490#issuecomment-1986985334
Thanks for the quick response. After doing some more benchmarking locally, I think the slowdown is actually restricted to streaming over the network. Reading from a local file and dumping the output to stdout, I get approximately the same time for both CSV and Parquet with both the Rust code and python-polars (20-60 ms). Do you have any recommendations for row group size for this kind of use case? I am manually writing the parquet files using the low-level ColumnWriter API (first sketch below).

- In terms of streaming, this is for an API integration/frontend datatable with pagination. It's a good idea; I can try to see if it's possible to use streaming only instead.
- I am already using the object-store integration, but I haven't tried enabling the page index (second sketch below).
- The filters I'm applying should result in consecutive runs of matching rows most of the time, but I can also apply them in application code after receiving the rows.
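For context, here is a minimal sketch of how I'm controlling row group size with the low-level writer. With `SerializedFileWriter`, a row group ends wherever you close it, so the size is whatever chunking you apply yourself. The single-`INT32` schema, output path, and the 128K-row chunk size are all illustrative stand-ins, not my real workload:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::data_type::Int32Type;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;
use parquet::schema::parser::parse_message_type;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical single-column schema standing in for the real one.
    let schema = Arc::new(parse_message_type(
        "message schema { REQUIRED INT32 id; }",
    )?);
    let props = Arc::new(WriterProperties::builder().build());
    let file = File::create("out.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    let values: Vec<i32> = (0..1_000_000).collect();
    // With the low-level API the writer does not split row groups for you:
    // each next_row_group()/close() pair delimits one group, so row group
    // size is controlled entirely by how the data is chunked here.
    for chunk in values.chunks(128 * 1024) {
        let mut row_group = writer.next_row_group()?;
        if let Some(mut col) = row_group.next_column()? {
            col.typed::<Int32Type>().write_batch(chunk, None, None)?;
            col.close()?;
        }
        row_group.close()?;
    }
    writer.close()?;
    Ok(())
}
```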
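And this is roughly how I'd expect enabling the page index to look on the read side, assuming a recent arrow-rs with the `object_store` integration enabled. The function name and the in-loop handling are hypothetical; the point is `ArrowReaderOptions::with_page_index(true)` on the async stream builder:

```rust
use std::sync::Arc;

use futures::TryStreamExt;
use object_store::path::Path;
use object_store::ObjectStore;
use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::arrow::async_reader::{ParquetObjectReader, ParquetRecordBatchStreamBuilder};

// Hypothetical helper: stream record batches from any ObjectStore-backed
// Parquet file with the page index loaded.
async fn stream_with_page_index(
    store: Arc<dyn ObjectStore>,
    path: &Path,
) -> Result<(), Box<dyn std::error::Error>> {
    let meta = store.head(path).await?;
    let reader = ParquetObjectReader::new(store, meta);

    // Load the page index from the footer so row selections / predicates
    // can prune individual data pages instead of whole row groups.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder =
        ParquetRecordBatchStreamBuilder::new_with_options(reader, options).await?;

    let mut stream = builder.build()?;
    while let Some(batch) = stream.try_next().await? {
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```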
