rjzamora commented on PR #34281: URL: https://github.com/apache/arrow/pull/34281#issuecomment-1439238620
> do you have an idea if this would affect dask? I didn't find that dask exposes this keyword in the to_parquet function explicitly, but I suppose it can be passed through. But I can imagine that if most users rely on the default, and if you are then reading with split_row_groups=True, that might affect dask users?

Thanks for the ping @jorisvandenbossche! If I understand correctly, this change is more likely to benefit Dask than to cause problems. When using `to_parquet(..., engine="pyarrow")`, users sometimes end up with oversized row-groups that cause memory errors during ETL. These users can already pass `row_group_size` through to mitigate the issue, but it would be nice if the default made OOM errors a bit less common.

That said, the size of a single row depends entirely on the properties of the dataset, so I don't have strong feelings about the row-count default. If it were up to me, pyarrow would fall back to a byte-size default rather than a row-count default (as cudf does).
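To illustrate the byte-size fallback idea mentioned above, here is a minimal sketch of how a writer might derive a per-group row count from a byte budget instead of hard-coding a row count. This is purely illustrative: `rows_per_group` and its parameters are hypothetical names, not pyarrow or cudf API.

```python
# Hypothetical sketch of a byte-size-based row-group default (the kind of
# fallback described above, similar in spirit to cudf's approach).
# None of these names come from the pyarrow API.

def rows_per_group(row_bytes: int,
                   target_group_bytes: int = 128 * 1024 * 1024,
                   min_rows: int = 1) -> int:
    """Choose a row count so each row-group is roughly target_group_bytes."""
    if row_bytes <= 0:
        raise ValueError("row_bytes must be positive")
    return max(min_rows, target_group_bytes // row_bytes)

# With ~1 KiB rows and a 128 MiB budget, each group holds 131072 rows;
# with very wide (~1 GB) rows, the budget floor still yields 1 row per group.
print(rows_per_group(1024))    # 131072
print(rows_per_group(10**9))   # 1
```

The point is that a byte budget adapts to row width automatically, whereas a fixed row count can blow up memory for wide rows and fragment files for narrow ones.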
