rjzamora commented on PR #34281: URL: https://github.com/apache/arrow/pull/34281#issuecomment-1439238620
> do you have an idea if this would affect dask? I didn't find that dask exposes this keyword in the to_parquet function explicitly, but I suppose it can be passed through. But I can imagine that if most users rely on the default, and if you are then reading with split_row_groups=True, that might affect dask users?

Thanks for the ping @jorisvandenbossche! If I understand correctly, this change is more likely to benefit Dask than to cause problems. When using `to_parquet(..., engine="pyarrow")`, users sometimes end up with oversized row-groups that cause memory errors during ETL. These users can already pass `row_group_size` through to mitigate the issue, but it would be nice if the default made OOM errors a bit less common.

That said, the size of a single row depends entirely on the properties of the dataset, so I don't have strong feelings about the row-count default. If it were up to me, pyarrow would fall back to a byte-size default rather than a row-count default (as cudf does).
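To illustrate the byte-size fallback idea mentioned above, here is a minimal sketch of how a writer might derive a per-group row count from a byte budget instead of hard-coding a row count. This is purely illustrative: `rows_per_group` and its parameters are hypothetical names, not pyarrow or cudf API.

```python
# Hypothetical sketch of a byte-size-based row-group default (the kind of
# fallback described above, similar in spirit to cudf's approach).
# None of these names come from the pyarrow API.

def rows_per_group(row_bytes: int,
                   target_group_bytes: int = 128 * 1024 * 1024,
                   min_rows: int = 1) -> int:
    """Choose a row count so each row-group is roughly target_group_bytes."""
    if row_bytes <= 0:
        raise ValueError("row_bytes must be positive")
    return max(min_rows, target_group_bytes // row_bytes)

# With ~1 KiB rows and a 128 MiB budget, each group holds 131072 rows;
# with very wide (~1 GB) rows, the budget floor still yields 1 row per group.
print(rows_per_group(1024))    # 131072
print(rows_per_group(10**9))   # 1
```

The point is that a byte budget adapts to row width automatically, whereas a fixed row count can blow up memory for wide rows and fragment files for narrow ones.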
