kosiew opened a new pull request, #1367:
URL: https://github.com/apache/datafusion-python/pull/1367
## Which issue does this PR close?
* Closes #1362.
## Rationale for this change
Displaying a large DataFrame could ignore the configured `max_memory_bytes`
limit. With the prior defaults (`repr_rows=10`, `min_rows_display=20`), the
collection loop *always* continued until at least `min_rows_display` rows
were gathered, even after the memory limit had been exceeded, because
`min_rows_display` was greater than the intended row cap.
This PR fixes the root cause by:
* Making the minimum-rows safeguard compatible with the maximum-rows cap.
* Clarifying configuration intent by renaming `repr_rows` to `max_rows`.
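To illustrate the interaction described above, here is a minimal Python model of the collection loop. It is a sketch under stated assumptions, not the project's actual code: the function name `collect_for_display`, the `(num_rows, num_bytes)` batch representation, and the exact shape of the loop condition are all hypothetical, chosen only to reproduce the behavior this description reports.

```python
def collect_for_display(batches, max_bytes, min_rows, max_rows):
    """Simplified, hypothetical model of the display collection loop.

    `batches` is an iterable of (num_rows, num_bytes) pairs standing in for
    record batches. Returns (rows_collected, bytes_collected).
    """
    rows = 0
    size = 0
    batch_iter = iter(batches)
    # Assumed condition: keep collecting while within both limits,
    # OR while still below the minimum row count.
    while (size < max_bytes and rows < max_rows) or rows < min_rows:
        batch = next(batch_iter, None)
        if batch is None:
            break  # input exhausted
        num_rows, num_bytes = batch
        rows += num_rows
        size += num_bytes
    return rows, size


# 100 batches of 5 rows, roughly 1000 bytes each (made-up numbers).
batches = [(5, 1000)] * 100

# Prior defaults: the minimum (20) exceeds the row cap (10).
old = collect_for_display(batches, max_bytes=1500, min_rows=20, max_rows=10)
# New defaults: the minimum equals the row cap.
new = collect_for_display(batches, max_bytes=1500, min_rows=10, max_rows=10)

print(old)  # (20, 4000)
print(new)  # (10, 2000)
```

In this model, the old defaults keep the loop running to twice the row cap and well past the byte limit, because the `rows < min_rows` branch stays true after both limits are hit. Once `min_rows <= max_rows` holds, the minimum can no longer push collection beyond the row cap.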
## What changes are included in this PR?
* **New configuration parameter:** add `max_rows` as the authoritative
setting for the maximum number of rows shown in `__repr__` / HTML repr
rendering.
* **Deprecation path:** keep `repr_rows` as a deprecated alias for
`max_rows` (Python and Rust interop), with validation to prevent specifying
both inconsistently.
* **Validation improvements:** enforce `min_rows_display <= max_rows`
(Python) and `min_rows <= max_rows` (Rust) with clear error messages.
* **Default adjustment:** change the default `min_rows_display` (Python) /
`min_rows` (Rust) from 20 to 10 so defaults no longer violate the max-row cap.
* **Documentation updates:** update the user guide examples and narrative to
use `max_rows` instead of `repr_rows`.
* **Tests updated/added:** update unit tests to cover `max_rows` behavior
and new validation cases.
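The alias handling and validation rules listed above could look roughly like the following sketch. The helper `resolve_row_limits`, its error messages, the `DeprecationWarning`, and the default of 10 for `max_rows` are assumptions for illustration, not the PR's actual implementation. The sketch rejects `repr_rows` only when it conflicts with `max_rows`; this description does not settle whether passing both with equal values is also rejected.

```python
import warnings

# Assumption: the new default mirrors the prior `repr_rows=10` default.
DEFAULT_MAX_ROWS = 10


def resolve_row_limits(max_rows=None, repr_rows=None, min_rows_display=10):
    """Hypothetical helper: return (min_rows_display, max_rows) after validation."""
    if repr_rows is not None:
        # Assumption: the deprecated alias emits a DeprecationWarning.
        warnings.warn(
            "repr_rows is deprecated; use max_rows",
            DeprecationWarning,
            stacklevel=2,
        )
        if max_rows is not None and max_rows != repr_rows:
            raise ValueError("repr_rows conflicts with max_rows; pass max_rows only")
        max_rows = repr_rows

    if max_rows is None:
        max_rows = DEFAULT_MAX_ROWS

    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_rows, bool) or not isinstance(max_rows, int) or max_rows <= 0:
        raise ValueError("max_rows must be a positive integer")

    if min_rows_display > max_rows:
        raise ValueError("min_rows_display must be <= max_rows")

    return min_rows_display, max_rows
```

Resolving the alias before any other check means the positivity and `min_rows_display <= max_rows` rules apply identically whether the caller used the old or the new name.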
## Are these changes tested?
Yes.
* Updated existing formatter tests to use `max_rows`.
* Added validation tests to ensure:
  * `max_rows` must be a positive integer.
  * `min_rows_display` cannot exceed `max_rows`.
## Are there any user-facing changes?
Yes.
* **Config rename:** `repr_rows` is now deprecated in favor of `max_rows`.
  * Existing code using `repr_rows` continues to work via a
    backwards-compatible alias.
  * Supplying both `repr_rows` and `max_rows` (or supplying a conflicting
    `repr_rows`) raises a `ValueError` to avoid ambiguous configuration.
* **Default behavior:** the default minimum rows displayed changes from 20
to 10.
* **Docs:** examples now reference `max_rows`, and documentation clarifies
that `min_rows_display` must be `<= max_rows`.
## LLM-generated code disclosure
This PR includes code and comments generated with assistance from an LLM.
All LLM-generated content has been manually reviewed and tested.