kosiew opened a new pull request, #1367:
URL: https://github.com/apache/datafusion-python/pull/1367

   ## Which issue does this PR close?
   
   * Closes #1362.
   
   ## Rationale for this change
   
   Large DataFrames could ignore the configured `max_memory_bytes` limit during 
display. With the prior defaults (`repr_rows=10`, `min_rows_display=20`), the 
collection loop would *always* continue until at least `min_rows_display` rows 
were gathered, even when the memory limit had already been exceeded, because 
`min_rows_display` was greater than the intended row cap.
   
   This PR fixes the root cause by:
   
   * Making the minimum-rows safeguard compatible with the maximum-rows cap.
   * Clarifying configuration intent by renaming `repr_rows` to `max_rows`.
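
The interaction described above can be modelled with a small sketch. This is a simplified, hypothetical model and not the project's code: the function name `collect_row_count`, the per-row size list, and the exact loop condition are assumptions chosen to reproduce the behaviour the rationale describes.

```python
def collect_row_count(row_sizes, max_rows, min_rows, max_memory_bytes):
    """Hypothetical model of the display collection loop.

    row_sizes: estimated size in bytes of each row, in arrival order.
    Returns (rows_collected, bytes_collected).
    """
    rows = 0
    used = 0
    for size in row_sizes:
        # Normal stopping rule: stay under both the memory budget and the row cap.
        within_limits = used < max_memory_bytes and rows < max_rows
        # Safeguard: always gather at least `min_rows` rows.
        below_minimum = rows < min_rows
        if not (within_limits or below_minimum):
            break
        rows += 1
        used += size
    return rows, used


sizes = [100] * 50  # fifty rows of 100 bytes each

# Prior defaults: the minimum (20) exceeds the cap (10), so the loop
# runs to 20 rows even though the 250-byte budget is long gone.
old = collect_row_count(sizes, max_rows=10, min_rows=20, max_memory_bytes=250)

# New defaults: the minimum equals the cap, so collection stops at 10 rows.
new = collect_row_count(sizes, max_rows=10, min_rows=10, max_memory_bytes=250)
```

In this model the minimum-rows safeguard can still push past the memory budget, but once `min_rows <= max_rows` holds it can never push past the row cap.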
   
   ## What changes are included in this PR?
   
   * **New configuration parameter:** add `max_rows` as the authoritative 
setting for the maximum number of rows shown in `__repr__` / HTML repr 
rendering.
   * **Deprecation path:** keep `repr_rows` as a deprecated alias for 
`max_rows` (Python and Rust interop), with validation to prevent specifying 
both inconsistently.
   * **Validation improvements:** enforce `min_rows_display <= max_rows` 
(Python) and `min_rows <= max_rows` (Rust) with clear error messages.
   * **Default adjustment:** change the default `min_rows_display` (Python) / 
`min_rows` (Rust) from 20 to 10 so the defaults no longer violate the max-row cap.
   * **Documentation updates:** update the user guide examples and narrative to 
use `max_rows` instead of `repr_rows`.
   * **Tests updated/added:** update unit tests to cover `max_rows` behavior 
and new validation cases.
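
The two validation rules listed above might look roughly like the following on the Python side. This is a minimal sketch under stated assumptions: the helper name `validate_row_limits` and the error message wording are hypothetical and are not taken from the PR.

```python
def validate_row_limits(max_rows, min_rows_display):
    """Hypothetical validator for the row-limit settings.

    Raises ValueError when `max_rows` is not a positive integer or when
    `min_rows_display` exceeds `max_rows`.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_rows, bool) or not isinstance(max_rows, int) or max_rows <= 0:
        raise ValueError("max_rows must be a positive integer")
    if min_rows_display > max_rows:
        raise ValueError("min_rows_display must be less than or equal to max_rows")


validate_row_limits(max_rows=10, min_rows_display=10)  # new defaults: accepted

try:
    validate_row_limits(max_rows=10, min_rows_display=20)  # prior defaults
except ValueError as exc:
    rejected_message = str(exc)
```

Under this sketch the prior default pair (`max_rows=10`, `min_rows_display=20`) is rejected, which is why the default minimum had to change alongside the new check.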
   
   ## Are these changes tested?
   
   Yes.
   
   * Updated existing formatter tests to use `max_rows`.
   * Added validation tests to ensure:
   
     * `max_rows` must be a positive integer.
     * `min_rows_display` cannot exceed `max_rows`.
   
   ## Are there any user-facing changes?
   
   Yes.
   
   * **Config rename:** `repr_rows` is now deprecated in favor of `max_rows`.
   
     * Existing code using `repr_rows` continues to work via a 
backwards-compatible alias.
     * Supplying both `repr_rows` and `max_rows` (or supplying a conflicting 
`repr_rows`) raises a `ValueError` to avoid ambiguous configuration.
   * **Default behavior:** the default minimum rows displayed changes from 20 
to 10.
   * **Docs:** examples now reference `max_rows`, and documentation clarifies 
that `min_rows_display` must be `<= max_rows`.
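
One way a backwards-compatible alias of this kind can be resolved is sketched below. The function name `resolve_max_rows`, the use of `DeprecationWarning`, and the choice to raise only when the two values differ are assumptions made for illustration; the PR text does not specify whether a warning is emitted or exactly which combinations are rejected.

```python
import warnings


def resolve_max_rows(max_rows=None, repr_rows=None, default=10):
    """Hypothetical resolution of `max_rows` and its deprecated alias.

    Assumes a conflict means both values are supplied and differ.
    """
    if repr_rows is not None:
        # Assumption: the alias warns; the PR only states it is deprecated.
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if max_rows is not None and max_rows != repr_rows:
            raise ValueError("repr_rows conflicts with max_rows; pass only max_rows")
        return repr_rows
    return default if max_rows is None else max_rows


resolve_max_rows(max_rows=25)   # returns 25
resolve_max_rows(repr_rows=25)  # returns 25, with a DeprecationWarning
```

Existing callers that pass only `repr_rows` keep working in this sketch, while a call such as `resolve_max_rows(max_rows=25, repr_rows=50)` raises `ValueError`.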
   
   ## LLM-generated code disclosure
   
   This PR includes code and comments generated with assistance from an LLM. 
All LLM-generated content has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

