Re: [PR] Enforce DataFrame display memory limits with `max_rows` + `min_rows` constraint (deprecate `repr_rows`) [datafusion-python]

via GitHub Wed, 04 Feb 2026 04:23:19 -0800


timsaucer commented on code in PR #1367:
URL: 
https://github.com/apache/datafusion-python/pull/1367#discussion_r2763668011



##########
docs/source/user-guide/dataframe/rendering.rst:
##########
@@ -58,7 +58,7 @@ You can customize how DataFrames are rendered by configuring 
the formatter:
         max_height=300,            # Maximum height in pixels
         max_memory_bytes=2097152,  # Maximum memory for rendering (2MB)
         min_rows_display=20,       # Minimum number of rows to display
-        repr_rows=10,              # Number of rows to display in __repr__

Review Comment:
   It looks like the default here has `min_rows > max_rows`. Also should we 
have consistent naming of the two? Either `min_rows` and `max_rows` or 
`min_rows_display` and `max_rows_display`? 
   
   I think the `_display` was differentiating what happens during a `display()` 
call vs `__repr__` but I think these values get used during both calls.



##########
python/datafusion/dataframe_formatter.py:
##########
@@ -163,10 +253,13 @@ def __init__(
             Maximum height of the displayed table in pixels.
         max_memory_bytes : int, default 2097152 (2MB)
             Maximum memory in bytes for rendered data.
-        min_rows_display : int, default 20
-            Minimum number of rows to display.
-        repr_rows : int, default 10
-            Default number of rows to display in repr output.
+        min_rows_display : int, default 10

Review Comment:
   It's not about this PR per se, but maybe this is an opportunity to tighten 
up the comments here. We're repeating ourselves with the types and defaults. 
Those are already in the type hints. I think it's becoming customary to not 
duplicate that information and the argument line is the preferred place to keep 
it. That way we don't have to worry about maintaining the values in two places.



##########
python/tests/test_dataframe.py:
##########
@@ -1458,15 +1458,65 @@ def test_html_formatter_memory(df, 
clean_formatter_state):
     assert "data truncated" not in html_output.lower()
 
 
-def test_html_formatter_repr_rows(df, clean_formatter_state):
-    configure_formatter(min_rows_display=2, repr_rows=2)
+def test_html_formatter_memory_boundary_conditions(df, clean_formatter_state):

Review Comment:
   Maybe switch to `large_df` instead of `df` here?



##########
python/tests/test_dataframe.py:
##########
@@ -1458,15 +1458,65 @@ def test_html_formatter_memory(df, 
clean_formatter_state):
     assert "data truncated" not in html_output.lower()
 
 
-def test_html_formatter_repr_rows(df, clean_formatter_state):
-    configure_formatter(min_rows_display=2, repr_rows=2)
+def test_html_formatter_memory_boundary_conditions(df, clean_formatter_state):
+    """Test memory limit behavior at boundary conditions.
+
+    This test validates that the formatter correctly handles edge cases when
+    the memory limit is very close to actual data size, ensuring that min_rows
+    constraint is properly respected while respecting memory limits.
+    """
+
+    # Get the raw size of the data to test boundary conditions
+    # First, capture output with no limits
+    configure_formatter(max_memory_bytes=10 * MB, min_rows_display=1, 
max_rows=100)

Review Comment:
   If you do switch to large_df then I think it may go above the 100 limit you 
have



##########
python/tests/test_dataframe.py:
##########
@@ -1458,15 +1458,65 @@ def test_html_formatter_memory(df, 
clean_formatter_state):
     assert "data truncated" not in html_output.lower()
 
 
-def test_html_formatter_repr_rows(df, clean_formatter_state):
-    configure_formatter(min_rows_display=2, repr_rows=2)
+def test_html_formatter_memory_boundary_conditions(df, clean_formatter_state):
+    """Test memory limit behavior at boundary conditions.
+
+    This test validates that the formatter correctly handles edge cases when
+    the memory limit is very close to actual data size, ensuring that min_rows
+    constraint is properly respected while respecting memory limits.
+    """
+
+    # Get the raw size of the data to test boundary conditions
+    # First, capture output with no limits
+    configure_formatter(max_memory_bytes=10 * MB, min_rows_display=1, 
max_rows=100)
+    unrestricted_output = df._repr_html_()
+    unrestricted_rows = count_table_rows(unrestricted_output)
+
+    # Test 1: Very small memory limit should still respect min_rows
+    configure_formatter(max_memory_bytes=10, min_rows_display=1)

Review Comment:
   I think a better test is one where we do hit the memory limit well before we 
hit the min number of rows, hence the recommendation to switch to `large_df`. 
Actually, maybe we want a different dataframe, one where we know we have 
multiple batches instead of a single batch. The thing this isn't doing is 
verifying we've ended the stream early, but I think that would have to be a 
rust test instead of a pytest.



##########
docs/source/user-guide/dataframe/rendering.rst:
##########
@@ -191,7 +191,7 @@ You can control how much data is displayed and how much 
memory is used for rende
     configure_formatter(
         max_memory_bytes=4 * 1024 * 1024,  # 4MB maximum memory for display
         min_rows_display=50,               # Always show at least 50 rows
-        repr_rows=20                       # Show 20 rows in __repr__ output

Review Comment:
   Same as above, difference between `_display` and without the trailer and 
also we have here `min_rows > max_rows`



##########
python/datafusion/dataframe_formatter.py:
##########
@@ -231,6 +317,55 @@ def __init__(
         self._custom_cell_builder: Callable[[Any, int, int, str], str] | None 
= None
         self._custom_header_builder: Callable[[Any], str] | None = None
 
+    @property
+    def max_rows(self) -> int:
+        """Get the maximum number of rows to display.
+
+        Returns:
+            The maximum number of rows to display in repr output
+        """
+        return self._max_rows
+
+    @max_rows.setter
+    def max_rows(self, value: int) -> None:
+        """Set the maximum number of rows to display.
+
+        Args:
+            value: The maximum number of rows
+        """
+        self._max_rows = value
+
+    @property
+    def repr_rows(self) -> int:
+        """Get the maximum number of rows (deprecated name).
+
+        .. deprecated::
+            Use :attr:`max_rows` instead. This property is provided for
+            backward compatibility.
+
+        Returns:
+            The maximum number of rows to display
+        """
+        return self._max_rows
+
+    @repr_rows.setter
+    def repr_rows(self, value: int) -> None:
+        """Set the maximum number of rows using deprecated name.
+

Review Comment:
   Same, why add for deprecated?



##########
src/dataframe.rs:
##########
@@ -149,12 +153,12 @@ fn build_formatter_config_from_python(formatter: 
&Bound<'_, PyAny>) -> PyResult<
     let default_config = FormatterConfig::default();
     let max_bytes = get_attr(formatter, "max_memory_bytes", 
default_config.max_bytes);
     let min_rows = get_attr(formatter, "min_rows_display", 
default_config.min_rows);
-    let repr_rows = get_attr(formatter, "repr_rows", default_config.repr_rows);
+    let max_rows = get_attr(formatter, "max_rows", default_config.max_rows);

Review Comment:
   Since users may have provided their own implementation for the formatter, I 
think we need to first try getting `max_rows`. If that fails, try getting 
`repr_rows`. If that fails, take default. When we remove `repr_rows` entirely 
after it's been deprecated for a few releases, then we can revert to this 
simpler logic.



##########
python/datafusion/dataframe_formatter.py:
##########
@@ -231,6 +317,55 @@ def __init__(
         self._custom_cell_builder: Callable[[Any, int, int, str], str] | None 
= None
         self._custom_header_builder: Callable[[Any], str] | None = None
 
+    @property
+    def max_rows(self) -> int:
+        """Get the maximum number of rows to display.
+
+        Returns:
+            The maximum number of rows to display in repr output
+        """
+        return self._max_rows
+
+    @max_rows.setter
+    def max_rows(self, value: int) -> None:
+        """Set the maximum number of rows to display.
+
+        Args:
+            value: The maximum number of rows
+        """
+        self._max_rows = value
+
+    @property

Review Comment:
   If `repr_rows` is being deprecated, why add an accessor?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Enforce DataFrame display memory limits with `max_rows` + `min_rows` constraint (deprecate `repr_rows`) [datafusion-python]

Reply via email to