Re: [PR] Add missing aggregate functions [datafusion-python]

via GitHub Fri, 03 Apr 2026 12:23:47 -0700


Copilot commented on code in PR #1471:
URL: 
https://github.com/apache/datafusion-python/pull/1471#discussion_r3034074383



##########
python/tests/test_functions.py:
##########
@@ -1435,3 +1435,49 @@ def test_coalesce(df):
     assert result.column(0) == pa.array(
         ["Hello", "fallback", "!"], type=pa.string_view()
     )
+
+
+def test_percentile_cont():
+    ctx = SessionContext()
+    df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
+    result = df.aggregate(
+        [], [f.percentile_cont(column("a"), 0.5).alias("v")]
+    ).collect()[0]
+    assert result.column(0)[0].as_py() == 3.0
+
+
+def test_percentile_cont_with_filter():
+    ctx = SessionContext()
+    df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
+    result = df.aggregate(
+        [],
+        [
+            f.percentile_cont(
+                column("a"), 0.5, filter=column("a") > literal(1.0)
+            ).alias("v")
+        ],
+    ).collect()[0]
+    assert result.column(0)[0].as_py() == 3.5
+
+
+def test_grouping():
+    ctx = SessionContext()
+    df = ctx.from_pydict({"a": [1, 1, 2], "b": [10, 20, 30]})
+    # In a simple GROUP BY (no grouping sets), grouping() returns 0 for all 
rows.
+    # Note: grouping() must not be aliased directly in the aggregate 
expression list
+    # due to an upstream DataFusion analyzer limitation (the 
ResolveGroupingFunction
+    # rule doesn't unwrap Alias nodes). Apply aliases via a follow-up select 
instead.
+    result = df.aggregate(
+        [column("a")], [f.grouping(column("a")), f.sum(column("b")).alias("s")]
+    ).collect()
+    grouping_col = pa.concat_arrays([batch.column(1) for batch in 
result]).to_pylist()

Review Comment:
   This test relies on a fixed column index (`batch.column(1)`) to locate the 
`grouping(a)` output, which is brittle if the output schema/order changes. 
Since the comment already notes aliases should be applied via a follow-up 
`select`, consider selecting/renaming the grouping expression after 
`aggregate()` and then collecting by column name to make the assertion more 
robust.
   ```suggestion
       result = (
           df.aggregate(
               [column("a")], [f.grouping(column("a")), 
f.sum(column("b")).alias("s")]
           )
           .select(
               column("a"),
               column("grouping(a)").alias("grouping_a"),
               column("s"),
           )
           .collect()
       )
       grouping_col = [
           value for batch in result for value in 
batch.to_pydict()["grouping_a"]
       ]
   ```



##########
python/datafusion/functions.py:
##########
@@ -3581,6 +3625,30 @@ def array_agg(
     )
 
 
+def grouping(
+    expression: Expr,
+    distinct: bool | None = None,
+    filter: Expr | None = None,
+) -> Expr:
+    """Returns 1 if the data is aggregated across the specified column, or 0 
otherwise.
+
+    This function is used with ``GROUPING SETS``, ``CUBE``, or ``ROLLUP`` to
+    distinguish between aggregated and non-aggregated rows. In a regular
+    ``GROUP BY`` without grouping sets, it always returns 0.
+
+    Note: The ``grouping`` aggregate function is rewritten by the query
+    optimizer before execution, so it works correctly even though its

Review Comment:
   The docstring says the `grouping` function is rewritten by the query 
optimizer, but the Rust-side comment references the `ResolveGroupingFunction` 
analyzer rule. To avoid confusing users, update this wording to reflect the 
analyzer/planner rewrite rather than the optimizer (or align the terminology 
with DataFusion’s actual rule phase).
   ```suggestion
       Note: The ``grouping`` aggregate function is rewritten during query
       analysis/planning before execution, so it works correctly even though its
   ```



##########
python/datafusion/functions.py:
##########
@@ -3581,6 +3625,30 @@ def array_agg(
     )
 
 
+def grouping(
+    expression: Expr,
+    distinct: bool | None = None,

Review Comment:
   `distinct` is typed as `bool | None` with default `None`, but other 
aggregate wrappers that expose `distinct` use a non-optional `bool` defaulting 
to `False` (e.g., `array_agg`). Since the Rust layer only acts on 
`distinct=True`, consider changing this to `distinct: bool = False` (or remove 
the parameter entirely if `grouping` doesn’t support DISTINCT) to match the 
established API pattern.
   ```suggestion
       distinct: bool = False,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add missing aggregate functions [datafusion-python]

Reply via email to