Copilot commented on code in PR #1471:
URL:
https://github.com/apache/datafusion-python/pull/1471#discussion_r3034074383
##########
python/tests/test_functions.py:
##########
@@ -1435,3 +1435,49 @@ def test_coalesce(df):
assert result.column(0) == pa.array(
["Hello", "fallback", "!"], type=pa.string_view()
)
+
+
+def test_percentile_cont():
+ ctx = SessionContext()
+ df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
+ result = df.aggregate(
+ [], [f.percentile_cont(column("a"), 0.5).alias("v")]
+ ).collect()[0]
+ assert result.column(0)[0].as_py() == 3.0
+
+
+def test_percentile_cont_with_filter():
+ ctx = SessionContext()
+ df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
+ result = df.aggregate(
+ [],
+ [
+ f.percentile_cont(
+ column("a"), 0.5, filter=column("a") > literal(1.0)
+ ).alias("v")
+ ],
+ ).collect()[0]
+ assert result.column(0)[0].as_py() == 3.5
+
+
+def test_grouping():
+ ctx = SessionContext()
+ df = ctx.from_pydict({"a": [1, 1, 2], "b": [10, 20, 30]})
+ # In a simple GROUP BY (no grouping sets), grouping() returns 0 for all
rows.
+ # Note: grouping() must not be aliased directly in the aggregate
expression list
+ # due to an upstream DataFusion analyzer limitation (the
ResolveGroupingFunction
+ # rule doesn't unwrap Alias nodes). Apply aliases via a follow-up select
instead.
+ result = df.aggregate(
+ [column("a")], [f.grouping(column("a")), f.sum(column("b")).alias("s")]
+ ).collect()
+ grouping_col = pa.concat_arrays([batch.column(1) for batch in
result]).to_pylist()
Review Comment:
This test relies on a fixed column index (`batch.column(1)`) to locate the
`grouping(a)` output, which is brittle if the output schema/order changes.
Since the comment already notes aliases should be applied via a follow-up
`select`, consider selecting/renaming the grouping expression after
`aggregate()` and then collecting by column name to make the assertion more
robust.
```suggestion
result = (
df.aggregate(
[column("a")], [f.grouping(column("a")),
f.sum(column("b")).alias("s")]
)
.select(
column("a"),
column("grouping(a)").alias("grouping_a"),
column("s"),
)
.collect()
)
grouping_col = [
value for batch in result for value in
batch.to_pydict()["grouping_a"]
]
```
##########
python/datafusion/functions.py:
##########
@@ -3581,6 +3625,30 @@ def array_agg(
)
+def grouping(
+ expression: Expr,
+ distinct: bool | None = None,
+ filter: Expr | None = None,
+) -> Expr:
+ """Returns 1 if the data is aggregated across the specified column, or 0
otherwise.
+
+ This function is used with ``GROUPING SETS``, ``CUBE``, or ``ROLLUP`` to
+ distinguish between aggregated and non-aggregated rows. In a regular
+ ``GROUP BY`` without grouping sets, it always returns 0.
+
+ Note: The ``grouping`` aggregate function is rewritten by the query
+ optimizer before execution, so it works correctly even though its
Review Comment:
The docstring says the `grouping` function is rewritten by the query
optimizer, but the Rust-side comment references the `ResolveGroupingFunction`
analyzer rule. To avoid confusing users, update this wording to reflect the
analyzer/planner rewrite rather than the optimizer (or align the terminology
with DataFusion’s actual rule phase).
```suggestion
Note: The ``grouping`` aggregate function is rewritten during query
analysis/planning before execution, so it works correctly even though its
```
##########
python/datafusion/functions.py:
##########
@@ -3581,6 +3625,30 @@ def array_agg(
)
+def grouping(
+ expression: Expr,
+ distinct: bool | None = None,
Review Comment:
`distinct` is typed as `bool | None` with default `None`, but other
aggregate wrappers that expose `distinct` use a non-optional `bool` defaulting
to `False` (e.g., `array_agg`). Since the Rust layer only acts on
`distinct=True`, consider changing this to `distinct: bool = False` (or remove
the parameter entirely if `grouping` doesn’t support DISTINCT) to match the
established API pattern.
```suggestion
distinct: bool = False,
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]