The GitHub Actions job "Build and push images" on texera.git/main has failed.
Run started by GitHub user bobbai00 (triggered by bobbai00).

Head commit for run:
6980f19376c016b3505663827465aa41283e608e / carloea2 <[email protected]>
refactor(core): unify type ops and reuse in sort/agg (#4024)

### What changes were proposed in this PR?

1. **Centralize and extend `AttributeType` operations**

Move and refactor the existing attribute-type helpers into
`AttributeTypeUtils`:

   * `compare`, `add`, `zeroValue`, `minValue`, `maxValue`.
* Unify null-handling semantics across these operations (using
match-case instead of if + match).

   Extend support to additional types:

* Add comparison/aggregation support for `BOOLEAN`, `STRING`, and
`BINARY`.

   Change numeric coercion strategy:

* Coerce numeric values to `Number` instead of a specific primitive type
(e.g., `Double`) to reduce `ClassCastException`s when the input is not
strictly schema-validated.
* Preserve existing comparison semantics for doubles by delegating to
`java.lang.Double.compare` (including handling of ±∞ and `NaN`).

   Introduce “identity” helpers:

* `zeroValue` returns an additive identity for numeric/timestamp types,
and `Array.emptyByteArray` for `BINARY` as a safe, non-throwing
identity.
* `minValue` / `maxValue`: provide lower/upper bounds for supported
numeric and timestamp types.
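
The semantics described in item 1 can be sketched in Python as follows. This is only an illustration, not the Scala code in `AttributeTypeUtils`: the function names are hypothetical, and the choice that `None` sorts before non-null values is an assumption (the description only says null handling is unified). The double ordering mirrors what `java.lang.Double.compare` documents: `NaN` is greater than everything, including +∞, and `-0.0` is less than `0.0`.

```python
import math

def compare_double(a: float, b: float) -> int:
    """Total ordering in the style of java.lang.Double.compare."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # NaN sorts last; two NaNs compare as equal.
        return int(a_nan) - int(b_nan)
    if a < b:
        return -1
    if a > b:
        return 1
    # Numerically equal: distinguish -0.0 from 0.0 by sign.
    sa, sb = math.copysign(1.0, a), math.copysign(1.0, b)
    return (sa > sb) - (sa < sb)

def compare_nullable(a, b) -> int:
    """Null-aware compare. ASSUMPTION: None sorts before non-null values."""
    if a is None and b is None:
        return 0
    if a is None:
        return -1
    if b is None:
        return 1
    # Coerce any numeric input (int or float) instead of casting to one type.
    return compare_double(float(a), float(b))

def add_nullable(a, b):
    """None acts as the additive identity (both None yields None in this sketch)."""
    if a is None:
        return b
    if b is None:
        return a
    return a + b
```

For example, `compare_nullable(None, 3)` is negative under the stated assumption, and `add_nullable(None, 5)` returns `5`.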

2. **Refactor operators to reuse `AttributeTypeUtils`**

* `AggregationOperation`: implement `SUM` / `MIN` / `MAX` using the
centralized helpers instead of custom per-operator logic.
* `StableMergeSortOpExec`: reuse the typed compare logic from
`AttributeTypeUtils`.
* `SortPartitionsOpExec`: simplify to use a one-liner comparator based
on `AttributeTypeUtils.compare` (or a thin wrapper) for clarity and
reuse.
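
As a rough illustration of how a sort operator can be reduced to a thin wrapper around one shared compare function, here is a Python sketch. The names `compare_nullable` and `row_comparator`, the `(column, descending)` key format, and the nulls-first ordering are all hypothetical and are not taken from the Texera operators.

```python
from functools import cmp_to_key

def compare_nullable(a, b) -> int:
    """Shared typed compare. ASSUMPTION: None sorts before non-null values."""
    if a is None and b is None:
        return 0
    if a is None:
        return -1
    if b is None:
        return 1
    return (a > b) - (a < b)

def row_comparator(sort_keys):
    """Build a row comparator from (column_name, descending) pairs."""
    def cmp(r1, r2):
        for col, descending in sort_keys:
            c = compare_nullable(r1.get(col), r2.get(col))
            if c != 0:
                return -c if descending else c
        return 0  # All keys tie; a stable sort keeps the input order.
    return cmp

rows = [
    {"city": "B", "sales": 1},
    {"city": "A", "sales": None},
    {"city": "A", "sales": 5},
]
# Python's sorted() is stable, so ties keep their original order.
sorted_rows = sorted(
    rows, key=cmp_to_key(row_comparator([("city", False), ("sales", True)]))
)
```

With the comparison logic in one place, each sorting operator only has to describe its keys and directions.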

3. **Add tests**
* workflow-core/src/test/scala/org/apache/amber/core/tuple/AttributeTypeUtilsSpec.scala
* **compare**: Verifies correct null-handling and ordering for INTEGER,
BOOLEAN, TIMESTAMP, STRING, and BINARY values.
* **add**: Ensures `null` acts as identity and confirms correct addition
for INTEGER, LONG, DOUBLE, and TIMESTAMP.
* **zeroValue**: Checks that numeric/timestamp zero identities and empty
binary array for BINARY are returned, and that unsupported types (e.g.,
STRING) throw.
* **minValue / maxValue**: Validate correct numeric and timestamp
bounds, BINARY minimum, and exceptions for unsupported types (e.g.,
BOOLEAN, STRING).
* workflow-operator/src/test/scala/org/apache/amber/operator/aggregate/AggregateOpSpec.scala
* Verifies `getAggregationAttribute` chooses the correct result type for
different functions (SUM keeps input type, COUNT → INTEGER, CONCAT →
STRING).
* Checks `getAggFunc` SUM behavior for INTEGER and DOUBLE columns,
ensuring correct totals and preserved fractional values.
* Tests COUNT, CONCAT, MIN, MAX, and AVERAGE aggregations, including
correct handling of `null` values and edge cases like “no rows”.
* Confirms `getFinal` rewrites COUNT into a SUM on the intermediate
count column and rewires attributes correctly for non-COUNT functions.
* Exercises `AggregateOpExec` end-to-end: SUM grouped by a key (city)
and combined global SUM+COUNT with no group-by keys, validating the
produced tuples.
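
The COUNT-to-SUM rewrite checked by the `getFinal` tests follows the usual two-phase aggregation pattern: each partition produces a partial count, and the final stage sums those partial counts. A minimal Python sketch of that idea, with hypothetical function names and the assumption that COUNT skips `None` values:

```python
from collections import defaultdict

def partial_count(rows, key, col):
    """Per-partition COUNT of non-null values in `col`, grouped by `key`."""
    counts = defaultdict(int)
    for row in rows:
        # ASSUMPTION: nulls are not counted, but the group still appears.
        counts[row[key]] += 0 if row[col] is None else 1
    return dict(counts)

def final_count(partials):
    """Final stage: COUNT becomes a SUM over the intermediate counts."""
    totals = defaultdict(int)
    for partial in partials:
        for group, count in partial.items():
            totals[group] += count
    return dict(totals)

partition_1 = [{"city": "LA", "v": 1}, {"city": "SF", "v": None}]
partition_2 = [{"city": "LA", "v": 3}]

result = final_count([
    partial_count(partition_1, "city", "v"),
    partial_count(partition_2, "city", "v"),
])
```

Counting the partial rows again in the final stage would give the number of partitions per group rather than the number of input rows, which is why the final function has to be a sum.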


4. **Scope / non-goals / Extras**
   * No change to external APIs
* Main behavior changes are localized to `AttributeType` operations and
the operators that consume them.

---

**Any related issues, documentation, discussions?**

* Closes: #3923

**How was this PR tested?**

Workflow Image:
<img width="1684" height="859" alt="image"
src="https://github.com/user-attachments/assets/2682ebdc-0f45-40c6-b304-0cea0b76b44f"
/>

Workflow file: 

[agg_test_1.json](https://github.com/user-attachments/files/23540242/agg_test_1.json)

Python benchmark:

```python
import pandas as pd

df = pd.read_csv("/mnt/data/test.csv")

# Limit BEFORE sorting
df_limited = df.head(1000)

# Now sort ascending
df_sorted = df_limited.sort_values("rna_umis", ascending=True)

# Group by pass_all_filters with aggregations
agg = df_sorted.groupby("pass_all_filters")["rna_umis"].agg(
    min="min", max="max", count="count", avg="mean", sum="sum"
).reset_index()

# Print the result so the script also works outside a notebook
print(agg)
```
Python Result:
<img width="928" height="188" alt="image"
src="https://github.com/user-attachments/assets/69da33cd-ada4-4b05-a3f9-ae139f8575b9"
/>

Texera Result (Avg):

pass_all_filters | min | max | count | avg | sum
-- | -- | -- | -- | -- | --
False | 0 | 80926 | 240 | 15987.68 | 3837043
True | 11893 | 102559 | 760 | 35557.93 | 27024027
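
The reported numbers can be sanity-checked without the original CSV. The sketch below assumes the columns are `pass_all_filters`, min, max, count, avg, sum, in the same order as the pandas `agg` call above, and that avg is rounded to two decimals. It checks that avg × count matches sum within rounding error, that avg lies between min and max, and that the counts add up to the 1000 rows kept by `head(1000)`.

```python
# Rows copied from the Texera result table.
# ASSUMED column order: (pass_all_filters, min, max, count, avg, sum)
rows = [
    (False, 0, 80926, 240, 15987.68, 3837043),
    (True, 11893, 102559, 760, 35557.93, 27024027),
]

def is_consistent(count, avg, total, decimals=2):
    """avg * count should equal total up to the rounding applied to avg."""
    tolerance = count * 0.5 * 10 ** -decimals
    return abs(avg * count - total) <= tolerance

checks = [
    is_consistent(count, avg, total) and lo <= avg <= hi
    for _, lo, hi, count, avg, total in rows
]
total_rows = sum(count for _, _, _, count, _, _ in rows)
```

Both rows pass the check, and the counts total 1000.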

For timestamps test:
- 1970-01-01T00:00:00Z
- 2000-02-29T12:00:00Z
- 2024-12-31T23:59:59Z


1. Avg:

- New version: 909835199750
- Previous version: 909835199750

2. Sum:

- New version: 2055-03-01T05:59:59.000Z (UTC)
- Previous version: 2055-03-01T11:59:59.000Z (UTC-6; Mexico City Time)

**Was this PR authored or co-authored using generative AI tooling?**

* Co-authored with ChatGPT.

Report URL: https://github.com/apache/texera/actions/runs/19588898040

With regards,
GitHub Actions via GitBox
