The GitHub Actions job "Build and push images" on texera.git/main has failed. Run started by GitHub user bobbai00 (triggered by bobbai00).
Head commit for run:
6980f19376c016b3505663827465aa41283e608e / carloea2 <[email protected]>
refactor(core): unify type ops and reuse in sort/agg (#4024)

### What changes were proposed in this PR?

1. **Centralize and extend `AttributeType` operations**

   Move and refactor the existing attribute-type helpers into `AttributeTypeUtils`:
   * `compare`, `add`, `zeroValue`, `minValue`, `maxValue`.
   * Unify null-handling semantics across these operations (use of match-case instead of if + match).

   Extend support to additional types:
   * Add comparison/aggregation support for `BOOLEAN`, `STRING`, and `BINARY`.

   Change numeric coercion strategy:
   * Coerce numeric values to `Number` instead of a specific primitive type (e.g., `Double`) to reduce `ClassCastException`s when the input is not strictly schema-validated.
   * Preserve existing comparison semantics for doubles by delegating to `java.lang.Double.compare` (including handling of ±∞ and `NaN`).

   Introduce “identity” helpers:
   * `zeroValue` returns an additive identity for numeric/timestamp types, and `Array.emptyByteArray` for `BINARY` as a safe, non-throwing identity.
   * `minValue` / `maxValue`: provide lower/upper bounds for supported numeric and timestamp types.

2. **Refactor operators to reuse `AttributeTypeUtils`**
   * `AggregationOperation`: implement `SUM` / `MIN` / `MAX` using the centralized helpers instead of custom per-operator logic.
   * `StableMergeSortOpExec`: reuse the typed compare logic from `AttributeTypeUtils`.
   * `SortPartitionsOpExec`: simplify to use a one-liner comparator based on `AttributeTypeUtils.compare` (or a thin wrapper) for clarity and reuse.

3. **Add tests**
   * workflow-core/src/test/scala/org/apache/amber/core/tuple/AttributeTypeUtilsSpec.scala
     * **compare**: Verifies correct null-handling and ordering for INTEGER, BOOLEAN, TIMESTAMP, STRING, and BINARY values.
     * **add**: Ensures `null` acts as identity and confirms correct addition for INTEGER, LONG, DOUBLE, and TIMESTAMP.
     * **zeroValue**: Checks that numeric/timestamp zero identities and empty binary array for BINARY are returned, and that unsupported types (e.g., STRING) throw.
     * **minValue / maxValue**: Validate correct numeric and timestamp bounds, BINARY minimum, and exceptions for unsupported types (e.g., BOOLEAN, STRING).
   * workflow-operator/src/test/scala/org/apache/amber/operator/aggregate/AggregateOpSpec.scala
     * Verifies `getAggregationAttribute` chooses the correct result type for different functions (SUM keeps input type, COUNT → INTEGER, CONCAT → STRING).
     * Checks `getAggFunc` SUM behavior for INTEGER and DOUBLE columns, ensuring correct totals and preserved fractional values.
     * Tests COUNT, CONCAT, MIN, MAX, and AVERAGE aggregations, including correct handling of `null` values and edge cases like “no rows”.
     * Confirms `getFinal` rewrites COUNT into a SUM on the intermediate count column and rewires attributes correctly for non-COUNT functions.
     * Exercises `AggregateOpExec` end-to-end: SUM grouped by a key (city) and combined global SUM+COUNT with no group-by keys, validating the produced tuples.

4. **Scope / non-goals / Extras**
   * No change to external APIs.
   * Main behavior changes are localized to `AttributeType` operations and the operators that consume them.
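The identity semantics described above (`null` acting as the identity for `add`, `zeroValue` as the starting point of a SUM, and unsupported types throwing) can be illustrated with a small sketch. This is not the Scala code from the PR: it is written in Python to match the benchmark used for testing, and the names `ZERO_VALUES`, `zero_value`, `add`, and `sum_column`, as well as the string type tags, are hypothetical stand-ins chosen for illustration.

```python
from functools import reduce

# Hypothetical type tags standing in for a few AttributeType cases.
# STRING is deliberately absent: the PR's tests expect it to throw.
ZERO_VALUES = {
    "INTEGER": 0,
    "LONG": 0,
    "DOUBLE": 0.0,
    "BINARY": b"",  # empty byte array as a non-throwing identity
}


def zero_value(attr_type):
    """Return the identity value for attr_type, or raise if unsupported."""
    if attr_type not in ZERO_VALUES:
        raise ValueError(f"zero value is not defined for {attr_type}")
    return ZERO_VALUES[attr_type]


def add(left, right):
    """Add two values, treating None (null) as the identity."""
    if left is None:
        return right
    if right is None:
        return left
    return left + right


def sum_column(values, attr_type):
    """Fold a column with add, starting from the type's zero value."""
    return reduce(add, values, zero_value(attr_type))
```

Under these assumptions, `sum_column([1, None, 2], "INTEGER")` skips the null and returns 3, and an empty DOUBLE column folds to the zero value 0.0, which mirrors the “no rows” edge case mentioned in the test list.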
---

**Any related issues, documentation, discussions?**

* Closes: #3923

**How was this PR tested?**

Workflow Image:
<img width="1684" height="859" alt="image" src="https://github.com/user-attachments/assets/2682ebdc-0f45-40c6-b304-0cea0b76b44f" />

Workflow file: [agg_test_1.json](https://github.com/user-attachments/files/23540242/agg_test_1.json)

Python benchmark:

```
import pandas as pd

df = pd.read_csv("/mnt/data/test.csv")

# Limit BEFORE sorting
df_limited = df.head(1000)

# Now sort ascending
df_sorted = df_limited.sort_values("rna_umis", ascending=True)

# Group by pass_all_filters with aggregations
agg = df_sorted.groupby("pass_all_filters")["rna_umis"].agg(
    min="min", max="max", count="count", avg="mean", sum="sum"
).reset_index()

agg
```

Python Result:
<img width="928" height="188" alt="image" src="https://github.com/user-attachments/assets/69da33cd-ada4-4b05-a3f9-ae139f8575b9" />

Texera Result (Avg):

pass_all_filters | min | max | count | avg | sum
-- | -- | -- | -- | -- | --
False | 0 | 80926 | 240 | 15987.68 | 3837043
True | 11893 | 102559 | 760 | 35557.93 | 27024027

For timestamps test:
- 1970-01-01T00:00:00Z
- 2000-02-29T12:00:00Z
- 2024-12-31T23:59:59Z

1. Avg:
   - New version: 909835199750
   - Previous version: 909835199750
2. Sum:
   - New version: 2055-03-01T05:59:59.000Z (UTC)
   - Previous version: 2055-03-01T11:59:59.000Z (UTC-6; Mexico City Time)

**Was this PR authored or co-authored using generative AI tooling?**

* Co-authored with ChatGPT.

Report URL: https://github.com/apache/texera/actions/runs/19621603063

With regards,
GitHub Actions via GitBox
