lriggs opened a new pull request, #49986:
URL: https://github.com/apache/arrow/pull/49986

   ### Rationale for this change
   Silently allowing duplicate functions that only differ by return type 
provides a confusing function registry. Callers do not typically expect the 
return type to be part of determining which function to call leading to 
ambiguity.
   
   It is possible to register different Gandiva functions with the same alias 
and parameters but different return types, resulting in confusing function 
overloads.
   
   For example, DATE_EXTRACTION_TRUNCATION_FNS in 
[cpp/src/gandiva/function_registry_datetime.cc] was invoked twice with the same 
SQL alias lists — once for extract* (returns int64) and once for date_trunc_* 
(returns the input date/timestamp type):
   ```
   
   DATE_EXTRACTION_TRUNCATION_FNS(EXTRACT_SAFE_NULL_IF_NULL, extract)
   DATE_EXTRACTION_TRUNCATION_FNS(TRUNCATE_SAFE_NULL_IF_NULL, date_trunc_)
   ```
   As a result the registry contained four entries for day(...) where there 
should have been two:
   
   ```
   int64 day(timestamp)   → extractDay_timestamp
   int64 day(date)        → extractDay_date64
   timestamp day(timestamp) → date_trunc_Day_timestamp
   date day(date)         → date_trunc_Day_date64
   ```
   The same problem existed for every calendar-unit alias: year, month, 
quarter, week, weekofyear, yearweek, dayofmonth, hour, minute, second. 
Resolution behavior depended on the caller's inferred return type, which is not 
the SQL semantics anyone expects from day(timestamp_col).
   
   FunctionRegistry::Add was silently allowing these registrations: 
unordered_map::emplace keeps the first entry and discards subsequent ones with 
no warning.
   
   ### What changes are included in this PR?
   #### Diagnostics
   Added two ARROW_LOG(ERROR) checks inside FunctionRegistry::Add:
   **Duplicate-signature check** — when pc_registry_map_.emplace reports the 
entry already existed (same name + params + return type), log the conflict 
including both pc_names.
   **Alias-collision check** — a new call_shape_map_ keyed on (lower(name), 
param_type_ids) (return type excluded) detects the case where the same 
name(args) could resolve to two functions with different return types. This is 
the check that catches the day family.
   These run at registry-construction time, so future regressions surface in 
test output rather than being silently shadowed.
   
   #### Macro split 
   **DATE_EXTRACTION_TRUNCATION_FNS** is split into two macros:
   
   **DATE_EXTRACTION_FNS** — keeps the existing SQL alias lists ({"year"}, 
{"day", "dayofmonth"}, etc.).
   **DATE_TRUNCATION_FNS** — every alias list is {}. The truncate functions are 
still reachable through their date_trunc_* base names.
   
   #### Strengthened registry test 
   
   Added a third loop to TestNoDuplicates that mirrors the runtime 
alias-collision check: keys each registered signature on (lower(name), 
param_types) only and fails if two signatures share that key but disagree on 
return type. The two pre-existing loops still cover their original cases (same 
pc_name + params + ret, and same full signature appearing twice).
   
   #### Cleanup
   
   Removed the standalone NativeFunction("add", ..., date64+int64 → timestamp, 
"add_date64_int64"). DATE_ADD_FNS(add, {}) already provides the entry with the 
correct → date64 return type matching the precompiled symbol.
   
   Dropped the {"convert_fromutf8"} and {"convert_replaceutf8"} aliases; the 
base names already match case-insensitively.
   
   ### Are these changes tested?
   Registry construction is now silent — no duplicate or alias-collision logs.
   TestFunctionRegistry — 6/6 pass, including the strengthened TestNoDuplicates.
   ctest -R gandiva — 4/4 binaries pass (gandiva-internals-test, 
gandiva-precompiled-test, gandiva-projector-test, 
gandiva-projector-test-static).
   
   ### Are there any user-facing changes?
   The Gandiva function list is different, with some aliases no longer 
appearing. Functionality is still available.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to