Phixsura commented on PR #38704:
URL: https://github.com/apache/shardingsphere/pull/38704#issuecomment-4527928042

   > Decision
   > 
   > Merge Verdict: Not Mergeable
   > 
   > Reviewed Scope: PR #38704 latest head 
[e8bfdc6](https://github.com/apache/shardingsphere/commit/e8bfdc6fb0822b6ca856cecc859ac5e9a55642e2);
 8 files in kernel/sql-federation/compiler; linked issues #31897, #31898, 
#31899; MySQL string-function semantics; shared SQL Federation operator-table 
path for MySQL/PostgreSQL/openGauss/Oracle/default.
   > 
   > Not Reviewed Scope: no full local Maven run or live MySQL/PostgreSQL 
runtime execution; relied on latest GitHub checks plus static/root-cause review.
   > 
   > Need Expert Review: SQL Federation/Calcite operator-table reviewer 
recommended.
   > 
   > Positive Feedback
   > 
   > The PR is in the right direction by adding dedicated operator entries and 
unit coverage for simple LENGTH, BIT_LENGTH, and LOCATE cases. The change is 
localized and has CI green on the latest head, aside from skipped matrix stages.
   > 
   > Major Issues
   > 
   > [P1] #31899 is not fully fixed. The reported LOCATE(x'64', 
_utf8mb4'abcdef') failure was unsupported CommonExpressionSegment, but 
ExpressionConverter.java (line 97) still throws for CommonExpressionSegment. 
The PR only adds a LOCATE operator whose metadata accepts STRING operands, and 
the new compiler cases only cover plain character literals. Please either fix 
binary/hex literal conversion and add the exact regression case, or do not 
close that part of #31899 yet.
   > 
   > [P1] MySQL string semantics are incomplete for charset/collation. 
MySQLLengthFunction.java (line 61) and MySQLBitLengthFunction hardcode UTF-8 
byte counting. MySQL documents LENGTH() as byte length of the string and 
supports character-set introducers such as _latin1 and _binary; LOCATE() is 
case-sensitive only when an argument is binary. This creates counterexamples 
such as _latin1'Müller' byte length and nonbinary case-insensitive LOCATE. 
Please make the implementation charset/collation-aware, or explicitly narrow 
the supported behavior with tests and reviewer agreement. Sources: [MySQL 
string 
functions](https://dev.mysql.com/doc/refman/8.4/en/string-functions.html), 
[MySQL character set 
introducers](https://dev.mysql.com/doc/refman/8.4/en/charset-introducer.html).
   > 
   > [P1] Blast-radius risk across non-MySQL SQL Federation remains unresolved. 
CompilerContextFactory.java (line 77) always chains new MySQLOperatorTable() 
before the database-specific Calcite library, so adding LENGTH, BIT_LENGTH, and 
LOCATE exposes MySQL behavior to PostgreSQL, openGauss, Oracle, and 
fallback/default contexts. PostgreSQL documents bit_length, octet_length, 
position, and strpos, not MySQL LOCATE(substr, str, pos) semantics. Please gate 
these functions to MySQL or add explicit non-target dialect counterexamples 
proving no semantic regression. Source: [PostgreSQL string 
functions](https://www.postgresql.org/docs/current/functions-string.html).
   > 
   > Newly Introduced Issues
   > 
   > git diff --check reports trailing whitespace in the new files, and the new 
parameterized tests use display names like length({0}) -> {1} instead of the 
required "{0}" template. Please clean these before the next revision.
   > 
   > Next Steps
   > 
   > Add production-path regression tests for the exact issue SQLs, including 
_latin1, _utf8mb4, binary/hex literals, column/WHERE usage, and nested function 
cases.
   > 
   > Add cross-dialect validation for at least PostgreSQL/openGauss showing 
MySQL-only functions are not leaked, or restrict MySQLOperatorTable 
registration to MySQL.
   > 
   > Re-run scoped verification with dependencies, for example ./mvnw -pl 
kernel/sql-federation/compiler -am -DskipITs -Dspotless.skip=true 
-Dtest=SQLStatementCompilerIT,MySQLLengthFunctionTest,MySQLBitLengthFunctionTest,MySQLLocateFunctionTest
 test.
   
   Thanks for the thorough review, all three P1s hold up.
   
   On P1.1: I misread #31899 as a single task. It's actually a checkbox list 
and the `LOCATE(x'64', _utf8mb4'abcdef')` case has its real root cause in 
`ExpressionConverter.java:97` (CommonExpressionSegment), which this PR doesn't 
touch. I'll drop `Closes #31899` from the description and leave that case for a 
separate fix.
   
   On P1.2: Acknowledged on charset/collation. Would you prefer (a) narrowing 
the implementation to ASCII-only with explicit tests and Javadoc, or (b) making 
it charset-aware? I haven't traced how Calcite surfaces charset/collation info 
into UDFs yet, so (b) is open-ended for me.
   
   On P1.3: I'd like to confirm scope here. 
`CompilerContextFactory.getOperatorTables()` already prepends `new 
MySQLOperatorTable()` for every dialect today, but this PR is what makes the 
leak actually break things, since LENGTH is the first registered function with 
a real semantic clash against PostgreSQL/openGauss/Oracle. Two options:
   
   1. Restrict registration to the MySQL context inside this PR (touches 
`CompilerContextFactory`, expands the diff).
   2. Drop LENGTH/BIT_LENGTH/LOCATE from this PR entirely and submit an 
architecture-fix PR first.
   
   Which would you prefer?
   
   On the nits: trailing whitespace and the parameterized-test display names 
will be cleaned up in the next revision.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to