Phixsura commented on PR #38704: URL: https://github.com/apache/shardingsphere/pull/38704#issuecomment-4527928042
> Decision > > Merge Verdict: Not Mergeable > > Reviewed Scope: PR #38704 latest head [e8bfdc6](https://github.com/apache/shardingsphere/commit/e8bfdc6fb0822b6ca856cecc859ac5e9a55642e2); 8 files in kernel/sql-federation/compiler; linked issues #31897, #31898, #31899; MySQL string-function semantics; shared SQL Federation operator-table path for MySQL/PostgreSQL/openGauss/Oracle/default. > > Not Reviewed Scope: no full local Maven run or live MySQL/PostgreSQL runtime execution; relied on latest GitHub checks plus static/root-cause review. > > Need Expert Review: SQL Federation/Calcite operator-table reviewer recommended. > > Positive Feedback > > The PR is in the right direction by adding dedicated operator entries and unit coverage for simple LENGTH, BIT_LENGTH, and LOCATE cases. The change is localized and has CI green on the latest head, aside from skipped matrix stages. > > Major Issues > > [P1] #31899 is not fully fixed. The reported LOCATE(x'64', _utf8mb4'abcdef') failure was unsupported CommonExpressionSegment, but ExpressionConverter.java (line 97) still throws for CommonExpressionSegment. The PR only adds a LOCATE operator whose metadata accepts STRING operands, and the new compiler cases only cover plain character literals. Please either fix binary/hex literal conversion and add the exact regression case, or do not close that part of #31899 yet. > > [P1] MySQL string semantics are incomplete for charset/collation. MySQLLengthFunction.java (line 61) and MySQLBitLengthFunction hardcode UTF-8 byte counting. MySQL documents LENGTH() as byte length of the string and supports character-set introducers such as _latin1 and _binary; LOCATE() is case-sensitive only when an argument is binary. This creates counterexamples such as _latin1'Müller' byte length and nonbinary case-insensitive LOCATE. Please make the implementation charset/collation-aware, or explicitly narrow the supported behavior with tests and reviewer agreement. Sources: [MySQL string functions](https://dev.mysql.com/doc/refman/8.4/en/string-functions.html), [MySQL character set introducers](https://dev.mysql.com/doc/refman/8.4/en/charset-introducer.html). > > [P1] Blast-radius risk across non-MySQL SQL Federation remains unresolved. CompilerContextFactory.java (line 77) always chains new MySQLOperatorTable() before the database-specific Calcite library, so adding LENGTH, BIT_LENGTH, and LOCATE exposes MySQL behavior to PostgreSQL, openGauss, Oracle, and fallback/default contexts. PostgreSQL documents bit_length, octet_length, position, and strpos, not MySQL LOCATE(substr, str, pos) semantics. Please gate these functions to MySQL or add explicit non-target dialect counterexamples proving no semantic regression. Source: [PostgreSQL string functions](https://www.postgresql.org/docs/current/functions-string.html). > > Newly Introduced Issues > > git diff --check reports trailing whitespace in the new files, and the new parameterized tests use display names like length({0}) -> {1} instead of the required "{0}" template. Please clean these before the next revision. > > Next Steps > > Add production-path regression tests for the exact issue SQLs, including _latin1, _utf8mb4, binary/hex literals, column/WHERE usage, and nested function cases. > > Add cross-dialect validation for at least PostgreSQL/openGauss showing MySQL-only functions are not leaked, or restrict MySQLOperatorTable registration to MySQL. > > Re-run scoped verification with dependencies, for example ./mvnw -pl kernel/sql-federation/compiler -am -DskipITs -Dspotless.skip=true -Dtest=SQLStatementCompilerIT,MySQLLengthFunctionTest,MySQLBitLengthFunctionTest,MySQLLocateFunctionTest test. Thanks for the thorough review, all three P1s hold up. On P1.1: I misread #31899 as a single task. It's actually a checkbox list and the `LOCATE(x'64', _utf8mb4'abcdef')` case has its real root cause in `ExpressionConverter.java:97` (CommonExpressionSegment), which this PR doesn't touch. I'll drop `Closes #31899` from the description and leave that case for a separate fix. On P1.2: Acknowledged on charset/collation. Would you prefer (a) narrowing the implementation to ASCII-only with explicit tests and Javadoc, or (b) making it charset-aware? I haven't traced how Calcite surfaces charset/collation info into UDFs yet, so (b) is open-ended for me. On P1.3: I'd like to confirm scope here. `CompilerContextFactory.getOperatorTables()` already prepends `new MySQLOperatorTable()` for every dialect today, but this PR is what makes the leak actually break things, since LENGTH is the first registered function with a real semantic clash against PostgreSQL/openGauss/Oracle. Two options: 1. Restrict registration to the MySQL context inside this PR (touches `CompilerContextFactory`, expands the diff). 2. Drop LENGTH/BIT_LENGTH/LOCATE from this PR entirely and submit an architecture-fix PR first. Which would you prefer? On the nits: trailing whitespace and the parameterized-test display names will be cleaned up in the next revision. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
