Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/21825 )
Change subject: IMPALA-889: Add trim() function matching ANSI SQL definition ...................................................................... Patch Set 3: (34 comments) Thanks for review. Fixed/answered all except a couple of niche topics for now: 1. string_expr 2. more tests 3. NULL cases 4. select trim(column1 from str) from t1 I'd appreciate opinions on the entire No-BE approach with phantom class, since BE solution like Extract is viable also (I even have it implemented already). As well as further in-depth on No-Keyword / analysis vs parsing. http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@14 PS2, Line 14: IMPAL > Nit: IMPALA-... Done http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@16 PS2, Line 16: https://github.com/apache/impala/commit/543fa73 > Please do not refer to downstream resources. ok, it's open for public though http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@16 PS2, Line 16: https://github.com/apache/impala/commit/543fa73 > This is an internal link, so it would be better to have a Github link to th Done http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@19 PS2, Line 19: xes: > Since both 'where' and 'string' are keywords, it would be better to put the ok http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@51 PS2, Line 51: . Follow > typo: Although Done http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@51 PS2, Line 51: . Following EXTRACT-FROM, > nit: Although generally, a keyword... ok http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@50 PS2, Line 50: : 3. No keywor > 'TRIM' is already part of the SQL standard. Using it as a keyword would mak It was a reserved word by inheritance from earlier Impala. In current Impala all of the built-in functions are extracted from reserved words. Adding KW_TRIM, which is needed for parsing, would mean introducing a new keyword, which generates syntax error (different from that for a reserved) if used as a column name, etc. http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@52 PS2, Line 52: g > Nit: negative Done http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@52 PS2, Line 52: though g > typo: negative Done http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@55 PS2, Line 55: are specifically removed from the reserved keywords list, but curr > What do you mean exactly? It invalidates statements like "create table t1(trim int)", which would be inconsistent with other built-in functions. http://gerrit.cloudera.org:8080/#/c/21825/2//COMMIT_MSG@64 PS2, Line 64: > Nit: line too long. Done http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/cup/sql-parser.cup@3829 PS2, Line 3829: > We could add an EE test for each of these cases as the "charset" parameter Yes, that's a good idea. Will do in the next patch. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/cup/sql-parser.cup@3861 PS2, Line 3861: | function_name:fn_name LPAREN ident_or_default:i KW_FROM expr:e RPAREN > This syntax rule will also match date_part(year from now()) and create a Tr This is how it was before with the extract being the only function to support FROM. Unnecessary wrapping stems mostly from not using keyword, which comes with it's own problems. See "Analysis vs Parsing" section for more on this. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/cup/sql-parser.cup@3873 PS2, Line 3873: compromising > typo:compromising Done http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/cup/sql-parser.cup@3873 PS2, Line 3873: compromising syntax like > Could you elaborate on it, I don't completely understand the example. Maybe To be honest I simply caught tests failing on this, and then realized that the string_expr change introduced in https://gerrit.cloudera.org/#/c/4474 is exactly for this. Since this goes quite deep into the intricacies of parsing process, I took it as is from that review for now as a relatively minor technical detail. I see string_expr as a way to filter out (although not exhaustively) expr types which are not suitable for trim context on parsing stage, which is a good thing on it's own. The fact that it fixes some corner cases makes it only better. But as you've suggested in other comment, I think adding more tests for each type would be the way to go to cover all of the cases. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/cup/sql-parser.cup@3873 PS2, Line 3873: string_expr is needed here > This is also true for the version above, isn't it? Then it would be better I didn't catch similar problems with other syntaxes. It's more like: once string_expr needs to be introduced for this syntax anyway, why not use it for the others as well. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java File fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java: http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@19 PS2, Line 19: : import com.google.common.collect.Lists; : : import org.apache.impala.catalog.BuiltinsDb; > unused imports Removed, thanks. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@35 PS2, Line 35: > nit: the Done http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@41 PS2, Line 41: h > nit: a Done http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@45 PS2, Line 45: > This field shadows the field with the same name from the base class. We cou Agree, not a good practice. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@45 PS2, Line 45: // Trim option set. > We can reuse the fnName_ from FunctionCallExpr fnName_ here is "trim" or "utf8_trim", representing a phantom trim-from function. In FunctionCallExpr it resolves to a real ltrim/rtrim/btrim. Later each one goes through it's own analysis. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@48 PS2, Line 48: TRAILING, > could be srcExpr_ Indeed. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@51 PS2, Line 51: > This could instead be an enum. Agree, I tried to stay consistent with the twin ExtractFromExpr, but I think enum would do here. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@60 PS2, Line 60: > Nit: should start with lower-case letter. Fixed http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@69 PS2, Line 69: this(fnName, where, null, srcexpr); > Can't we raise an exception here? Technically we can of course, but conceptually this would mean a ParseException which is not the best idea to throw from a regular analysis/*.java file. If we were to check this on parsing stage, Imho throwing here from constructor wouldn't be the best place for that. As a better option, this check could be done whether from within the .cup's {} clause (there is a "parseError" routine for that) or by adding "VerifyTrimOption" somewhere in .java files and calling it from .cup. But checking "where" argument on parsing stage would assume doing the same for function name even prior to that, as well as reconsider the entire order of later analysis checks. And the same for ExtractFromExpr and potentially other places. There is a certain order in analysis hierarchy of functions, according to which different aspects of a statement are being analyzed so that in case of multiple problems, error is generated about the most important one. Arbitrarily breaking this order is not trivial change and eventually leads to a global question on what is a parsing error and what is an analysis error, which I doubt we want to be raising in scope of this issue. Please let me know if you have a better idea. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@98 PS2, Line 98: } > We should also override clone(), otherwise this copy constructor won't be u @Override seems to result in error for constructors. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@111 PS2, Line 111: lysisEx > It should be "trim()", shouldn't it? Fixed, thanks. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@113 PS2, Line 113: > This could be extracted into a variable. Since fnName_ is now renamed, it's perhaps more clear to access via modified name, and not to introduce another one. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@125 PS2, Line 125: .map(option -> "`" + option.name() + "`") > This error should be caught during parsing, then TrimOption could be an enu Answered in related threads. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@139 PS2, Line 139: null && srcExpr_.getType() != Type.STRING) { : if (errMsg.length() > 0) errMsg.append(" "); > This error message template could be extracted as a format string and reuse Replied in the twin comment. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@143 PS2, Line 143: } > If the block body firs on one line, we omit the braces in Impala. Agree, this could be perhaps specified in project's .clang-format http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/main/java/org/apache/impala/analysis/TrimFromExpr.java@144 PS2, Line 144: turn errMsg.toString(); : } > This could be extracted into a function that could also be used on L139. I would keep it as is just for two calls if you don't mind. Conceptually, each parameter may require it's own check, it's more of a coincidence that these two are so similar. http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java: http://gerrit.cloudera.org:8080/#/c/21825/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java@1927 PS2, Line 1927: // Special cases for TRIM in function call > You can also add parser tests for the syntax in 'ParserTest.java'. Ack http://gerrit.cloudera.org:8080/#/c/21825/2/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test File testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test: http://gerrit.cloudera.org:8080/#/c/21825/2/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test@398 PS2, Line 398: ---- QUERY > We could have some EE tests with utf8_trim() too. Will add. -- To view, visit http://gerrit.cloudera.org:8080/21825 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3c4fa6d0d8d0684c4b6d8dac8fd531d205e4f7b4 Gerrit-Change-Number: 21825 Gerrit-PatchSet: 3 Gerrit-Owner: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Comment-Date: Sat, 21 Sep 2024 12:29:26 +0000 Gerrit-HasComments: Yes
