Alex Behm has posted comments on this change. Change subject: IMPALA-4729: Implement REPLACE() ......................................................................
Patch Set 2: (10 comments) http://gerrit.cloudera.org:8080/#/c/5776/2//COMMIT_MSG Commit Message: Line 11: since function names act as bare identifiers. Can you also run a basic perf test to compare the speed of regex_replace() vs. replace()? The motivation for replace() is to be significantly faster in simple cases than the next best alternative. You could try something along the lines of: select count(replace(l_comment, " ", "")) from tpch_parquet.lineitem; but probably with a larger lineitem dataset. Please reach out if you need help creating a larger dataset. Might also be interesting to see how a stupid version with std::replace() would compare. http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: Line 1954: TestStringValue("replace('abc', 'abcd', 'a')", "abc"); suggest these additional tests: 1. nonsense replace replace("abcdefg", "z", "abcdefg"); 2. empty input string replace("", "z", ""); 3. input string is NULL, rest non-NULL TestStringValue("replace(NULL, 'abc', 'a')", "abc"); 4. output string is bigger than the input string TestStringValue("replace('aaa', 'a', 'aa')", "aaaaaa"); 5. add test with exactly 16 matches and some trailing chars in the input that are non-matches http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: Line 210: if (pattern.len == 0) return str; pattern can be null as well Line 218: const int kMemoizedMatches = 16; style: k_memoized_matches or max_memoized_matches add comment: why this memoization? why only 16? Line 225: if (match_pos_in_substring < 0) single line Line 237: const int rlen = num_matches * (replace.len - needle.len) + str.len; This could overflow and lead to writing past the result buffer. Suggest using a larger result type and add a test. The SQL function "repeat" might come in handy. Line 249: DCHECK(match_pos_in_substring >= 0); Must there really always be a match here? (also DCHECK_GE) Line 253: int bytes = match_pos - consumed; non_match_len? Line 266: memcpy(ptr, &haystack.ptr[match_pos], str.len - consumed); // Copy trailing non-matches. http://gerrit.cloudera.org:8080/#/c/5776/2/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: Line 3050: ident_or_restricted ::= Agree about the issues. Why not handle the REPLACE case like IF and TRUNCATE? Have a look at the non_pred_expr production. -- To view, visit http://gerrit.cloudera.org:8080/5776 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-HasComments: Yes
