[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Alex Behm (Code Review) Mon, 23 Jan 2017 22:39:06 -0800

Alex Behm has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................



Patch Set 2:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/5776/2//COMMIT_MSG
Commit Message:

Line 11: since function names act as bare identifiers.
Can you also run a basic perf test to compare the speed of regex_replace() vs. 
replace()? The motivation for replace() is to be significantly faster in simple 
cases than the next best alternative. You could try something along the lines 
of:

select count(replace(l_comment, " ", "")) from tpch_parquet.lineitem;

but probably with a larger lineitem dataset. Please reach out if you need help 
creating a larger dataset.

Might also be interesting to see how a stupid version with std::replace() would 
compare.


http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

Line 1954:   TestStringValue("replace('abc', 'abcd', 'a')", "abc");
suggest these additional tests:
1. nonsense replace
replace("abcdefg", "z", "abcdefg");
2. empty input string
replace("", "z", "");
3. input string is NULL, rest non-NULL
TestStringValue("replace(NULL, 'abc', 'a')", "abc");
4. output string is bigger than the input string
TestStringValue("replace('aaa', 'a', 'aa')", "aaaaaa");
5. add test with exactly 16 matches and some trailing chars in the input that 
are non-matches


http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 210:   if (pattern.len == 0) return str;
pattern can be null as well


Line 218:   const int kMemoizedMatches = 16;
style: k_memoized_matches or max_memoized_matches

add comment: why this memoization? why only 16?


Line 225:       if (match_pos_in_substring < 0)
single line


Line 237:   const int rlen = num_matches * (replace.len - needle.len) + str.len;
This could overflow and lead to writing past the result buffer.

Suggest using a larger result type and add a test. The SQL function "repeat" 
might come in handy.


Line 249:       DCHECK(match_pos_in_substring >= 0);
Must there really always be a match here?

(also DCHECK_GE)


Line 253:     int bytes = match_pos - consumed;
non_match_len?


Line 266:   memcpy(ptr, &haystack.ptr[match_pos], str.len - consumed);
// Copy trailing non-matches.


http://gerrit.cloudera.org:8080/#/c/5776/2/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

Line 3050: ident_or_restricted ::=
Agree about the issues.

Why not handle the REPLACE case like IF and TRUNCATE? Have a look at the 
non_pred_expr production.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Reply via email to