Zoltan Ivanfi has posted comments on this change.

Change subject: IMPALA-3973: add position and occurrence to instr()
......................................................................


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 289:   StringValue str_sv = StringValue::FromStringVal(str);
             :   StringValue substr_sv = StringValue::FromStringVal(substr);
             :   StringSearch search(&substr_sv);
             :   // Hive returns positions starting from 1.
             :   return IntVal(search.Search(&str_sv) + 1);
             : }
> I think it makes sense to unify this by using the new method. If there's bu
Done


PS5, Line 329: n + start_pos
> I think we could just remove the comment. The next line by itself explains 
Done


http://gerrit.cloudera.org:8080/#/c/4094/6/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS6, Line 311: search_start_pos < 0
> remove- this is not possible since we know it's not 0 from l302 and we know
Indeed, sorry - in an earlier version of my code this check was done in a place 
common to the regular and reverse search.


PS6, Line 311: {
             :       return IntVal(0);
> this will be short enough to return on the same line
Done


PS6, Line 318:       if (match_pos_in_substring == -1) {
             :         return IntVal(0);
             :       }
> 1 line
Done


PS6, Line 322: search_start_pos
> I think you also need to make sure search_start_pos is < haystack.len befor
I don't think it's necessary. I check the starting position before the loop, so 
we start with a non-negative length for the Substring call. Then in each 
iteration of the loop, one of two things can happen:
1. There is no match in the substring, in which case the function returns 
immediately.
2. There is a match, in which case the new starting position is one char after 
the match position. Because the match position must be inside the string, 
match_pos < haystack.len, therefore search_start_pos = match_pos + 1 <= 
haystack.len (actually it will be strictly less, because the length of the 
match must be > 0).


PS6, Line 330:  search_start_pos >= str.len
> This isn't possible, str.len = haystack.len, and we know we added something
Indeed, sorry - in an earlier version of my code this check was done in a place 
common to the regular and reverse search.


PS6, Line 330:     if (search_start_pos < 0 || search_start_pos >= str.len) {
             :       return IntVal(0);
             :     }
> remove
We still need the < 0 check.


PS6, Line 336: search_start_pos + needle.len
> why do you add the needle len?
Because the position of the match is the index of its leftmost char, but we 
need to include the match itself in the substring where we search. For example, 
if we are searching for "234" in "0123456" and we want to find matches where 
the match index is <= 2, the substring "012" is not enough, we have to search 
in "01234".


PS6, Line 338:       if (match_pos == -1) {
             :         return IntVal(0);
             :       }
> 1 line
Done


http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/exprs/string-functions.h
File be/src/exprs/string-functions.h:

PS5, Line 64: context
> nit: remove to match the surrounding code.
Done


http://gerrit.cloudera.org:8080/#/c/4094/5/be/src/runtime/string-search-test.cc
File be/src/runtime/string-search-test.cc:

Line 28:   return StringValue(const_cast<char*>(str), len);
> You're right. Unfortunately the style guide wasn't clear in this regard. Wh
Done


PS5, Line 69: ECT_EQ(-1, TestSearch("abca
> I see, let's fix it here then at least. Please rephrase to "Test searching 
Done


PS5, Line 70: 
> Ok, makes sense to me. Can you add this as a comment to the factory methods
Done


Line 74:   EXPECT_EQ(-1, TestSearch("abcdabaaaaa", "abaaaaaa", 4, 3));
> Same here
Done


Line 81:   EXPECT_EQ(2, TestSearch("abcdede", "cde"));
> How about "Tests with needle at the end of the haystack." then? Doesn't mak
Done


http://gerrit.cloudera.org:8080/#/c/4094/6/be/src/runtime/string-search.h
File be/src/runtime/string-search.h:

PS6, Line 102:     for (int i = 0; i < mlast; ++i) {
             :       BloomAdd(pattern_->ptr[i]);
             :       if (pattern_->ptr[i] == pattern_->ptr[mlast])
             :         skip_ = mlast - i - 1;
             :     }
             :     BloomAdd(pattern_->ptr[mlast]);
> I don't think this will always work for RSearch, see https://hg.python.org/
Nice catch, thanks. I think we may be able to do better than the Python 
implementation here. That one iterates in reverse indeed, and for this reason 
they moved this part immediately before the search, since the direction is not 
known during construction. However, since the search may be reused for 
searching repeatedly, it would be better to build the bloom filter only once.

The contents of the bloom filter do not depend on the order of adding elements 
to it. The only difference between iterating forwards vs. backwards is the 
value of skip_. It will always depend on the last position (in the order of 
iterating) where the pattern contains the same char as the last char of the 
pattern (again, in the order of iterating). This means that in a single loop we 
can populate the bloom filter and save two different skip_ value - one for 
searching forwards and one for searching backwards.

I will create a test that catches this error then modify the code the way I 
described above.


-- 
To view, visit http://gerrit.cloudera.org:8080/4094
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zoltan Ivanfi <[email protected]>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to