[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001 Created another pull request, #1072, to merge my changes with Padma's. ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user priteshm commented on the issue: https://github.com/apache/drill/pull/1001 @sachouche can you update this PR? ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1001 @priteshm @sachouche This PR needs to be updated on top of changes made for DRILL-5899. ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user priteshm commented on the issue: https://github.com/apache/drill/pull/1001 @ppadma @paul-rogers I see that @sachouche addressed the comments in the JIRA - is this one ready to merge? ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001

Paul, again thanks for the detailed review:
- I was able to address most of the feedback except for one item
- I agree that expressions that can operate directly on the encoded UTF-8 string should ideally perform checks on bytes and not characters
- Having said that, such a change is more involved and should be done properly
  o The SqlPatternContainsMatcher currently gets a CharSequence as input
  o We should enhance the expression framework so that matchers can a) express their capabilities and b) receive the expected data type (character or byte sequences)
  o Note also that there is an impact on the test suite, since StringBuffer instances are used to test the matcher functionality directly
---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001

Paul,
- I think you misunderstood the proposal
- Let me use an example
- select .. c1 like '%pattern1%' OR c1 like '%pattern2'..
- Assume c1 has 3 values [v1, v2, v3]
- The generated code will create a new VarCharHolder instance for each value iteration
- For v1: VarCharHolder vc1 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc. since we're evaluating the same value
- For v2: VarCharHolder vc1 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc. since we're evaluating the same value
- Ditto for v3

Note that the test suite has test cases similar to the ones you are proposing; they all passed.
---
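To make the per-value reuse described above concrete, here is a minimal sketch in Java. It is not Drill's generated code: `AsciiModeSketch`, `Holder`, the `scans` counter, and the mode constants are hypothetical names introduced only for illustration. The one assumption taken from the thread is that an integer ascii-mode flag lives on the per-value holder, is computed on the first LIKE evaluation for that value, and is reused by later patterns evaluated against the same value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; these names are not part of Drill.
class AsciiModeSketch {
    static final int UNKNOWN = 0;
    static final int ASCII = 1;
    static final int NON_ASCII = 2;

    // Stand-in for a per-value holder: one instance is created per column value.
    static final class Holder {
        final byte[] bytes;
        int asciiMode = UNKNOWN; // cached for this value only
        int scans = 0;           // counts full scans, for illustration

        Holder(String value) {
            this.bytes = value.getBytes(StandardCharsets.UTF_8);
        }

        // Scans the bytes at most once per holder, then reuses the cached result.
        boolean isAscii() {
            if (asciiMode == UNKNOWN) {
                scans++;
                asciiMode = ASCII;
                for (byte b : bytes) {
                    if (b < 0) { // high bit set: part of a multi-byte sequence
                        asciiMode = NON_ASCII;
                        break;
                    }
                }
            }
            return asciiMode == ASCII;
        }
    }

    // A "contains" check that picks a path based on the cached flag.
    static boolean contains(Holder value, String pattern) {
        if (!value.isAscii()) {
            // Slow path: decode to a Java string and compare characters.
            return new String(value.bytes, StandardCharsets.UTF_8).contains(pattern);
        }
        // Fast path: every byte of the value is exactly one character.
        byte[] p = pattern.getBytes(StandardCharsets.UTF_8);
        for (int start = 0; start + p.length <= value.bytes.length; start++) {
            int i = 0;
            while (i < p.length && value.bytes[start + i] == p[i]) {
                i++;
            }
            if (i == p.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"hello world", "Hello 你好", "abc"}) {
            Holder holder = new Holder(v); // new holder for each value
            boolean matched = contains(holder, "lo") || contains(holder, "你好");
            System.out.println(v + " matched=" + matched + " scans=" + holder.scans);
        }
    }
}
```

Because the flag belongs to the holder and a new holder is created per value, an ASCII first value does not influence how a later non-ASCII value is handled in this sketch.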
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001

- It was 50% for 1) and 50% for 2)
- Notice this breakdown depends on
  o The number of Contains patterns for the same value (impacts 1))
  o The pattern length (impacts both 1) and 2))
---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1001

@sachouche, thanks for the first PR to Drill! Thanks for the detailed explanation!

Before reviewing the code, a comment on the design:

> Added a new integer variable "asciiMode" ... this value will be set ... during the first LIKE evaluation and will be reused across other LIKE evaluations

The problem with this design is that there is no guarantee that the first value is representative of the other columns. Maybe my list looks like this:

```
Hello
你好
```

The first value is ASCII. The second is not. So, we must treat each value as independent of the others.

On the other hand, we *can* exploit the nature of UTF-8. The encoding is such that no valid UTF-8 character is a prefix of any other valid character. Thus, if a character is 0xXX 0xYY 0xZZ, then there can *never* be a valid character which is 0xXX 0xYY. As a result, starts-with, ends-with, equals and contains can be done without either converting to UTF-16 or even caring if the data is ASCII or not.

What does this mean? It means that, for the simple operations:

1. Convert the Java UTF-16 string to UTF-8.
2. Do the classic byte comparison methods for starts with, ends with or contains. No special processing is needed for multi-byte characters.

Unlike other multi-byte encodings, UTF-8 was designed to make this possible.

If we go this route, we would not need the ASCII mode flag.

Note: all of this applies only to the "basic four" operations: if we do a real regex, then we must decode the Varchar into a Java UTF-16 string.
---
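The two steps in the comment above can be sketched roughly as follows. This is an illustration only, not code from the PR or from Drill: the class `Utf8ByteMatch` and its methods are hypothetical, and plain `byte[]` arrays stand in for whatever buffer type the real matchers receive. It assumes both the value and the pattern are valid UTF-8, in which case a byte-for-byte match always falls on character boundaries, because UTF-8 continuation bytes can never be mistaken for lead bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; these names are not part of Drill.
class Utf8ByteMatch {

    // Step 1: convert the Java (UTF-16) pattern string to UTF-8 bytes, once.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if pattern occurs in value starting exactly at offset.
    static boolean regionMatches(byte[] value, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > value.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (value[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Step 2: classic byte comparisons, with no ASCII check and no decoding.
    static boolean startsWith(byte[] value, byte[] pattern) {
        return regionMatches(value, 0, pattern);
    }

    static boolean endsWith(byte[] value, byte[] pattern) {
        return regionMatches(value, value.length - pattern.length, pattern);
    }

    static boolean contains(byte[] value, byte[] pattern) {
        // Naive scan; a real implementation might use a faster search algorithm.
        for (int start = 0; start + pattern.length <= value.length; start++) {
            if (regionMatches(value, start, pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] value = utf8("Hello 你好");
        System.out.println(startsWith(value, utf8("Hello"))); // true
        System.out.println(endsWith(value, utf8("好")));       // true
        System.out.println(contains(value, utf8("你好")));     // true
        System.out.println(contains(value, utf8("ä")));       // false
    }
}
```

Equality is the degenerate case: the two byte arrays have the same length and `regionMatches(value, 0, pattern)` holds.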
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1001 @sachouche Do you have a breakdown of how much gain we got with 1 vs. 2? Since the changes for 2 are not straightforward or easy to maintain, I am weighing the performance gain against the maintainability of the code. ---