[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2018-01-10 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
Created another pull request, #1072, to merge my changes with Padma's.


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2018-01-09 Thread priteshm
Github user priteshm commented on the issue:

https://github.com/apache/drill/pull/1001
  
@sachouche can you update this PR?


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-11-16 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1001
  
@priteshm @sachouche This PR needs to be updated on top of changes made for 
DRILL-5899.


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-11-16 Thread priteshm
Github user priteshm commented on the issue:

https://github.com/apache/drill/pull/1001
  
@ppadma @paul-rogers I see that @sachouche addressed the comments in the 
JIRA - is this one ready to merge?


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-23 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
Paul, again thanks for the detailed review:

- I was able to address most of the feedback, with one exception
- I agree that expressions that can operate directly on the encoded UTF-8 
string should ideally perform checks on bytes rather than characters
- That said, such a change is more involved and should be done properly
   o The SqlPatternContainsMatcher currently gets a CharSequence as input
   o We should enhance the expression framework so that matchers can a) 
express their capabilities and b) receive the expected data type (character or 
byte sequences)
   o Note also that there is an impact on the test suite, since StringBuffer 
instances are used to test the matcher functionality directly
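
One way the framework enhancement described above could look, as a rough sketch only. All names here (`MatcherCapabilitySketch`, `CharMatcher`, `ByteMatcher`, `evaluate`) are hypothetical and do not exist in Drill; the assumption is that a matcher declares its capability by the interface it implements, and the caller decodes to characters only when the matcher needs them.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these types are part of Drill.
class MatcherCapabilitySketch {

    // Marker for any pattern matcher.
    interface Matcher { }

    // A matcher that needs decoded characters (for example, a real regex).
    interface CharMatcher extends Matcher {
        boolean match(CharSequence input);
    }

    // A matcher that can work directly on the UTF-8 bytes in [start, end).
    interface ByteMatcher extends Matcher {
        boolean match(byte[] buf, int start, int end);
    }

    // The caller hands each matcher the data type it declared.
    static boolean evaluate(Matcher matcher, byte[] utf8) {
        if (matcher instanceof ByteMatcher) {
            return ((ByteMatcher) matcher).match(utf8, 0, utf8.length);
        }
        // Decode only when the matcher cannot operate on bytes.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return ((CharMatcher) matcher).match(decoded);
    }
}
```

With this shape, tests that currently pass a StringBuffer could keep targeting the character-based interface, while byte-based matchers would need byte-array test inputs.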


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
Paul,

- I think you misunderstood the proposal
- Let me use an example
- select .. c1 like '%pattern1%' OR c1 like '%pattern2'..
- Assume c1 has 3 values [v1, v2, v3]
- The generated code will create a new VarCharHolder instance for each 
value iteration
- For v1: VarCharHolder vc1 is created, ascii-mode computed for pattern1, 
ascii-mode computation reused for pattern2, pattern3, etc since we're 
evaluating the same value
- For v2: VarCharHolder vc2 is created, ascii-mode computed for pattern1, 
ascii-mode computation reused for pattern2, pattern3, etc since we're 
evaluating the same value
- Ditto for v3
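
The per-value reuse described above can be sketched roughly as follows. `AsciiModeSketch` and `ValueHolder` are hypothetical stand-ins, not Drill's generated code or its VarCharHolder; the `scans` counter exists only to show that the byte scan runs once per value, however many patterns are evaluated against it.

```java
// Hypothetical sketch; not Drill's generated code.
class AsciiModeSketch {

    static final class ValueHolder {
        static final int UNKNOWN = 0;
        static final int ASCII = 1;
        static final int NON_ASCII = 2;

        final byte[] bytes;
        int asciiMode = UNKNOWN; // cached per holder, so per value
        int scans = 0;           // illustration only: number of full scans

        ValueHolder(byte[] bytes) {
            this.bytes = bytes;
        }

        // Computes the mode on first use and reuses it afterwards.
        boolean isAscii() {
            if (asciiMode == UNKNOWN) {
                scans++;
                asciiMode = ASCII;
                for (byte b : bytes) {
                    if (b < 0) { // high bit set: part of a multi-byte character
                        asciiMode = NON_ASCII;
                        break;
                    }
                }
            }
            return asciiMode == ASCII;
        }
    }
}
```

Because a fresh holder is created for each value, the cached mode never carries over from one value to the next.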

Note that the test suite already has test cases similar to the ones you are 
proposing; they all passed.


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
- It was 50% for 1) and 50% for 2)
- Note that this breakdown depends on:
   o The number of Contains patterns for the same value (impacts 1))
   o The pattern length (impacts both 1) and 2))


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1001
  
@sachouche, thanks for the first PR to Drill! Thanks for the detailed 
explanation!

Before reviewing the code, a comment on the design:

> Added a new integer variable "asciiMode" ... this value will be set ... 
during the first LIKE evaluation and will be reused across other LIKE 
evaluations

The problem with this design is that there is no guarantee that the first 
value is representative of the other values. Maybe my list looks like this:

```
Hello
你好
```

The first value is ASCII. The second is not. So, we must treat each value 
as independent of the others.

On the other hand, we *can* exploit the nature of UTF-8. The encoding is 
such that no valid UTF-8 character is a prefix of any other valid character. 
Thus, if a character is 0xXX 0xYY 0xZZ, then there can *never* be a valid 
character which is 0xXX 0xYY. As a result, starts-with, ends-with, equals and 
contains can be done without either converting to UTF-16 or even caring if the 
data is ASCII or not.

What does this mean? It means that, for the simple operations:

1. Convert the Java UTF-16 pattern string to UTF-8.
2. Do the classic byte comparison methods for starts-with, ends-with or 
contains. No special processing is needed for multi-byte characters.

Unlike other multi-byte encodings, UTF-8 was designed to make this possible.
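
A minimal sketch of the two steps above. The class and method names (`Utf8ByteMatchSketch`, `startsWith`, `endsWith`, `contains`) are hypothetical and are not Drill's matcher API. The sketch assumes the value is available as a UTF-8 byte range and that the pattern is converted to UTF-8 once, up front. The non-ASCII literal is written with Unicode escapes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Drill.
class Utf8ByteMatchSketch {

    // True if pattern occupies data[offset, offset + pattern.length) and fits before end.
    private static boolean matchesAt(byte[] data, int offset, int end, byte[] pattern) {
        if (offset + pattern.length > end) {
            return false;
        }
        for (int j = 0; j < pattern.length; j++) {
            if (data[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    static boolean startsWith(byte[] data, int start, int end, byte[] pattern) {
        return matchesAt(data, start, end, pattern);
    }

    static boolean endsWith(byte[] data, int start, int end, byte[] pattern) {
        int offset = end - pattern.length;
        return offset >= start && matchesAt(data, offset, end, pattern);
    }

    // Naive byte-wise search; no decoding and no ASCII check.
    static boolean contains(byte[] data, int start, int end, byte[] pattern) {
        for (int i = start; i + pattern.length <= end; i++) {
            if (matchesAt(data, i, end, pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Step 1: convert the Java (UTF-16) strings to UTF-8 once.
        byte[] value = "Hello \u4F60\u597D".getBytes(StandardCharsets.UTF_8);
        byte[] pattern = "\u4F60\u597D".getBytes(StandardCharsets.UTF_8);
        // Step 2: plain byte comparison, multi-byte characters included.
        System.out.println(contains(value, 0, value.length, pattern));
        System.out.println(endsWith(value, 0, value.length, pattern));
    }
}
```

The byte-wise search cannot match in the middle of a character because UTF-8 lead bytes and continuation bytes use disjoint value ranges, so a valid pattern can only line up on character boundaries.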

If we go this route, we would not need the ASCII mode flag.

Note: all of this applies only to the "basic four" operations: if we do a 
real regex, then we must decode the Varchar into a Java UTF-16 string.


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1001
  
@sachouche Do you have a breakdown of how much gain we got with 1 vs. 2? 
Since the changes for 2 are not straightforward or easy to maintain, I am 
weighing the performance gain against the maintainability of the code.


---