Csaba Ringhofer created ORC-612:
-----------------------------------

             Summary: Improve CHAR(N)/VARCHAR(N) support in predicate push down
                 Key: ORC-612
                 URL: https://issues.apache.org/jira/browse/ORC-612
             Project: ORC
          Issue Type: Improvement
          Components: C++
            Reporter: Csaba Ringhofer


This came up during the implementation of min/max filters in Apache Impala: 
https://gerrit.cloudera.org/#/c/15403/ by [~norbertluksa]

Impala reads CHAR(N)/VARCHAR(N) the following way:
0. push down min/max predicates (in review)
1. read the value as STRING from ORC
2. truncate the value if it is longer than N or pad it with spaces if the type 
is CHAR and the value is shorter than N
3. evaluate the predicates once all columns are read

It is possible that a value from ORC does not satisfy the predicates before 
truncation/padding, but it does afterwards. For example:
a single column: "s VARCHAR(1)"
a single value in the ORC file : "aa"
a predicate: s="a" .
"aa" does not pass the predicate, but after truncation it becomes "a", which 
passes.
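
A minimal sketch of the truncation/padding step and the example above. The 
function name normalizeToDeclaredType is made up for illustration and is not 
Impala's actual code; lengths are byte lengths.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn a raw STRING value read from ORC into the value
// seen by a CHAR(n)/VARCHAR(n) column. Lengths are counted in bytes.
std::string normalizeToDeclaredType(std::string value, std::size_t n,
                                    bool isChar) {
  if (value.size() > n) {
    value.resize(n);       // longer than N: truncate
  } else if (isChar && value.size() < n) {
    value.resize(n, ' ');  // CHAR shorter than N: pad with spaces
  }
  return value;
}
```

With s VARCHAR(1), the stored "aa" compares unequal to "a", but 
normalizeToDeclaredType("aa", 1, false) returns "a", which is equal.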

Currently it is tricky to push this predicate down, as simply passing s="a" 
would skip the file because min="aa" > "a". What could work is pushing 
s>="a" AND s<"b" instead, as all values that truncate to "a" are >= "a" and 
< "b".
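
The range workaround could be sketched roughly as follows, covering truncation 
only (not CHAR padding). prefixUpperBound and rangeMayMatch are hypothetical 
names, not part of the ORC SARGS interface, and the comparison is bytewise. 
The sketch needs C++17 for std::optional.

```cpp
#include <optional>
#include <string>

// Smallest string that is greater than every string starting with `prefix`.
// Returns nullopt if no such bound exists (empty prefix or only 0xFF bytes).
std::optional<std::string> prefixUpperBound(std::string prefix) {
  while (!prefix.empty()) {
    const unsigned char last = static_cast<unsigned char>(prefix.back());
    if (last != 0xFF) {
      prefix.back() = static_cast<char>(last + 1);  // "a" -> "b"
      return prefix;
    }
    prefix.pop_back();  // a trailing 0xFF cannot be incremented, drop it
  }
  return std::nullopt;
}

// True if data with statistics [min, max] may contain a value in [lo, hi).
// A missing `hi` means the range has no upper bound.
bool rangeMayMatch(const std::string& min, const std::string& max,
                   const std::string& lo,
                   const std::optional<std::string>& hi) {
  return max >= lo && (!hi || min < *hi);
}
```

For the example, the literal "a" gives the range ["a", "b"), and a file with 
min=max="aa" is kept. The rewrite is conservative: it can keep files that 
contain no matching value, but it should not skip one that does.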

It would be much simpler (for us at least) if we could pass the max length and 
min length to the SARGS interface, and it would apply truncation/padding to the 
min/max statistics before comparing them to the literal we provided. So in the 
example above min=max="aa" would become "a", and it would satisfy the pushed 
down s="a".
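
A rough sketch of what this could look like. LengthBounds, applyLengthBounds 
and equalsMayMatch are hypothetical and are not an existing ORC API. The 
sketch assumes that the transformation preserves ordering: this holds for 
truncation, and for space padding only if the values contain no bytes below 
the space character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical per-column info the client would pass along with the
// predicate. VARCHAR(N): minLength = 0, maxLength = N.
// CHAR(N): minLength = maxLength = N. Lengths are byte lengths.
struct LengthBounds {
  std::size_t minLength;
  std::size_t maxLength;
};

// Truncate to maxLength, then pad with spaces up to minLength.
std::string applyLengthBounds(std::string s, const LengthBounds& b) {
  if (s.size() > b.maxLength) s.resize(b.maxLength);
  if (s.size() < b.minLength) s.resize(b.minLength, ' ');
  return s;
}

// Evaluate `column = literal` against min/max statistics after applying
// the same truncation/padding the reader would apply to the values.
bool equalsMayMatch(const std::string& statMin, const std::string& statMax,
                    const std::string& literal, const LengthBounds& b) {
  const std::string lo = applyLengthBounds(statMin, b);
  const std::string hi = applyLengthBounds(statMax, b);
  return lo <= literal && literal <= hi;
}
```

For s VARCHAR(1) with min=max="aa", equalsMayMatch("aa", "aa", "a", 
LengthBounds{0, 1}) returns true, so the file is not skipped.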

Note that Impala doesn't care about encoding, so the length is byte length. 
Other clients may need UTF-8 length instead.

Apart from min/max stats, CHAR/VARCHAR are also problematic for bloom filters - 
in the example above, the hash of "aa" is probably different from the hash of 
"a", so looking up "a" could fail.
Bloom filters could only work if we are sure that there won't be any 
truncation/padding (which is actually quite likely if the schema didn't change, 
as the DB system enforces this during writing). If there were stats about the 
min/max length of strings, then it would be possible to verify this during 
predicate push down and use bloom filters if it is safe.
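
The safety check could be as simple as the sketch below. StringLengthStats is 
hypothetical: it stands for the min/max length statistics proposed here, which 
this sketch assumes would be available per column.

```cpp
#include <cstddef>

// Hypothetical statistics: shortest and longest string (in bytes) in the
// file, stripe or row group.
struct StringLengthStats {
  std::size_t minLength;
  std::size_t maxLength;
};

// A bloom filter lookup for a CHAR(n)/VARCHAR(n) literal is only safe if no
// stored value would be changed by truncation or padding on read.
bool bloomFilterIsSafe(const StringLengthStats& stats, std::size_t n,
                       bool isChar) {
  const bool noTruncation = stats.maxLength <= n;
  const bool noPadding = !isChar || stats.minLength >= n;
  return noTruncation && noPadding;
}
```

In the VARCHAR(1) example the stored value "aa" has length 2, so the check 
fails and the reader would have to skip the bloom filter for that column.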



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
