Csaba Ringhofer created ORC-612:
-----------------------------------
Summary: Improve CHAR(N)/VARCHAR(N) support in predicate push down
Key: ORC-612
URL: https://issues.apache.org/jira/browse/ORC-612
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Csaba Ringhofer
This came up during the implementation of min/max filters in Apache Impala:
https://gerrit.cloudera.org/#/c/15403/ by [~norbertluksa]
Impala reads CHAR(N)/VARCHAR(N) the following way:
0. push down min/max predicates (on review)
1. read the value as STRING from ORC
2. truncate the value if it is longer than N, or pad it with spaces if the type
is CHAR and the value is shorter than N
3. evaluate the predicates once all columns are read
It is possible that a value from ORC does not satisfy the predicates before
truncation/padding, but it does afterwards. For example:
a single column: "s VARCHAR(1)"
a single value in the ORC file : "aa"
a predicate: s="a" .
"aa" does not pass the predicate, but after truncation it becomes "a", which
passes.
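The read path and the example above can be sketched as follows. This is a minimal illustration rather than Impala's actual code: `fitToDeclaredLength`, `matchesRaw` and `matchesAfterFit` are hypothetical names, and lengths are counted in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper mirroring step 2: truncate to n bytes, and for CHAR(n)
// pad shorter values with spaces up to n bytes.
std::string fitToDeclaredLength(std::string value, std::size_t n, bool isChar) {
  assert(n > 0);  // assumption: the declared length is positive
  if (value.size() > n) {
    value.resize(n);       // truncate
  } else if (isChar && value.size() < n) {
    value.resize(n, ' ');  // pad with spaces
  }
  return value;
}

// Equality predicate applied to the value exactly as stored in the file.
bool matchesRaw(const std::string& stored, const std::string& literal) {
  return stored == literal;
}

// Equality predicate applied after truncation/padding (step 3).
bool matchesAfterFit(const std::string& stored, const std::string& literal,
                     std::size_t n, bool isChar) {
  return fitToDeclaredLength(stored, n, isChar) == literal;
}
```

For the VARCHAR(1) example, `matchesRaw("aa", "a")` is false while `matchesAfterFit("aa", "a", 1, false)` is true.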
Currently it is tricky to push this predicate down, as simply passing s="a"
would skip the file because min="aa" > "a". (What could work is pushing
s>="a" AND s<"b" instead, as all values that can be truncated to "a" are
>= "a" and < "b".)
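A rough sketch of that workaround for VARCHAR, assuming bytewise string comparison: the equality literal is rewritten into a half-open prefix range, and the range is checked against the min/max statistics. `PrefixRange`, `equalityToPrefixRange` and `rangeMayMatch` are hypothetical names and not part of the ORC or Impala APIs.

```cpp
#include <cassert>
#include <string>

// Hypothetical half-open range [lower, upper); hasUpper is false when no
// finite upper bound exists (empty literal or only 0xFF bytes).
struct PrefixRange {
  std::string lower;
  std::string upper;
  bool hasUpper;
};

// Rewrites s = literal into literal <= s < upper, where upper is the smallest
// string greater than every string that starts with literal.
PrefixRange equalityToPrefixRange(const std::string& literal) {
  PrefixRange range{literal, literal, false};
  while (!range.upper.empty()) {
    unsigned char last = static_cast<unsigned char>(range.upper.back());
    if (last != 0xFF) {
      range.upper.back() = static_cast<char>(last + 1);  // "a" becomes "b"
      range.hasUpper = true;
      return range;
    }
    range.upper.pop_back();  // 0xFF cannot be incremented, carry to the left
  }
  return range;
}

// True if a file or stripe whose statistics are [min, max] may contain a
// value inside the range, so it must not be skipped.
bool rangeMayMatch(const std::string& min, const std::string& max,
                   const PrefixRange& range) {
  assert(min <= max);  // sanity check on the statistics
  return max >= range.lower && (!range.hasUpper || min < range.upper);
}
```

With min=max="aa", `rangeMayMatch("aa", "aa", equalityToPrefixRange("a"))` keeps the file, whereas a plain equality check against "a" would skip it.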
It would be much simpler (for us at least) if we could pass the max length and
min length to the SARGS interface, and it would apply truncation/padding to the
min/max statistics before comparing them to the literal we provided. So in the
example above min=max="aa" would become "a", and it would satisfy the pushed
down s="a".
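The proposed behaviour could look roughly like the sketch below. It is not an existing SARGS API: `StringStats`, `fitStat` and `equalsMayMatch` are made-up names, and the length is a byte length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical container for the min/max statistics of a string column.
struct StringStats {
  std::string min;
  std::string max;
};

// Applies the same truncation/padding to a statistic that the reader would
// apply to the values themselves.
std::string fitStat(std::string v, std::size_t n, bool isChar) {
  if (v.size() > n) {
    v.resize(n);
  } else if (isChar && v.size() < n) {
    v.resize(n, ' ');
  }
  return v;
}

// Evaluates s = literal against the fitted statistics.
// Assumption: fitting preserves ordering. Truncation does; space padding does
// only if the stored values contain no bytes below 0x20.
bool equalsMayMatch(const StringStats& stats, const std::string& literal,
                    std::size_t n, bool isChar) {
  assert(stats.min <= stats.max);  // sanity check on the statistics
  const std::string lo = fitStat(stats.min, n, isChar);
  const std::string hi = fitStat(stats.max, n, isChar);
  return lo <= literal && literal <= hi;
}
```

For the example, `equalsMayMatch({"aa", "aa"}, "a", 1, false)` returns true, because both statistics are truncated to "a" before the comparison.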
Note that Impala doesn't care about encoding, so the length is byte length.
Other clients may need UTF-8 length instead.
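To illustrate the difference, here is a small sketch of truncation by UTF-8 code points instead of bytes. `truncateUtf8` is a hypothetical helper, and it assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Keeps at most maxCodePoints UTF-8 code points of s.
// Assumption: s is valid UTF-8; continuation bytes have the form 10xxxxxx.
std::string truncateUtf8(const std::string& s, std::size_t maxCodePoints) {
  assert(maxCodePoints > 0);  // assumption: the declared length is positive
  std::size_t count = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    const bool startsCodePoint =
        (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
    if (startsCodePoint) {
      if (count == maxCodePoints) {
        return s.substr(0, i);  // cut right before the next code point
      }
      ++count;
    }
  }
  return s;
}
```

With the 6-byte string "h\xC3\xA9llo" ("héllo"), `truncateUtf8(s, 2)` keeps the three bytes of "hé", while a byte-length truncation to 2 would cut the two-byte character in half.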
Apart from min/max stats, CHAR/VARCHAR are also problematic for bloom filters -
in the example above, "aa"'s hash is probably different than "a"'s, so looking
up "a" could fail.
Bloom filters could only work if we are sure that there won't be any
truncation/padding (which is actually quite likely if the schema didn't change,
as the DB system enforces this during writing). If there were stats about the
min/max length of strings, then it would be possible to verify this during
predicate push down and use bloom filters if it is safe.
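A sketch of such a check, assuming length statistics were available. The `LengthStats` struct and `bloomFilterIsSafe` are hypothetical and only illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-column statistics: shortest and longest stored string.
struct LengthStats {
  std::size_t minLen;
  std::size_t maxLen;
};

// A bloom filter lookup is only trustworthy if the reader returns every
// stored value unchanged, i.e. nothing is truncated and nothing is padded.
bool bloomFilterIsSafe(const LengthStats& stats, std::size_t n, bool isChar) {
  assert(stats.minLen <= stats.maxLen);  // sanity check on the statistics
  const bool noTruncation = stats.maxLen <= n;
  const bool noPadding = !isChar || stats.minLen >= n;  // padding: CHAR only
  return noTruncation && noPadding;
}
```

In the VARCHAR(1) example the stored value "aa" has length 2, so `bloomFilterIsSafe({2, 2}, 1, false)` is false and the reader would have to ignore the bloom filter for this column.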