[
https://issues.apache.org/jira/browse/PARQUET-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689543#comment-17689543
]
ASF GitHub Bot commented on PARQUET-2245:
-----------------------------------------
wgtmac commented on code in PR #1029:
URL: https://github.com/apache/parquet-mr/pull/1029#discussion_r1108041382
##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java:
##########
@@ -187,10 +196,7 @@ public <T extends Comparable<T>> Boolean visit(NotEq<T> notEq) {
try {
Set<T> dictSet = expandDictionary(meta);
- boolean mayContainNull = (meta.getStatistics() == null
- || !meta.getStatistics().isNumNullsSet()
- || meta.getStatistics().getNumNulls() > 0);
- if (dictSet != null && dictSet.size() == 1 && dictSet.contains(value) && !mayContainNull) {
+ if (dictSet != null && dictSet.size() == 1 && dictSet.contains(value)) {
Review Comment:
I just noticed that `FilterPredicate` does not provide an entry for `IS NULL`
or `IS NOT NULL`. This confuses me because `col IS NOT NULL` is not equivalent
to `col != NULL`.
Correct me if I'm wrong, but `col NOT EQ A` has two possible meanings:
- If A is NULL, it should return an empty list, because NULL cannot be
compared to any value, including another NULL.
- Otherwise, it should return a list of values excluding A and NULL.
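The two cases above follow SQL-style three-valued logic. A minimal sketch of that reading, assuming a column of nullable integers, is shown here. The class `SqlNotEqSketch` and its methods are hypothetical names used for illustration only; they are not part of parquet-mr and do not describe how `FilterPredicate` actually treats null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of SQL-style NOT EQ; not parquet-mr code.
final class SqlNotEqSketch {

  // Returns TRUE or FALSE, or null to model SQL UNKNOWN.
  static Boolean notEq(Integer columnValue, Integer literal) {
    if (columnValue == null || literal == null) {
      return null; // any comparison involving NULL is UNKNOWN
    }
    return !columnValue.equals(literal);
  }

  // A WHERE clause keeps a row only when the predicate is TRUE.
  static List<Integer> filterNotEq(List<Integer> column, Integer literal) {
    List<Integer> kept = new ArrayList<>();
    for (Integer v : column) {
      if (Boolean.TRUE.equals(notEq(v, literal))) {
        kept.add(v);
      }
    }
    return kept;
  }
}
```

Under this reading, filtering with a NULL literal keeps no rows, and filtering with a non-null literal drops both the matching values and the NULLs.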
cc @huaxingao @gszadovszky @shangxinli
> Improve dictionary filter evaluating notEq
> ------------------------------------------
>
> Key: PARQUET-2245
> URL: https://issues.apache.org/jira/browse/PARQUET-2245
> Project: Parquet
> Issue Type: Improvement
> Reporter: Yujiang Zhong
> Priority: Minor
>
> When evaluating `notEq`, if the column may contain nulls and the `notEq`
> value is non-null, the row group must not be skipped. In such a scenario,
> reading the dictionary and comparing values is unnecessary.
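The short-circuit described in the issue can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `DictionaryFilter` code: `NotEqDropSketch` and `canDrop` are hypothetical names, the dictionary is modeled as a lazy `Supplier`, and the conservative handling of a null `value` is an assumption made only to keep the sketch total.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of the notEq short-circuit; not parquet-mr code.
final class NotEqDropSketch {

  // Returns true when the row group can be skipped for `col notEq value`.
  static <T> boolean canDrop(T value, boolean mayContainNull, Supplier<Set<T>> dictionary) {
    if (value == null) {
      return false; // assumption: stay conservative, never skip
    }
    if (mayContainNull) {
      return false; // a null row may match, so the dictionary is never read
    }
    Set<T> dictSet = dictionary.get(); // only now pay for the dictionary read
    return dictSet != null && dictSet.size() == 1 && dictSet.contains(value);
  }
}
```

Checking the null statistics first means the potentially expensive dictionary read happens only when it can change the outcome.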
--
This message was sent by Atlassian Jira
(v8.20.10#820010)