Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................


Patch Set 16:

(6 comments)

Fixed an issue that was causing test failure in TestParquet::test_corrupt_files 
and addressed Tim's comments.

I'm doing a pre review test job to check the changes.

http://gerrit.cloudera.org:8080/#/c/5904/14//COMMIT_MSG
Commit Message:

Line 7: IMPALA-4624: Implement Parquet dictionary filtering
> Can you mention the query option in the commit message?
Done


PS14, Line 12: incides
> indices
Done


http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 704:       state_->query_options().parquet_dictionary_filtering;
> Can you use MemTracker::MemLimitExceeded() here to construct the status? It
Done


http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS14, Line 256: GetDictionary
> GetDictionaryDecoder() for consistency with the other APIs?
Done


Line 865:     if (!stream_->ReadBytes(data_size, &data_, &status)) return 
status;
> Not your change, but can we just SkipBytes here? That would make the intent
Done


http://gerrit.cloudera.org:8080/#/c/5904/14/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 243:   public boolean isDeterministic() {
> I think we should be careful about the naming and comments here, since this
Switched this over to isBuiltinRandom/isBuiltinRandomFn. I agree that it is 
useful to be specific about what this function can classify.

The issue for dictionary filtering is that anything that is non-deterministic 
should not be applied as a dictionary filter. If we skip a whole row group 
based on some randomness, that can produce incorrect results. 

We should think about what we want to ship. If UDFs can be nondeterministic, we 
should think about disabling dictionary filtering for them.

Things like now() are fine, as it is constant over an execution of a query. It 
would not impact correctness.


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Mulder <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to