Joe McDonnell has posted comments on this change. Change subject: IMPALA-4624: Implement Parquet dictionary filtering ......................................................................
Patch Set 16: (6 comments) Fixed an issue that was causing test failure in TestParquet::test_corrupt_files and addressed Tim's comments. I'm doing a pre review test job to check the changes. http://gerrit.cloudera.org:8080/#/c/5904/14//COMMIT_MSG Commit Message: Line 7: IMPALA-4624: Implement Parquet dictionary filtering > Can you mention the query option in the commit message? Done PS14, Line 12: incides > indices Done http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 704: state_->query_options().parquet_dictionary_filtering; > Can you use MemTracker::MemLimitExceeded() here to construct the status? It Done http://gerrit.cloudera.org:8080/#/c/5904/14/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS14, Line 256: GetDictionary > GetDictionaryDecoder() for consistency with the other APIs? Done Line 865: if (!stream_->ReadBytes(data_size, &data_, &status)) return status; > Not your change, but can we just SkipBytes here? That would make the intent Done http://gerrit.cloudera.org:8080/#/c/5904/14/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: Line 243: public boolean isDeterministic() { > I think we should be careful about the naming and comments here, since this Switched this over to isBuiltinRandom/isBuiltinRandomFn. I agree that it is useful to be specific about what this function can classify. The issue for dictionary filtering is that anything that is non-deterministic should not be applied as a dictionary filter. If we skip a whole row group based on some randomness, that can produce incorrect results. We should think about what we want to ship. If UDFs can be nondeterministic, we should think about disabling dictionary filtering for them. Things like now() are fine, as it is constant over an execution of a query. It would not impact correctness. -- To view, visit http://gerrit.cloudera.org:8080/5904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Matthew Mulder <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
