Henry Robinson has uploaded a new patch set (#2).

Change subject: IMPALA-3804: Push per-split filtering into scanners
......................................................................

IMPALA-3804: Push per-split filtering into scanners

IMPALA-3798 was a bug that occurred when a header split was filtered out without correctly cancelling all the scan ranges in the rest of the file. To fix this properly, we have to make the scanners aware of per-split filtering, since different scanners need to compensate for a filtered scan in different ways.

For example, sequence-based scanners (such as Avro) only issue most of a file's ranges after the header range has been scanned. Therefore, if a header split is filtered out, all the remaining ranges can be safely marked as complete. If a non-header split is filtered, it may not be safe to mark as complete a split that may be concurrently scanned by a different scanner.

The text scanner issues all ranges at once, so it is only safe to mark the current range as complete.

The Parquet scanner does something different: it processes all splits for one file on the same thread, and so marks all those splits as 'complete' very early on.

This patch adds HdfsScanner::FilterScanRange(), which should be called by ProcessSplit(). FilterScanRange() returns true if the scan range should not be scanned, and accepts a policy parameter that describes what compensation action to perform (close all scan ranges, only the current one, or none).

Testing:
* Added logic to test_sequence_file_filtering_race to check that per-scan filtering was happening correctly, confirming that the rewritten path was taking effect. Expanded the test to hit all scanner types.
* Manually ran the existing runtime filters test suite with file filtering disabled, after rewriting the tests to expect split filtering instead of file filtering. Tests passed.

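For illustration only, the policy-driven compensation described above might be modelled roughly as follows. This is a self-contained toy sketch, not the code in this patch: FilterCompensation, ToyFileState and the split_passes_filters flag are hypothetical names, and the real HdfsScanner::FilterScanRange() signature may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy names; the patch only says the policy selects between
// closing all scan ranges, only the current one, or none.
enum class FilterCompensation {
  kCloseAllRanges,     // e.g. a filtered header split of a sequence-based file
  kCloseCurrentRange,  // e.g. the text scanner, which issues all ranges at once
  kNone                // e.g. Parquet, which marks its splits complete early on
};

// Toy stand-in for the per-file bookkeeping kept by the scan node.
struct ToyFileState {
  std::size_t total_ranges = 0;
  std::size_t completed_ranges = 0;

  void MarkComplete(std::size_t n) {
    // Completing more ranges than the file has would be a bookkeeping bug.
    assert(completed_ranges + n <= total_ranges);
    completed_ranges += n;
  }
  bool Done() const { return completed_ranges == total_ranges; }
};

// Returns true if the caller should skip scanning the current split.
// split_passes_filters stands in for evaluating the runtime filters.
bool FilterScanRange(ToyFileState& file, bool split_passes_filters,
                     FilterCompensation policy) {
  if (split_passes_filters) return false;  // nothing filtered, scan as usual
  switch (policy) {
    case FilterCompensation::kCloseAllRanges:
      file.MarkComplete(file.total_ranges - file.completed_ranges);
      break;
    case FilterCompensation::kCloseCurrentRange:
      file.MarkComplete(1);
      break;
    case FilterCompensation::kNone:
      break;  // the scanner has already accounted for its ranges
  }
  return true;
}
```

In this model a ProcessSplit()-style caller would return early whenever FilterScanRange() returns true, leaving the choice of policy to each scanner type.
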
Change-Id: I9f92178f642695e0e9ef901373a5e9f2878a78ce
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/exec/scanner-context.h
M tests/custom_cluster/test_seq_file_filtering.py
10 files changed, 99 insertions(+), 82 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/61/3561/2
--
To view, visit http://gerrit.cloudera.org:8080/3561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9f92178f642695e0e9ef901373a5e9f2878a78ce
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <[email protected]>
