[Impala-CR](cdh5-trunk) IMPALA-3804: Push per-split filtering into scanners

Sailesh Mukil (Code Review) Tue, 19 Jul 2016 00:30:07 -0700

Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3804: Push per-split filtering into scanners
......................................................................



Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3561/2//COMMIT_MSG
Commit Message:

PS2, Line 23: The Parquet scanner does something
            : different: it processes all splits for one file on the same 
thread, and
            : so marks all those splits as 'complete' very early on.
I think this is an incorrect statement. It only marks all files as "done", i.e. 
'num_unqueued_files_' will be decremented in IssueInitialRanges(); but the 
ranges themselves will be marked as complete only in 
HdfsParquetScanner::Close(). IssueInitialRanges() creates a footer range for 
every split.

The only case where all the ranges are marked as complete inside 
IssueInitialRanges() is when no slots have to be materialized. E.g: a count(*) 
query.

In any case, I think this code will still work fine regardless, because every 
split will not process the split (if the partition is to be filtered out) and 
mark RangeComplete() in Close() (just the same as the text scanner). But we 
need to change this description and other comments.


http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS2, Line 1875: IssueInitialRanges() will mark all ranges *except* the footer 
as complete.
It does this only if there are no slots to be materialized.


http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

PS2, Line 1137: PartitionIsFiltered
nit: This sounded a little confusing to me on first glance. I didn't know if 
we're trying to filter in, or filter out.

I think PartitionIsFilteredOut() would make things much clearer.


PS2, Line 1158: passed_filter
nit: Same here, instead of saying '!passed_filter', 'filtered_out' would be a 
lot easier to follow. As "passed" again doesn't say what exactly it passed, i.e 
passed being filtered in, or filtered out.


http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

PS2, Line 307: true
false


http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

PS2, Line 432: header
footer


PS2, Line 433: process remaining splits in the *same* scanner and thread
Again, I'm not sure how true this is.


PS2, Line 439: CLOSE_ONLY_THIS_RANGE,
             :     CLOSE_NONE
There doesn't seem to be a  distinction between the two. To me, they feel like 
they're almost the same thing.

A thread processing a text scanner's split calls FilterScanRange(CLOSE_NONE), 
but it is in fact closing it's own scan range, i.e. the scan range 
corresponding to that split. So what's the actual difference between them both?

Or to phrase the question in a different way, does calling 
scan_range()->Cancel() once we reach ProcessSplit() have any effect?


-- 
To view, visit http://gerrit.cloudera.org:8080/3561
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9f92178f642695e0e9ef901373a5e9f2878a78ce
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3804: Push per-split filtering into scanners

Reply via email to