Sailesh Mukil has posted comments on this change. Change subject: IMPALA-3804: Push per-split filtering into scanners ......................................................................
Patch Set 2: (8 comments) http://gerrit.cloudera.org:8080/#/c/3561/2//COMMIT_MSG Commit Message: PS2, Line 23: The Parquet scanner does something : different: it processes all splits for one file on the same thread, and : so marks all those splits as 'complete' very early on. I think this is an incorrect statement. It only marks all files as "done", i.e. 'num_unqueued_files_' will be decremented in IssueInitialRanges(); but the ranges themselves will be marked as complete only in HdfsParquetScanner::Close(). IssueInitialRanges() creates a footer range for every split. The only case where all the ranges are marked as complete inside IssueInitialRanges() is when no slots have to be materialized. E.g: a count(*) query. In any case, I think this code will still work fine regardless, because every split will not process the split (if the partition is to be filtered out) and mark RangeComplete() in Close() (just the same as the text scanner). But we need to change this description and other comments. http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: PS2, Line 1875: IssueInitialRanges() will mark all ranges *except* the footer as complete. It does this only if there are no slots to be materialized. http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-scan-node.cc File be/src/exec/hdfs-scan-node.cc: PS2, Line 1137: PartitionIsFiltered nit: This sounded a little confusing to me on first glance. I didn't know if we're trying to filter in, or filter out. I think PartitionIsFilteredOut() would make things much clearer. PS2, Line 1158: passed_filter nit: Same here, instead of saying '!passed_filter', 'filtered_out' would be a lot easier to follow. As "passed" again doesn't say what exactly it passed, i.e passed being filtered in, or filtered out. http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-scan-node.h File be/src/exec/hdfs-scan-node.h: PS2, Line 307: true false http://gerrit.cloudera.org:8080/#/c/3561/2/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: PS2, Line 432: header footer PS2, Line 433: process remaining splits in the *same* scanner and thread Again, I'm not sure how true this is. PS2, Line 439: CLOSE_ONLY_THIS_RANGE, : CLOSE_NONE There doesn't seem to be a distinction between the two. To me, they feel like they're almost the same thing. A thread processing a text scanner's split calls FilterScanRange(CLOSE_NONE), but it is in fact closing it's own scan range, i.e. the scan range corresponding to that split. So what's the actual difference between them both? Or to phrase the question in a different way, does calling scan_range()->Cancel() once we reach ProcessSplit() have any effect? -- To view, visit http://gerrit.cloudera.org:8080/3561 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9f92178f642695e0e9ef901373a5e9f2878a78ce Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Henry Robinson <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-HasComments: Yes
