Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/11517 )
Change subject: [WIP] IMPALA-6932: Speed up scans for sequence datasets with many files
......................................................................

Patch Set 3:

> This can't be tested on hdfs since there are no "remote" blocks in
> the minicluster. So all the scan ranges of a file are added to the
> appropriate local disk queue once the header is processed.

This came up in a conversation between me and Joe today as well. Replication in HDFS is per file, so we should be able to "hdfs put" with appropriate options to induce a remote block, even in the minicluster. Unfortunately, it doesn't seem to work with the following sequence:

  $ impala-shell.sh -q 'create table t (x string)'
  $ yes | head > /tmp/f
  $ hadoop fs -D dfs.replication=1 -put /tmp/f /test-warehouse/t
  $ impala-shell.sh -i localhost:21002 -q 'set num_nodes=1; invalidate metadata t; select * from t limit 2; profile' | grep -i BytesReadShortCircuit

Impala seems to be doing short-circuit reads on all the impalads (presumably because the datanode somewhat reasonably decides things are indeed local).

Anyway, this surprised me, so I figured I'd mention it.

-- 
To view, visit http://gerrit.cloudera.org:8080/11517
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I211e2511ea3bb5edea29f1bd63e6b1fa4c4b1965
Gerrit-Change-Number: 11517
Gerrit-PatchSet: 3
Gerrit-Owner: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Reviewer: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 30 Oct 2018 03:36:10 +0000
Gerrit-HasComments: No
