Philip Zeyliger has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11517 )

Change subject: [WIP] IMPALA-6932: Speed up scans for sequence datasets with 
many files
......................................................................


Patch Set 3:

> This can't be tested on hdfs since there are no "remote" blocks in
 > the minicluster. So all the scan ranges of a file are added to the
 > appropriate local disk queue once the header is processed.

This came up in a conversation between Joe and me today as well. Replication in 
HDFS is per file, so we should be able to "hadoop fs -put" a file with a 
replication factor of 1 and induce a remote block, even in the minicluster: 
only one datanode would hold the block, so the other impalads should see it as 
remote. Unfortunately, it doesn't seem to work with the following sequence:

$ impala-shell.sh -q 'create table t (x string)'
$ yes | head > /tmp/f
$ hadoop fs -D dfs.replication=1 -put /tmp/f /test-warehouse/t
$ impala-shell.sh -i localhost:21002 -q 'set num_nodes=1; invalidate metadata t; select * from t limit 2; profile' | grep -i BytesReadShortCircuit

Impala seems to be doing short-circuit reads on all the impalads (presumably 
because the datanode, somewhat reasonably, decides things are indeed local).
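
To check whether any bytes were read remotely at all, it may help to look at 
the sibling counters too, not only BytesReadShortCircuit. A minimal sketch, 
assuming the scan-node profile prints lines such as 
"- BytesReadShortCircuit: 20.00 B (20)". The function name 
read_locality_counters is hypothetical, and the counter names are the ones I 
recall from the HDFS scan node profile, so treat both as assumptions:

```shell
# Hypothetical helper: reads an Impala query profile on stdin and prints the
# byte counters that distinguish total, local, short-circuit, and remote reads.
# Assumption: counter lines look like "   - BytesReadShortCircuit: 20.00 B (20)".
read_locality_counters() {
  # Keep BytesRead, BytesReadLocal, BytesReadShortCircuit and
  # BytesReadRemoteUnexpected; then strip the leading "   - " prefix.
  grep -E 'BytesRead(Local|ShortCircuit|RemoteUnexpected)?:' \
    | sed -E 's/^[[:space:]]*-[[:space:]]*//'
}
```

Piping the profile through read_locality_counters in place of the grep would 
show the counters side by side. Separately, 
"hdfs fsck /test-warehouse/t/f -files -blocks -locations" should list which 
datanode holds the single replica (the path is assumed from the -put above).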

Anyway--this surprised me so I figured I'd mention it.


--
To view, visit http://gerrit.cloudera.org:8080/11517
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I211e2511ea3bb5edea29f1bd63e6b1fa4c4b1965
Gerrit-Change-Number: 11517
Gerrit-PatchSet: 3
Gerrit-Owner: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Reviewer: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 30 Oct 2018 03:36:10 +0000
Gerrit-HasComments: No
