Paul Rogers has uploaded a new patch set (#7) to the change originally created 
by Todd Lipcon. ( http://gerrit.cloudera.org:8080/11227 )

Change subject: IMPALA-7047. Refreshing partitions should not make an RPC per 
file
......................................................................

IMPALA-7047. Refreshing partitions should not make an RPC per file

The code to handle REFRESH of a single partition was incorrectly
ignoring the previously-known file descriptors. This meant that, rather
than calling 'getFileBlockLocations' only on the files that had changed
since the prior load, it was calling it on every file.

In addition to refresh of single partitions, this also affected refresh
of unpartitioned tables (which is implemented as a refresh of the
table's single "default" partition).

This patch fixes the behavior by copying over the existing file
descriptor list into the re-created partition before refreshing it. A
new unit test uses FS statistics to verify the change. The new
assertions act as a regression test and fail if I comment out the fix.
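
The reuse described above can be illustrated with a minimal standalone
sketch. It is not the code in HdfsTable.java: every type and method name
below is hypothetical, and it assumes a file counts as unchanged when
its name, length and modification time all match the cached descriptor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are Impala's actual classes.
final class RefreshSketch {
  /** One entry of a directory listing: name, size, modification time. */
  static final class FileStat {
    final String name;
    final long length;
    final long modTime;
    FileStat(String name, long length, long modTime) {
      this.name = name;
      this.length = length;
      this.modTime = modTime;
    }
  }

  /** A cached descriptor: the listing entry plus its block locations. */
  static final class FileDesc {
    final FileStat stat;
    final String blockLocations;
    FileDesc(FileStat stat, String blockLocations) {
      this.stat = stat;
      this.blockLocations = blockLocations;
    }
  }

  /** Counts the simulated per-file block location RPCs. */
  static int locationRpcs = 0;

  /** Stand-in for the expensive per-file call (one RPC per invocation). */
  static String fetchBlockLocations(FileStat stat) {
    locationRpcs++;
    return "blocks:" + stat.name;
  }

  /** Reuses old descriptors for unchanged files; fetches only the rest. */
  static List<FileDesc> refresh(List<FileDesc> oldDescs, List<FileStat> listing) {
    Map<String, FileDesc> byName = new HashMap<>();
    for (FileDesc fd : oldDescs) byName.put(fd.stat.name, fd);

    List<FileDesc> result = new ArrayList<>();
    for (FileStat stat : listing) {
      FileDesc old = byName.get(stat.name);
      boolean unchanged = old != null
          && old.stat.length == stat.length
          && old.stat.modTime == stat.modTime;
      result.add(unchanged ? old : new FileDesc(stat, fetchBlockLocations(stat)));
    }
    return result;
  }

  public static void main(String[] args) {
    List<FileDesc> loaded = refresh(new ArrayList<>(),
        Arrays.asList(new FileStat("a", 10, 1), new FileStat("b", 20, 1)));
    System.out.println("RPCs on initial load: " + locationRpcs);

    locationRpcs = 0;
    // "a" is unchanged, "b" was rewritten, "c" is new.
    refresh(loaded, Arrays.asList(new FileStat("a", 10, 1),
        new FileStat("b", 25, 2), new FileStat("c", 5, 2)));
    System.out.println("RPCs on refresh: " + locationRpcs);
  }
}
```

Passing an empty old descriptor list to refresh() makes every file
trigger a lookup, which corresponds to the behavior before the fix.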

Additionally, when the old partition had no files, the refresh now uses
the optimized 'listLocatedStatus' call. This case is triggered when
'REFRESH' picks up a new partition that an external system added to the
HMS.

I also tested this by pointing my dev box at a remote filesystem that
was approximately 60ms away. The initial load of an unpartitioned table
with approximately 45000 files takes around 23 seconds in this setup.
Without the patch in place, REFRESH ran for upwards of 35 minutes
before I got tired and gave up. Multiplying the 60ms round trip by
45000 files gives an estimate of 45 minutes for the full run. With the
fix in place, REFRESH of the same table took around 4.5 seconds.

Clearly, in typical setups where catalogd and HDFS are on a shared local
network, the gains won't be so dramatic. But, even with a 1ms round trip
(plausible when including fixed RPC overhead and potentially congested
datacenter networks) this would save 45 seconds on this example table
with 45000 files.
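
The estimates above follow from a simple serial model: one round trip
per file, with nothing overlapped. A small sketch of that arithmetic
(the class and method names are illustrative only, and the model
ignores any server-side processing time):

```java
// Hypothetical helper that reproduces the back-of-the-envelope estimates.
final class RefreshLatencyEstimate {
  /** Serial cost, in seconds, of issuing one RPC per file. */
  static double perFileRpcSeconds(int fileCount, double roundTripMillis) {
    return fileCount * roundTripMillis / 1000.0;
  }

  public static void main(String[] args) {
    int files = 45000;
    double remote = perFileRpcSeconds(files, 60.0); // remote filesystem case
    double local = perFileRpcSeconds(files, 1.0);   // datacenter-local case
    System.out.printf("60ms round trip: %.0f s (%.0f min)%n", remote, remote / 60.0);
    System.out.printf("1ms round trip: %.0f s%n", local);
  }
}
```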

Change-Id: I2051b96599206164aaa06ecbdf64374c46eda956
---
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
3 files changed, 82 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/11227/7
--
To view, visit http://gerrit.cloudera.org:8080/11227
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2051b96599206164aaa06ecbdf64374c46eda956
Gerrit-Change-Number: 11227
Gerrit-PatchSet: 7
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <par0...@yahoo.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
