[ 
https://issues.apache.org/jira/browse/IMPALA-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767748#comment-17767748
 ] 

Wenzhe Zhou commented on IMPALA-12456:
--------------------------------------

Looking into IMPALA-12456. The patch for IMPALA-7320 
(https://gerrit.cloudera.org/#/c/11027/) added a function 
HdfsTable.preloadPermissionsCache() to bulk-fetch the permissions. This 
function is called when partitions are added while altering a table. The 
call path is:
   HdfsTable.preloadPermissionsCache() is called by 
   HdfsTable.createAndLoadPartitions(),
   which is called by CatalogOpExecutor.addHdfsPartitions(),
   which is in turn called by CatalogOpExecutor.alterTableAddPartitions() 
   and CatalogOpExecutor.alterTableRecoverPartitions().
But HdfsTable.preloadPermissionsCache() applies a restriction:
{code:java}
    // Only preload permissions if the number of partitions to be added is
    // large (3x) relative to the number of existing partitions. This covers
    // two common cases:
    //
    // 1) initial load of a table (no existing partition metadata)
    // 2) ALTER TABLE RECOVER PARTITIONS after creating a table pointing to
    // an already-existing partition directory tree
    //
    // Without this heuristic, we would end up using a "listStatus" call to
    // potentially fetch a bunch of irrelevant information about existing
    // partitions when we only want to know about a small number of newly-added
    // partitions.
    if (msPartitions.size() < partitionMap_.size() * 3) return permCache;
{code}
The optimization is therefore skipped whenever the number of partitions being 
added is less than 3x the number of existing partitions. That is why we don't 
see the optimization for ALTER TABLE in some cases. But it's risky to simply 
relax this heuristic, since the bulk listing could then fetch information 
about many existing partitions that are irrelevant to the few being added.
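
To make the threshold concrete, here is a minimal standalone sketch of the 
guard quoted above. The class name, the method name shouldPreload() and the 
sample counts are made up for illustration and are not part of Impala; only 
the 3x comparison is taken from the snippet.

```java
public class PreloadHeuristicSketch {
  // Hypothetical helper mirroring the quoted guard: the bulk preload is
  // skipped when newPartitions < existingPartitions * 3, so it only runs
  // when the batch is at least 3x the number of already-loaded partitions.
  public static boolean shouldPreload(int newPartitions, int existingPartitions) {
    // Widen to long so the multiplication cannot overflow in this sketch.
    return newPartitions >= (long) existingPartitions * 3;
  }

  public static void main(String[] args) {
    // Illustrative (new, existing) partition counts, not measured data.
    int[][] cases = { {500, 0}, {300, 100}, {299, 100}, {1, 1000} };
    for (int[] c : cases) {
      System.out.printf("new=%d existing=%d preload=%b%n",
          c[0], c[1], shouldPreload(c[0], c[1]));
    }
  }
}
```

The last case, one new partition against 1000 existing ones, has the same 
shape as the event-processing scenario in the issue description: the guard 
returns early and the bulk fetch is not used.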

HdfsTable.preloadPermissionsCache() also has a TODO for a further optimization:
{code:java}
// TODO(todd): when HDFS-13616 (batch listing of multiple directories)
// is implemented, we could likely implement this with a single round
// trip.
{code}
HDFS-13616 has since been implemented, but the batch method was reported to 
have a bug when used with more than 1000 files.

> Apply optimization in IMPALA-7320 during event processing
> ---------------------------------------------------------
>
>                 Key: IMPALA-12456
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12456
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> I saw some callstacks where table loading triggered by alter table events 
> dominated most threads. It turned out that the optimization in IMPALA-7320 is 
> not applied there and a file listing is initiated for each partition, even if 
> a single recursive listing could read all data.


