Quanlong Huang created IMPALA-13783:
---------------------------------------

             Summary: Catalog update thread might hold a large amount of 
memory in addHdfsPartitionsToCatalogDelta
                 Key: IMPALA-13783
                 URL: https://issues.apache.org/jira/browse/IMPALA-13783
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang


In addHdfsPartitionsToCatalogDelta(), there is a loop that collects updates 
for new partitions:
{code:java}
    long maxSentId = hdfsTable.getMaxSentPartitionId();
    for (TCatalogObject catalogPart : hdfsTable.getNewPartitionsSinceLastUpdate()) {
      maxSentId = Math.max(maxSentId, catalogPart.getHdfs_partition().getId());
      ctx.addCatalogObject(catalogPart, false, updateSummary);
    }
    hdfsTable.setMaxSentPartitionId(maxSentId);
{code}
hdfsTable.getNewPartitionsSinceLastUpdate() returns a temporary list of 
THdfsPartition objects that contain the file descriptors (THdfsFileDesc) and 
HMS parameters. These are not used in local catalog mode, yet for a table 
with lots of partitions the list can have a large memory footprint.

I tested with a table of 6M partitions and 6M files (one file per partition). 
The HdfsTable object itself takes 7.3GB in memory, while this temporary list 
takes 8.9GB when catalogd collects updates of the table right after it's 
loaded (so all partitions are new).
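One possible direction is to convert each new partition to its thrift form on demand instead of materializing the whole temporary list, so only one converted object is live at a time while the delta is built. A minimal, self-contained sketch of that shape (the names LazyDelta, lazyConvert and the string stand-in for THdfsPartition are illustrative, not Impala's real API):
{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class LazyDelta {
  /**
   * Wraps a source list in an Iterable that applies convert() one element
   * at a time, so the converted objects are never all alive at once.
   */
  public static <I, O> Iterable<O> lazyConvert(List<I> src, Function<I, O> convert) {
    return () -> new Iterator<O>() {
      private final Iterator<I> it = src.iterator();
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public O next() { return convert.apply(it.next()); }
    };
  }

  public static void main(String[] args) {
    // Stand-ins for the partition ids and for addCatalogObject(): each
    // converted "partition" can be consumed and garbage-collected before
    // the next one is built.
    List<Long> newPartitionIds = List.of(3L, 1L, 2L);
    long maxSentId = 0;
    for (Long id : newPartitionIds) maxSentId = Math.max(maxSentId, id);
    for (String part : lazyConvert(newPartitionIds, id -> "THdfsPartition#" + id)) {
      System.out.println(part);  // ctx.addCatalogObject(part, ...) in the real code
    }
    System.out.println("maxSentId=" + maxSentId);
  }
}
{code}
With this shape, peak extra memory is one THdfsPartition (plus whatever the serialized delta needs) rather than the full 8.9GB list measured above.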



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
