Quanlong Huang created IMPALA-13783:
---------------------------------------
Summary: Catalog update thread might hold a large memory in
addHdfsPartitionsToCatalogDelta
Key: IMPALA-13783
URL: https://issues.apache.org/jira/browse/IMPALA-13783
Project: IMPALA
Issue Type: Bug
Components: Catalog
Reporter: Quanlong Huang
Assignee: Quanlong Huang
In addHdfsPartitionsToCatalogDelta(), there is a loop that collects updates on the
new partitions:
{code:java}
long maxSentId = hdfsTable.getMaxSentPartitionId();
for (TCatalogObject catalogPart : hdfsTable.getNewPartitionsSinceLastUpdate()) {
  maxSentId = Math.max(maxSentId, catalogPart.getHdfs_partition().getId());
  ctx.addCatalogObject(catalogPart, false, updateSummary);
}
hdfsTable.setMaxSentPartitionId(maxSentId);
{code}
hdfsTable.getNewPartitionsSinceLastUpdate() returns a temporary list of
THdfsPartition objects, which contain the file descriptors (THdfsFileDesc) and
HMS parameters. These fields are not used in local catalog mode, yet they can
hold a large memory footprint for a table with many partitions.
I tested a table with 6M partitions and 6M files (one file per partition): the
HdfsTable object itself takes 7.3GB in memory, but this temp list takes 8.9GB
when catalogd collects updates of the table right after it is loaded (so all
partitions are new).
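To illustrate why the materialized list matters, here is a self-contained sketch (the classes and method names below, e.g. FakePartition, eager(), lazy(), are stand-ins for illustration, not Impala code): if the new partitions are streamed one at a time instead of collected into a list up front, each object becomes garbage-collectible as soon as it has been added to the delta, so the peak footprint is one partition rather than all of them.
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for THdfsPartition; the byte[] models the THdfsFileDesc payload.
class FakePartition {
  final long id;
  final byte[] fileDescs;
  FakePartition(long id, int fdBytes) {
    this.id = id;
    this.fileDescs = new byte[fdBytes];
  }
}

public class LazyDeltaSketch {
  // Eager variant: materializes every partition object at once, like the
  // temp list built by getNewPartitionsSinceLastUpdate(). All n objects
  // (and their file descriptors) are live simultaneously.
  static List<FakePartition> eager(int n, int fdBytes) {
    List<FakePartition> parts = new ArrayList<>();
    for (long i = 0; i < n; i++) parts.add(new FakePartition(i, fdBytes));
    return parts;
  }

  // Lazy variant: yields one partition at a time, so each one can be
  // collected as soon as the caller moves on to the next.
  static Iterable<FakePartition> lazy(int n, int fdBytes) {
    return () -> new Iterator<FakePartition>() {
      long next = 0;
      public boolean hasNext() { return next < n; }
      public FakePartition next() { return new FakePartition(next++, fdBytes); }
    };
  }

  public static void main(String[] args) {
    // Same bookkeeping as the loop above, but over a lazy stream.
    long maxSentId = -1;
    for (FakePartition p : lazy(1000, 1024)) {
      maxSentId = Math.max(maxSentId, p.id);
    }
    System.out.println(maxSentId);
  }
}
{code}
The same idea would apply to the real loop only if nothing downstream needs the full list at once; whether ctx.addCatalogObject() retains a reference to each object is a separate question.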
--
This message was sent by Atlassian Jira
(v8.20.10#820010)