[
https://issues.apache.org/jira/browse/IMPALA-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated IMPALA-13783:
------------------------------------
Description:
addHdfsPartitionsToCatalogDelta() contains a loop that collects updates for
new partitions:
{code:java}
long maxSentId = hdfsTable.getMaxSentPartitionId();
for (TCatalogObject catalogPart : hdfsTable.getNewPartitionsSinceLastUpdate()) {
  maxSentId = Math.max(maxSentId, catalogPart.getHdfs_partition().getId());
  ctx.addCatalogObject(catalogPart, false, updateSummary);
}
hdfsTable.setMaxSentPartitionId(maxSentId);
{code}
[https://github.com/apache/impala/blob/83452d640fd27410befaf31346eda15febf624f0/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L1794]
hdfsTable.getNewPartitionsSinceLastUpdate() returns a temporary list of
THdfsPartition objects, which contain the file descriptors (THdfsFileDesc) and
HMS parameters. These are not used in local catalog mode, yet they can have a
large memory footprint for a table with many partitions.
I tested a table with 6M partitions and 6M files (one file per partition). The
HdfsTable object itself takes 7.3GB in memory, but this temporary list takes
8.9GB when catalogd collects updates for the table right after it is loaded (so
all partitions are new).
!catalog-update-thread.png|width=419,height=244!!HdfsTable.png|height=244!
The array of TCatalogObjects that contain the THdfsPartitions (address is
0x60ebb1b70):
!THdfsPartition.png|width=654,height=261!
Stacktrace of the catalog-update thread:
!stacktrace.png|width=1052,height=424!
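The loop above receives the whole temporary list before any element is consumed, so every pending partition object is alive at the same time. As a rough illustration only, and not the actual fix, the sketch below builds one update object at a time and can leave out the heavy fields when they are not needed (as in local catalog mode). All class and method names here (PartitionStub, PartitionUpdateStub, PartitionDeltaSketch, sendNewPartitions) are hypothetical stand-ins rather than Impala classes. It also assumes that a partition counts as new when its id is greater than the last sent id, and that the consumer does not retain the objects it is given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for an in-memory partition; not an Impala class. */
final class PartitionStub {
  final long id;
  final String name;
  final List<String> fileDescs;  // stands in for the THdfsFileDesc entries

  PartitionStub(long id, String name, List<String> fileDescs) {
    this.id = id;
    this.name = name;
    this.fileDescs = fileDescs;
  }
}

/** Hypothetical stand-in for the per-partition update sent to the topic. */
final class PartitionUpdateStub {
  final long id;
  final String name;
  final List<String> fileDescs;  // null when the heavy fields are skipped

  PartitionUpdateStub(long id, String name, List<String> fileDescs) {
    this.id = id;
    this.name = name;
    this.fileDescs = fileDescs;
  }
}

final class PartitionDeltaSketch {
  /**
   * Streams updates for partitions whose id is greater than maxSentId.
   * Only one update object is created per iteration, so memory use does not
   * grow with the number of new partitions unless the sink keeps references.
   * Returns the new max sent id.
   */
  static long sendNewPartitions(List<PartitionStub> partitions, long maxSentId,
      boolean includeHeavyFields, Consumer<PartitionUpdateStub> sink) {
    long newMaxSentId = maxSentId;
    for (PartitionStub p : partitions) {
      if (p.id <= maxSentId) continue;  // assumed to be sent already
      // Copy the heavy fields only when the receiver actually needs them.
      List<String> files = includeHeavyFields ? new ArrayList<>(p.fileDescs) : null;
      sink.accept(new PartitionUpdateStub(p.id, p.name, files));
      newMaxSentId = Math.max(newMaxSentId, p.id);
    }
    return newMaxSentId;
  }
}
```

A caller would pass a sink that serializes or forwards each update immediately, for example `PartitionDeltaSketch.sendNewPartitions(parts, lastSentId, false, u -> forward(u))`, where `forward` is a placeholder for whatever consumes the update. Whether the real catalog code can consume partitions one at a time like this depends on locking and on what ctx.addCatalogObject() retains, which this sketch does not model.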
> Catalog update thread might hold a large memory in
> addHdfsPartitionsToCatalogDelta
> ----------------------------------------------------------------------------------
>
> Key: IMPALA-13783
> URL: https://issues.apache.org/jira/browse/IMPALA-13783
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
> Attachments: HdfsTable.png, THdfsPartition.png,
> catalog-update-thread.png, stacktrace.png