[
https://issues.apache.org/jira/browse/IMPALA-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang resolved IMPALA-11812.
-------------------------------------
Fix Version/s: Impala 4.3.0
Resolution: Fixed
Resolving this. Thanks to [~amansinha] and [~sql_forever] for the review!
> Catalogd OOM due to lots of HMS FieldSchema instances
> -----------------------------------------------------
>
> Key: IMPALA-11812
> URL: https://issues.apache.org/jira/browse/IMPALA-11812
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
> Fix For: Impala 4.3.0
>
> Attachments: MAT_dominator_tree.png,
> create_ext_tbl_with_5k_cols_50k_parts.sh,
> wide-table-loading-oom-FieldSchema-path2root.png,
> wide-table-loading-oom-histogram.png, wide-table-loading-oom-top-consumers.png
>
>
> For partitioned wide tables that have thousands of columns, catalogd might
> hit OOM in routines that touch all of their partitions, e.g. when running
> AlterTableRecoverPartitions or when initially loading the partition metadata.
> The direct cause is that the heap fills up with HMS FieldSchema instances.
> Here is a histogram of the issue in a 4GB heap:
> {noformat}
> Class Name                                                    |     Objects |  Shallow Heap |
> ----------------------------------------------------------------------------------------------
> org.apache.hadoop.hive.metastore.api.FieldSchema              | 111,876,486 | 2,685,035,664 |
> java.lang.Object[]                                            |      78,026 |   449,929,656 |
> char[]                                                        |      91,295 |     6,241,744 |
> java.util.ArrayList                                           |     171,126 |     4,107,024 |
> java.util.HashMap                                             |      71,135 |     3,414,480 |
> java.lang.String                                              |      91,161 |     2,187,864 |
> java.util.concurrent.ConcurrentHashMap$Node                   |      59,614 |     1,907,648 |
> java.util.concurrent.atomic.LongAdder                         |      53,021 |     1,696,672 |
> org.apache.hadoop.hive.metastore.api.Partition                |      22,374 |     1,610,928 |
> com.codahale.metrics.EWMA                                     |      30,780 |     1,477,440 |
> com.codahale.metrics.LongAdderProxy$JdkProvider$1             |      53,021 |     1,272,504 |
> org.apache.hadoop.hive.metastore.api.StorageDescriptor        |      22,376 |     1,253,056 |
> java.util.Hashtable$Entry                                     |      36,921 |     1,181,472 |
> java.util.concurrent.atomic.AtomicLong                        |      39,444 |       946,656 |
> org.apache.hadoop.hive.metastore.api.SerDeInfo                |      22,375 |       895,000 |
> byte[]                                                        |       1,686 |       668,480 |
> java.util.concurrent.ConcurrentHashMap$Node[]                 |       1,874 |       639,824 |
> com.codahale.metrics.ExponentiallyDecayingReservoir           |      10,259 |       574,504 |
> java.util.HashMap$Node                                        |      17,776 |       568,832 |
> org.apache.hadoop.hive.metastore.api.SkewedInfo               |      22,375 |       537,000 |
> com.codahale.metrics.Meter                                    |      10,260 |       492,480 |
> java.util.concurrent.ConcurrentSkipListMap                    |      10,260 |       492,480 |
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync |      10,259 |       492,432 |
> org.apache.impala.catalog.ColumnStats                         |       5,003 |       400,240 |
> Total: 24 of 6,158 entries; 6,130 more                        | 113,007,927 | 3,174,330,288 |
> {noformat}
> In the above case, these FieldSchema instances come from the list of
> hmsPartitions that is created locally by
> CatalogOpExecutor#alterTableRecoverPartitions(). The thread is 0x6d051abb8:
> !MAT_dominator_tree.png|width=825,height=308!
> Stacktrace:
> {noformat}
> Thread 0x6d051abb8
>   at org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V (StorageDescriptor.java:216)
>   at org.apache.hadoop.hive.metastore.api.StorageDescriptor.deepCopy()Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor; (StorageDescriptor.java:256)
>   at org.apache.impala.service.CatalogOpExecutor.createHmsPartitionFromValues(Ljava/util/List;Lorg/apache/hadoop/hive/metastore/api/Table;Lorg/apache/impala/analysis/TableName;Ljava/lang/String;)Lorg/apache/hadoop/hive/metastore/api/Partition; (CatalogOpExecutor.java:5787)
>   at org.apache.impala.service.CatalogOpExecutor.alterTableRecoverPartitions(Lorg/apache/impala/catalog/Table;Ljava/lang/String;)V (CatalogOpExecutor.java:5678)
>   at org.apache.impala.service.CatalogOpExecutor.alterTable(Lorg/apache/impala/thrift/TAlterTableParams;Ljava/lang/String;ZLorg/apache/impala/thrift/TDdlExecResponse;)V (CatalogOpExecutor.java:1208)
>   at org.apache.impala.service.CatalogOpExecutor.execDdlRequest(Lorg/apache/impala/thrift/TDdlExecRequest;)Lorg/apache/impala/thrift/TDdlExecResponse; (CatalogOpExecutor.java:419)
>   at org.apache.impala.service.JniCatalog.execDdl([B)[B (JniCatalog.java:260)
> {noformat}
> *How this happens*
> When creating the list of hmsPartitions, we deep-copy the StorageDescriptor,
> which also deep-copies its column list:
> {code:java}
> alterTableRecoverPartitions()
> -> createHmsPartitionFromValues()
> -> StorageDescriptor sd = msTbl.getSd().deepCopy();{code}
> Since Impala doesn't respect partition-level schemas (by design), we should
> share the same list of FieldSchema instances across all hmsPartitions.
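The proposed sharing can be illustrated as follows. This is a minimal sketch using hypothetical stand-in classes (FieldSchema, StorageDescriptor, buildPartitionSds), not the actual patch to CatalogOpExecutor; the real code operates on the Thrift-generated classes in org.apache.hadoop.hive.metastore.api:

```java
import java.util.ArrayList;
import java.util.List;

public class SharedColsSketch {
  // Hypothetical stand-ins for the HMS Thrift classes, only to show the sharing.
  static class FieldSchema {
    final String name;
    final String type;
    FieldSchema(String name, String type) { this.name = name; this.type = type; }
  }

  static class StorageDescriptor {
    List<FieldSchema> cols;
  }

  // Build one StorageDescriptor per partition. Every descriptor references the
  // SAME column list object, so a table with 5k columns and 50k partitions
  // holds 5k FieldSchema instances instead of 250 million.
  static List<StorageDescriptor> buildPartitionSds(List<FieldSchema> tableCols,
                                                   int numPartitions) {
    List<StorageDescriptor> sds = new ArrayList<>(numPartitions);
    for (int i = 0; i < numPartitions; i++) {
      StorageDescriptor sd = new StorageDescriptor();
      sd.cols = tableCols;  // share, don't deep-copy
      sds.add(sd);
    }
    return sds;
  }

  public static void main(String[] args) {
    List<FieldSchema> cols = new ArrayList<>();
    for (int i = 0; i < 5000; i++) cols.add(new FieldSchema("c" + i, "string"));
    List<StorageDescriptor> sds = buildPartitionSds(cols, 3);
    // All partitions reference the identical list object.
    System.out.println(sds.get(0).cols == sds.get(2).cols);  // prints "true"
  }
}
```

Sharing is safe here precisely because the partition-level schema is never consulted; any code path that did mutate a partition's column list would now affect every partition, so the real fix must keep the shared list effectively immutable.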
> We can also hit this issue when loading partition metadata for such a table.
> The HMS API "get_partitions_by_names" returns the list of hmsPartitions, and
> each of them references its own copy of the FieldSchema list. We should
> deduplicate these copies so they share the same column list. FWIW, attached
> are the heap analysis results for the wide-table loading OOM (4GB heap):
> !wide-table-loading-oom-histogram.png|width=623,height=695!
> !wide-table-loading-oom-top-consumers.png|width=1066,height=434!
> The FieldSchema instances come from the metadata loading thread:
> !wide-table-loading-oom-FieldSchema-path2root.png|width=1023,height=213!
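One way to sketch the deduplication on the loading path is an interner that maps each distinct column list to a single canonical instance. The ColListInterner class below is a hypothetical helper, not the committed fix; it is shown with String columns, but the same trick works on List<FieldSchema> because the Thrift-generated FieldSchema implements equals()/hashCode():

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColListInterner {
  // Maps each distinct column list (by equals()) to the first instance seen.
  private final Map<List<String>, List<String>> canonical = new HashMap<>();

  // Returns the canonical instance for an equal column list: if an equal key
  // already exists, computeIfAbsent() returns its stored value; otherwise the
  // list becomes its own canonical instance.
  public List<String> intern(List<String> cols) {
    return canonical.computeIfAbsent(cols, c -> c);
  }

  public static void main(String[] args) {
    ColListInterner interner = new ColListInterner();
    List<String> a = new ArrayList<>(List.of("c1", "c2"));
    List<String> b = new ArrayList<>(List.of("c1", "c2"));  // equal, distinct object
    // Both equal lists resolve to one shared instance; the duplicate copies
    // returned by get_partitions_by_names become garbage-collectable.
    System.out.println(interner.intern(a) == interner.intern(b));  // prints "true"
  }
}
```

After interning, each hmsPartition's StorageDescriptor can be pointed at the canonical list, so N partitions hold one column list instead of N.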
--
This message was sent by Atlassian Jira
(v8.20.10#820010)