[ 
https://issues.apache.org/jira/browse/IMPALA-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-11812:
------------------------------------
    Description: 
For partitioned wide tables that have thousands of columns, catalogd might hit 
OOM in routines on them. E.g. when running AlterTableRecoverPartitions for all 
their partitions, or when initially loading all partitions of them.

The direct reason is that the heap is full of HMS FieldSchema instances. Here 
is a histogram of the issue in a 4GB heap:
{noformat}
Class Name                                                   |     Objects |  
Shallow Heap |
--------------------------------------------------------------------------------------------
org.apache.hadoop.hive.metastore.api.FieldSchema             | 111,876,486 | 
2,685,035,664 |
java.lang.Object[]                                           |      78,026 |   
449,929,656 |
char[]                                                       |      91,295 |    
 6,241,744 |
java.util.ArrayList                                          |     171,126 |    
 4,107,024 |
java.util.HashMap                                            |      71,135 |    
 3,414,480 |
java.lang.String                                             |      91,161 |    
 2,187,864 |
java.util.concurrent.ConcurrentHashMap$Node                  |      59,614 |    
 1,907,648 |
java.util.concurrent.atomic.LongAdder                        |      53,021 |    
 1,696,672 |
org.apache.hadoop.hive.metastore.api.Partition               |      22,374 |    
 1,610,928 |
com.codahale.metrics.EWMA                                    |      30,780 |    
 1,477,440 |
com.codahale.metrics.LongAdderProxy$JdkProvider$1            |      53,021 |    
 1,272,504 |
org.apache.hadoop.hive.metastore.api.StorageDescriptor       |      22,376 |    
 1,253,056 |
java.util.Hashtable$Entry                                    |      36,921 |    
 1,181,472 |
java.util.concurrent.atomic.AtomicLong                       |      39,444 |    
   946,656 |
org.apache.hadoop.hive.metastore.api.SerDeInfo               |      22,375 |    
   895,000 |
byte[]                                                       |       1,686 |    
   668,480 |
java.util.concurrent.ConcurrentHashMap$Node[]                |       1,874 |    
   639,824 |
com.codahale.metrics.ExponentiallyDecayingReservoir          |      10,259 |    
   574,504 |
java.util.HashMap$Node                                       |      17,776 |    
   568,832 |
org.apache.hadoop.hive.metastore.api.SkewedInfo              |      22,375 |    
   537,000 |
com.codahale.metrics.Meter                                   |      10,260 |    
   492,480 |
java.util.concurrent.ConcurrentSkipListMap                   |      10,260 |    
   492,480 |
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync|      10,259 |    
   492,432 |
org.apache.impala.catalog.ColumnStats                        |       5,003 |    
   400,240 |                 
Total: 24 of 6,158 entries; 6,130 more                       | 113,007,927 | 
3,174,330,288 | {noformat}
In the above case, these FieldSchema instances come from the list of 
hmsPartitions that is created locally by 
CatalogOpExecutor#alterTableRecoverPartitions(). The thread is 0x6d051abb8:

!MAT_dominator_tree.png|width=825,height=308!

Stacktrace:
{noformat}
Thread 0x6d051abb8
  at 
org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V
 (StorageDescriptor.java:216)
  at 
org.apache.hadoop.hive.metastore.api.StorageDescriptor.deepCopy()Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;
 (StorageDescriptor.java:256)
  at 
org.apache.impala.service.CatalogOpExecutor.createHmsPartitionFromValues(Ljava/util/List;Lorg/apache/hadoop/hive/metastore/api/Table;Lorg/apache/impala/analysis/TableName;Ljava/lang/String;)Lorg/apache/hadoop/hive/metastore/api/Partition;
 (CatalogOpExecutor.java:5787)
  at 
org.apache.impala.service.CatalogOpExecutor.alterTableRecoverPartitions(Lorg/apache/impala/catalog/Table;Ljava/lang/String;)V
 (CatalogOpExecutor.java:5678)
  at 
org.apache.impala.service.CatalogOpExecutor.alterTable(Lorg/apache/impala/thrift/TAlterTableParams;Ljava/lang/String;ZLorg/apache/impala/thrift/TDdlExecResponse;)V
 (CatalogOpExecutor.java:1208)
  at 
org.apache.impala.service.CatalogOpExecutor.execDdlRequest(Lorg/apache/impala/thrift/TDdlExecRequest;)Lorg/apache/impala/thrift/TDdlExecResponse;
 (CatalogOpExecutor.java:419)
  at org.apache.impala.service.JniCatalog.execDdl([B)[B 
(JniCatalog.java:260){noformat}
*How this happen*

When creating the list of hmsPartitions, we deep copy the StorageDescriptor 
which will also deep copy the column list:
{code:java}
alterTableRecoverPartitions()
-> createHmsPartitionFromValues()
   -> StorageDescriptor sd = msTbl.getSd().deepCopy();{code}
Impala doesn't respect the partition level schema (by design), we should share 
the list of FieldSchema across hmsPartitions.

When loading partition metadata for such a table, we could also hit this issue. 
The HMS API "get_partitions_by_names" returns the list of hmsPartitions. Each 
of them reference a unique list of FieldSchemas. We should deduplicate them to 
share the same column list. FWIW, attached the heap analysis results for wide 
table loading OOM (4GB heap):

!wide-table-loading-oom-histogram.png|width=623,height=695!

!wide-table-loading-oom-top-consumers.png|width=1066,height=434!

The FieldSchema instances come from the metadata loading thread:

!wide-table-loading-oom-FieldSchema-path2root.png|width=1023,height=213!

  was:
For partitioned wide tables that have thousands of columns, catalogd might hit 
OOM in routines on them. E.g. when running AlterTableRecoverPartitions for all 
their partitions, or when initially loading all partitions of them.

The direct reason is that the heap is full of HMS FieldSchema instances. Here 
is a histogram of the issue in a 4GB heap:
{noformat}
Class Name                                                   |     Objects |  
Shallow Heap |
--------------------------------------------------------------------------------------------
org.apache.hadoop.hive.metastore.api.FieldSchema             | 111,876,486 | 
2,685,035,664 |
java.lang.Object[]                                           |      78,026 |   
449,929,656 |
char[]                                                       |      91,295 |    
 6,241,744 |
java.util.ArrayList                                          |     171,126 |    
 4,107,024 |
java.util.HashMap                                            |      71,135 |    
 3,414,480 |
java.lang.String                                             |      91,161 |    
 2,187,864 |
java.util.concurrent.ConcurrentHashMap$Node                  |      59,614 |    
 1,907,648 |
java.util.concurrent.atomic.LongAdder                        |      53,021 |    
 1,696,672 |
org.apache.hadoop.hive.metastore.api.Partition               |      22,374 |    
 1,610,928 |
com.codahale.metrics.EWMA                                    |      30,780 |    
 1,477,440 |
com.codahale.metrics.LongAdderProxy$JdkProvider$1            |      53,021 |    
 1,272,504 |
org.apache.hadoop.hive.metastore.api.StorageDescriptor       |      22,376 |    
 1,253,056 |
java.util.Hashtable$Entry                                    |      36,921 |    
 1,181,472 |
java.util.concurrent.atomic.AtomicLong                       |      39,444 |    
   946,656 |
org.apache.hadoop.hive.metastore.api.SerDeInfo               |      22,375 |    
   895,000 |
byte[]                                                       |       1,686 |    
   668,480 |
java.util.concurrent.ConcurrentHashMap$Node[]                |       1,874 |    
   639,824 |
com.codahale.metrics.ExponentiallyDecayingReservoir          |      10,259 |    
   574,504 |
java.util.HashMap$Node                                       |      17,776 |    
   568,832 |
org.apache.hadoop.hive.metastore.api.SkewedInfo              |      22,375 |    
   537,000 |
com.codahale.metrics.Meter                                   |      10,260 |    
   492,480 |
java.util.concurrent.ConcurrentSkipListMap                   |      10,260 |    
   492,480 |
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync|      10,259 |    
   492,432 |
org.apache.impala.catalog.ColumnStats                        |       5,003 |    
   400,240 |                 
Total: 24 of 6,158 entries; 6,130 more                       | 113,007,927 | 
3,174,330,288 | {noformat}
In the above case, these FieldSchema instances come from the list of 
hmsPartitions that is created locally by 
CatalogOpExecutor#alterTableRecoverPartitions(). The thread is 0x6d051abb8:

!MAT_dominator_tree.png|width=825,height=308!

Stacktrace:
{noformat}
Thread 0x6d051abb8
  at 
org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V
 (StorageDescriptor.java:216)
  at 
org.apache.hadoop.hive.metastore.api.StorageDescriptor.deepCopy()Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;
 (StorageDescriptor.java:256)
  at 
org.apache.impala.service.CatalogOpExecutor.createHmsPartitionFromValues(Ljava/util/List;Lorg/apache/hadoop/hive/metastore/api/Table;Lorg/apache/impala/analysis/TableName;Ljava/lang/String;)Lorg/apache/hadoop/hive/metastore/api/Partition;
 (CatalogOpExecutor.java:5787)
  at 
org.apache.impala.service.CatalogOpExecutor.alterTableRecoverPartitions(Lorg/apache/impala/catalog/Table;Ljava/lang/String;)V
 (CatalogOpExecutor.java:5678)
  at 
org.apache.impala.service.CatalogOpExecutor.alterTable(Lorg/apache/impala/thrift/TAlterTableParams;Ljava/lang/String;ZLorg/apache/impala/thrift/TDdlExecResponse;)V
 (CatalogOpExecutor.java:1208)
  at 
org.apache.impala.service.CatalogOpExecutor.execDdlRequest(Lorg/apache/impala/thrift/TDdlExecRequest;)Lorg/apache/impala/thrift/TDdlExecResponse;
 (CatalogOpExecutor.java:419)
  at org.apache.impala.service.JniCatalog.execDdl([B)[B 
(JniCatalog.java:260){noformat}
*How this happen*

When creating the list of hmsPartitions, we deep copy the StorageDescriptor 
which will also deep copy the column list:
{code:java}
alterTableRecoverPartitions()
-> createHmsPartitionFromValues()
   -> StorageDescriptor sd = msTbl.getSd().deepCopy();{code}
Impala doesn't respect the partition level schema (by design), we should share 
the list of FieldSchema across hmsPartitions.

When loading partition metadata for such a table, we could also hit this issue. 
The HMS API "get_partitions_by_names" returns the list of hmsPartitions. Each 
of them reference a unique list of FieldSchemas. We should deduplicate them to 
share the same column list.


> Catalogd OOM due to lots of HMS FieldSchema instances
> -----------------------------------------------------
>
>                 Key: IMPALA-11812
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11812
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>         Attachments: MAT_dominator_tree.png, 
> create_ext_tbl_with_5k_cols_50k_parts.sh, 
> wide-table-loading-oom-FieldSchema-path2root.png, 
> wide-table-loading-oom-histogram.png, wide-table-loading-oom-top-consumers.png
>
>
> For partitioned wide tables that have thousands of columns, catalogd might 
> hit OOM in routines on them. E.g. when running AlterTableRecoverPartitions 
> for all their partitions, or when initially loading all partitions of them.
> The direct reason is that the heap is full of HMS FieldSchema instances. Here 
> is a histogram of the issue in a 4GB heap:
> {noformat}
> Class Name                                                   |     Objects |  
> Shallow Heap |
> --------------------------------------------------------------------------------------------
> org.apache.hadoop.hive.metastore.api.FieldSchema             | 111,876,486 | 
> 2,685,035,664 |
> java.lang.Object[]                                           |      78,026 |  
>  449,929,656 |
> char[]                                                       |      91,295 |  
>    6,241,744 |
> java.util.ArrayList                                          |     171,126 |  
>    4,107,024 |
> java.util.HashMap                                            |      71,135 |  
>    3,414,480 |
> java.lang.String                                             |      91,161 |  
>    2,187,864 |
> java.util.concurrent.ConcurrentHashMap$Node                  |      59,614 |  
>    1,907,648 |
> java.util.concurrent.atomic.LongAdder                        |      53,021 |  
>    1,696,672 |
> org.apache.hadoop.hive.metastore.api.Partition               |      22,374 |  
>    1,610,928 |
> com.codahale.metrics.EWMA                                    |      30,780 |  
>    1,477,440 |
> com.codahale.metrics.LongAdderProxy$JdkProvider$1            |      53,021 |  
>    1,272,504 |
> org.apache.hadoop.hive.metastore.api.StorageDescriptor       |      22,376 |  
>    1,253,056 |
> java.util.Hashtable$Entry                                    |      36,921 |  
>    1,181,472 |
> java.util.concurrent.atomic.AtomicLong                       |      39,444 |  
>      946,656 |
> org.apache.hadoop.hive.metastore.api.SerDeInfo               |      22,375 |  
>      895,000 |
> byte[]                                                       |       1,686 |  
>      668,480 |
> java.util.concurrent.ConcurrentHashMap$Node[]                |       1,874 |  
>      639,824 |
> com.codahale.metrics.ExponentiallyDecayingReservoir          |      10,259 |  
>      574,504 |
> java.util.HashMap$Node                                       |      17,776 |  
>      568,832 |
> org.apache.hadoop.hive.metastore.api.SkewedInfo              |      22,375 |  
>      537,000 |
> com.codahale.metrics.Meter                                   |      10,260 |  
>      492,480 |
> java.util.concurrent.ConcurrentSkipListMap                   |      10,260 |  
>      492,480 |
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync|      10,259 |  
>      492,432 |
> org.apache.impala.catalog.ColumnStats                        |       5,003 |  
>      400,240 |                 
> Total: 24 of 6,158 entries; 6,130 more                       | 113,007,927 | 
> 3,174,330,288 | {noformat}
> In the above case, these FieldSchema instances come from the list of 
> hmsPartitions that is created locally by 
> CatalogOpExecutor#alterTableRecoverPartitions(). The thread is 0x6d051abb8:
> !MAT_dominator_tree.png|width=825,height=308!
> Stacktrace:
> {noformat}
> Thread 0x6d051abb8
>   at 
> org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V
>  (StorageDescriptor.java:216)
>   at 
> org.apache.hadoop.hive.metastore.api.StorageDescriptor.deepCopy()Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;
>  (StorageDescriptor.java:256)
>   at 
> org.apache.impala.service.CatalogOpExecutor.createHmsPartitionFromValues(Ljava/util/List;Lorg/apache/hadoop/hive/metastore/api/Table;Lorg/apache/impala/analysis/TableName;Ljava/lang/String;)Lorg/apache/hadoop/hive/metastore/api/Partition;
>  (CatalogOpExecutor.java:5787)
>   at 
> org.apache.impala.service.CatalogOpExecutor.alterTableRecoverPartitions(Lorg/apache/impala/catalog/Table;Ljava/lang/String;)V
>  (CatalogOpExecutor.java:5678)
>   at 
> org.apache.impala.service.CatalogOpExecutor.alterTable(Lorg/apache/impala/thrift/TAlterTableParams;Ljava/lang/String;ZLorg/apache/impala/thrift/TDdlExecResponse;)V
>  (CatalogOpExecutor.java:1208)
>   at 
> org.apache.impala.service.CatalogOpExecutor.execDdlRequest(Lorg/apache/impala/thrift/TDdlExecRequest;)Lorg/apache/impala/thrift/TDdlExecResponse;
>  (CatalogOpExecutor.java:419)
>   at org.apache.impala.service.JniCatalog.execDdl([B)[B 
> (JniCatalog.java:260){noformat}
> *How this happen*
> When creating the list of hmsPartitions, we deep copy the StorageDescriptor 
> which will also deep copy the column list:
> {code:java}
> alterTableRecoverPartitions()
> -> createHmsPartitionFromValues()
>    -> StorageDescriptor sd = msTbl.getSd().deepCopy();{code}
> Impala doesn't respect the partition level schema (by design), we should 
> share the list of FieldSchema across hmsPartitions.
> When loading partition metadata for such a table, we could also hit this 
> issue. The HMS API "get_partitions_by_names" returns the list of 
> hmsPartitions. Each of them reference a unique list of FieldSchemas. We 
> should deduplicate them to share the same column list. FWIW, attached the 
> heap analysis results for wide table loading OOM (4GB heap):
> !wide-table-loading-oom-histogram.png|width=623,height=695!
> !wide-table-loading-oom-top-consumers.png|width=1066,height=434!
> The FieldSchema instances come from the metadata loading thread:
> !wide-table-loading-oom-FieldSchema-path2root.png|width=1023,height=213!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to