[
https://issues.apache.org/jira/browse/IMPALA-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629410#comment-16629410
]
Paul Rogers commented on IMPALA-7501:
-------------------------------------
Analysis:
* Impala's {{LocalCatalog}} contains a list of {{FeDb}} objects.
* Impala's {{LocalDb}}, which implements {{FeDb}}, contains a map of {{LocalTable}}
objects.
* Impala's {{LocalTable}} contains a Hive {{Table}} object.
* Hive's
[{{Table}}|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java]
is a Hive-defined class, which contains a {{TableSpec}}.
* Hive's
[{{TableSpec}}|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java]
contains a list of {{Partition}} objects.
* Hive's
[{{Partition}}|https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/Partition.java]
is generated from Thrift and contains a {{StorageDescriptor}}.
* Hive's
[{{StorageDescriptor}}|https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/StorageDescriptor.java]
contains the list of {{FieldSchema}} objects which Todd saw in the heap dump.
The [Hive Thrift
schema|https://github.com/apache/hive/blob/3287a097e31063cc805ca55c2ca7defffe761b6f/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift]
is an easier way to visualize the Hive part of the above analysis.
A quick scan of the Hive code suggests that Hive's Thrift objects carry more
info than is required in the Impala cache. Creating Impala-specific,
high-performance versions would likely save space. (No need for parent
pointers, no need for the two-level Hive API structure, etc.)
So, this gives us two options:
* Reach inside Hive's Thrift objects to null out fields which we don't need, or
* Design an Impala-specific, compact representation for the data that omits all
but essential objects and fields.
The second choice provides a huge opportunity for memory optimization. The
first is a crude-but-effective short-term solution.
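To make the two options concrete, here is a minimal Java sketch. The classes in it are simplified stand-ins written for illustration, not the real Hive Thrift classes or any Impala class, and the choice of which fields count as "essential" (partition values, location, input format) is an assumption rather than something verified against the planner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: every class here is a hypothetical stand-in. */
public class PartitionSlimmingSketch {

  /** Stand-in for Hive's Thrift-generated FieldSchema. */
  static class FieldSchema {
    String name;
    String type;
    String comment;
    FieldSchema(String name, String type, String comment) {
      this.name = name;
      this.type = type;
      this.comment = comment;
    }
  }

  /** Stand-in for Hive's StorageDescriptor; only a few fields are modeled. */
  static class StorageDescriptor {
    List<FieldSchema> cols; // per-partition copy of the column list
    String location;
    String inputFormat;
  }

  /** Stand-in for Hive's Partition. */
  static class Partition {
    List<String> values;
    StorageDescriptor sd;
  }

  /** Option 1: keep the Hive-shaped object, but drop the column list in place. */
  static void nullOutUnusedFields(Partition p) {
    if (p.sd != null) {
      p.sd.cols = null; // assumed to be unused at the partition level
    }
  }

  /** Option 2: a hypothetical compact form that copies only assumed-essential fields. */
  static final class CompactPartition {
    final String[] values;
    final String location;
    final String inputFormat;

    CompactPartition(Partition p) {
      this.values = p.values.toArray(new String[0]);
      this.location = p.sd.location;
      this.inputFormat = p.sd.inputFormat;
    }
  }

  /** Builds a sample partition carrying the given number of columns. */
  static Partition samplePartition(int numCols) {
    Partition p = new Partition();
    p.values = Arrays.asList("2018", "09");
    p.sd = new StorageDescriptor();
    p.sd.location = "hdfs://example/warehouse/t/year=2018/month=09"; // made-up path
    p.sd.inputFormat = "parquet"; // placeholder value
    p.sd.cols = new ArrayList<>();
    for (int i = 0; i < numCols; i++) {
      p.sd.cols.add(new FieldSchema("c" + i, "string", null));
    }
    return p;
  }

  public static void main(String[] args) {
    Partition p = samplePartition(100);
    System.out.println("columns before: " + p.sd.cols.size());

    nullOutUnusedFields(p);
    System.out.println("columns dropped by option 1: " + (p.sd.cols == null));

    CompactPartition c = new CompactPartition(p);
    System.out.println("option 2 keeps " + c.values.length
        + " partition values and location " + c.location);
  }
}
```

Option 1 leaves the surrounding object graph intact and only removes the per-partition column list, so it is a small change. Option 2 replaces the whole graph with one flat object per partition, which also cuts the object count.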
> Slim down metastore Partition objects in LocalCatalog cache
> -----------------------------------------------------------
>
> Key: IMPALA-7501
> URL: https://issues.apache.org/jira/browse/IMPALA-7501
> Project: IMPALA
> Issue Type: Sub-task
> Reporter: Todd Lipcon
> Priority: Minor
>
> I took a heap dump of an impalad running in LocalCatalog mode with a 2G limit
> after running a production workload simulation for a couple hours. It had
> 38.5M objects and 2.02GB heap (the vast majority of the heap is, as expected,
> in the LocalCatalog cache). Of this total footprint, 1.78GB and 34.6M objects
> are retained by 'Partition' objects. Drilling into those, 1.29GB and 33.6M
> objects are retained by FieldSchema, which, as far as I remember, are ignored
> on the partition level by the Impala planner. So, with a bit of slimming down
> of these objects, we could make a huge dent in effective cache capacity given
> a fixed budget. Reducing object count should also have the effect of improved
> GC performance (old gen GC is more closely tied to object count than size).