[ 
https://issues.apache.org/jira/browse/IMPALA-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650711#comment-16650711
 ] 

Paul Rogers commented on IMPALA-7501:
-------------------------------------

The analysis above suggests we can prune other unused objects as well. In the 
following Thrift definition, fields that may be unnecessary (this still needs 
to be verified) include:

* {{inputFormat}}
* {{outputFormat}}
* {{serdeInfo}}
* {{sortCols}} (assuming Impala ignores partition sort information)
* {{parameters}} (assuming Impala ignores parameters)

In fact, as has been discussed several times, it might be better to cache the 
Impala-specific partition objects which, presumably, already store only the 
attributes that Impala uses.

{code}
struct StorageDescriptor {
  1: list<FieldSchema> cols,  // required (refer to types defined above)
  2: string location,         // defaults to <warehouse loc>/<db loc>/tablename
  3: string inputFormat,      // SequenceFileInputFormat (binary) or TextInputFormat or custom format
  4: string outputFormat,     // SequenceFileOutputFormat (binary) or IgnoreKeyTextOutputFormat or custom format
  5: bool   compressed,       // compressed or not
  6: i32    numBuckets,       // this must be specified if there are any dimension columns
  7: SerDeInfo    serdeInfo,  // serialization and deserialization information
  8: list<string> bucketCols, // reducer grouping columns and clustering columns and bucketing columns
  9: list<Order>  sortCols,   // sort order of the data in each bucket
  10: map<string, string> parameters, // any user supplied key value hash
  11: optional SkewedInfo skewedInfo, // skewed information
  12: optional bool   storedAsSubDirectories       // stored as subdirectories or not
}
{code}
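
To make the pruning idea concrete, here is a minimal Java sketch. It does not use the real Thrift-generated Hive class: {{StorageDescriptorStub}}, {{slim()}} and the simplified field types are hypothetical stand-ins invented for illustration. Which fields are actually safe to drop is an assumption that still needs to be verified against what the planner reads.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SlimPartitionSketch {

  /** Hypothetical stand-in for the Thrift StorageDescriptor; types are simplified. */
  public static class StorageDescriptorStub {
    public List<String> cols;              // stands in for list<FieldSchema>
    public String location;
    public String inputFormat;
    public String outputFormat;
    public boolean compressed;
    public int numBuckets;
    public String serdeInfo;               // stands in for SerDeInfo
    public List<String> bucketCols;
    public List<String> sortCols;          // stands in for list<Order>
    public Map<String, String> parameters;
  }

  /**
   * Returns a copy that keeps only the fields assumed to be needed.
   * ASSUMPTION: cols, inputFormat, outputFormat, serdeInfo, sortCols and
   * parameters are unused at the partition level; this is unverified.
   */
  public static StorageDescriptorStub slim(StorageDescriptorStub sd) {
    StorageDescriptorStub out = new StorageDescriptorStub();
    out.location = sd.location;
    out.compressed = sd.compressed;
    out.numBuckets = sd.numBuckets;
    out.bucketCols = sd.bucketCols;
    // All other fields are left null so the cache no longer retains them.
    return out;
  }

  public static void main(String[] args) {
    StorageDescriptorStub sd = new StorageDescriptorStub();
    sd.cols = Arrays.asList("id:int", "name:string");
    sd.location = "/warehouse/db/tbl/p=1";   // made-up example path
    sd.inputFormat = "TextInputFormat";
    sd.parameters = new HashMap<>();
    sd.parameters.put("k", "v");

    StorageDescriptorStub slimmed = slim(sd);
    System.out.println("location kept: " + slimmed.location);
    System.out.println("cols dropped: " + (slimmed.cols == null));
  }
}
```

Copying into a fresh object, rather than clearing fields in place, leaves the descriptor returned by the metastore untouched. Caching an Impala-specific partition object, as suggested above, would follow the same pattern with a purpose-built class.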

> Slim down metastore Partition objects in LocalCatalog cache
> -----------------------------------------------------------
>
>                 Key: IMPALA-7501
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7501
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Todd Lipcon
>            Priority: Minor
>
> I took a heap dump of an impalad running in LocalCatalog mode with a 2G limit 
> after running a production workload simulation for a couple hours. It had 
> 38.5M objects and 2.02GB heap (the vast majority of the heap is, as expected, 
> in the LocalCatalog cache). Of this total footprint, 1.78GB and 34.6M objects 
> are retained by 'Partition' objects. Drilling into those, 1.29GB and 33.6M 
> objects are retained by FieldSchema objects, which, as far as I remember, are 
> ignored at the partition level by the Impala planner. So, by slimming down 
> these objects, we could substantially increase effective cache capacity for 
> a fixed budget. Reducing the object count should also improve GC performance 
> (old gen GC is more closely tied to object count than to size).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
