[ 
https://issues.apache.org/jira/browse/IMPALA-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757633#comment-16757633
 ] 

ASF subversion and git services commented on IMPALA-7540:
---------------------------------------------------------

Commit f20a03a7b1bc2a9bb6cd8b54b8afb9ce384538f1 in impala's branch 
refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f20a03a ]

IMPALA-7540. Intern most repetitive strings and network addresses in catalog

This adds interning for many of the repeated strings in catalog objects,
including:
- table name
- DB name
- owner
- column names
- input/output formats
- parameter keys
- common parameter values ("true", "false", etc.)
- HBase column family names

Additionally, it interns TNetworkAddresses, so that each datanode host
is only stored once rather than having its own copy in each table.
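
As a rough illustration of the technique (not the code in this commit), the
sketch below shows a minimal interner built only on the Java standard library.
The names InternSketch, Interner, HostPort, STRINGS and ADDRESSES, as well as
the port number, are hypothetical and chosen for this example; HostPort is a
simplified stand-in for a network address, not the real TNetworkAddress class.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

public class InternSketch {
    /** Generic interner: hands back one canonical instance per distinct value. */
    static final class Interner<T> {
        private final ConcurrentHashMap<T, T> pool = new ConcurrentHashMap<>();

        T intern(T value) {
            if (value == null) return null;
            // putIfAbsent returns the previously stored instance, or null if
            // this value is the first of its kind.
            T existing = pool.putIfAbsent(value, value);
            return existing != null ? existing : value;
        }
    }

    /** Hypothetical immutable host/port pair standing in for a network address. */
    static final class HostPort {
        final String host;
        final int port;

        HostPort(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof HostPort)) return false;
            HostPort other = (HostPort) o;
            return port == other.port && host.equals(other.host);
        }

        @Override public int hashCode() {
            return Objects.hash(host, port);
        }
    }

    static final Interner<String> STRINGS = new Interner<>();
    static final Interner<HostPort> ADDRESSES = new Interner<>();

    public static void main(String[] args) {
        // Two equal but distinct String objects collapse to one instance.
        String a = STRINGS.intern(new String("transient_lastDdlTime"));
        String b = STRINGS.intern(new String("transient_lastDdlTime"));
        System.out.println(a == b);   // prints "true"

        // The same applies to address objects shared across tables
        // (31000 is an arbitrary port picked for the example).
        HostPort h1 = ADDRESSES.intern(new HostPort("127.0.0.1", 31000));
        HostPort h2 = ADDRESSES.intern(new HostPort("127.0.0.1", 31000));
        System.out.println(h1 == h2); // prints "true"
    }
}
```

This sketch holds strong references, so interned values are never released,
and it assumes interned objects are treated as immutable once shared. A
long-running process would probably want a weak-reference pool instead, for
example Guava's Interners.newWeakInterner().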

I verified this patch using jxray on the development catalogd and
impalad. The following lines are removed entirely from the "duplicate
strings" report:

 Overhead        # char[]s  # objects  Value
 164K (0.3%)     2,635      2,635      "127.0.0.1"
 97K (0.2%)      1,038      1,038      "__HIVE_DEFAULT_PARTITION__"
 95K (0.2%)      1,111      1,111      "transient_lastDdlTime"
 92K (0.1%)      1,975      1,975      "d"
 70K (0.1%)      997        997        "EXTERNAL_TABLE"
 56K (< 0.1%)    1,201      1,201      "todd"
 54K (< 0.1%)    998        998        "EXTERNAL"
 46K (< 0.1%)    998        998        "TRUE"
 44K (< 0.1%)    567        567        "numFilesErasureCoded"
 38K (< 0.1%)    612        612        "totalSize"
 30K (< 0.1%)    567        567        "numFiles"

The following are reduced substantially:

Before: 72K (0.1%)      1,543   1,543  "1"
After:  47K (< 0.1%)    1,009   1,009  "1"

A few large strings remain in the report that may be worth addressing, depending
on whether we think production catalogs exhibit the same repetitions:

1) Avro schemas, e.g.:
 204K (0.3%)     3       3      "{"fields": [{"type": ["boolean", "null"], 
"name": "bool_col1"}, {"type": ["int", "null"], "name": "tinyint_col1"}, 
{"type": ...[length 52429]"

(in the development catalog there are multiple tables with the same Avro
schema)

2) Partition location suffixes, e.g.:
 144K (0.2%)     1,234   1,234  "many_blocks_num_blocks_per_partition_1"
 17K (< 0.1%)    230     230    "year=2009/month=2"
 17K (< 0.1%)    230     230    "year=2009/month=3"
 17K (< 0.1%)    230     230    "year=2009/month=1"

(in the development catalog lots of tables have the same partitioning
layout)

3) Strings of unknown origin (jxray isn't reporting the reference chain,
   but these seem likely to be partition values):
 49K (< 0.1%)    1,058   1,058  "2010"
 28K (< 0.1%)    612     612    "2009"
 27K (< 0.1%)    585     585    "0"
 22K (< 0.1%)    71      899    ""

Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Reviewed-on: http://gerrit.cloudera.org:8080/11158
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Intern common strings in catalog
> --------------------------------
>
>                 Key: IMPALA-7540
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7540
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 3.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>
> Using jxray shows that there are many common duplicate strings in the 
> catalog. For example, each table repeats the database name, and metadata like 
> the HMS parameter maps reuse a lot of common strings like "EXTERNAL" or 
> "transient_lastDdlTime". We should intern these to save memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

