[ https://issues.apache.org/jira/browse/IMPALA-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757633#comment-16757633 ]
ASF subversion and git services commented on IMPALA-7540:
---------------------------------------------------------

Commit f20a03a7b1bc2a9bb6cd8b54b8afb9ce384538f1 in impala's branch refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f20a03a ]

IMPALA-7540. Intern most repetitive strings and network addresses in catalog

This adds interning to a bunch of repeated strings in catalog objects,
including:
- table name
- DB name
- owner
- column names
- input/output formats
- parameter keys
- common parameter values ("true", "false", etc)
- HBase column family names

Additionally, it interns TNetworkAddresses, so that each datanode host
is only stored once rather than having its own copy in each table.

I verified this patch using jxray on the development catalogd and
impalad. The following lines are removed entirely from the "duplicate
strings" report:

  Overhead       # char[]s  # objects  Value
  164K (0.3%)    2,635      2,635      "127.0.0.1"
  97K (0.2%)     1,038      1,038      "__HIVE_DEFAULT_PARTITION__"
  95K (0.2%)     1,111      1,111      "transient_lastDdlTime"
  92K (0.1%)     1,975      1,975      "d"
  70K (0.1%)     997        997        "EXTERNAL_TABLE"
  56K (< 0.1%)   1,201      1,201      "todd"
  54K (< 0.1%)   998        998        "EXTERNAL"
  46K (< 0.1%)   998        998        "TRUE"
  44K (< 0.1%)   567        567        "numFilesErasureCoded"
  38K (< 0.1%)   612        612        "totalSize"
  30K (< 0.1%)   567        567        "numFiles"

The following are reduced substantially:

  Before:  72K (0.1%)    1,543  1,543  "1"
  After:   47K (< 0.1%)  1,009  1,009  "1"

A few large strings remain in the report that may be worth addressing,
depending on whether we think production catalogs exhibit the same
repetitions:

1) Avro schemas, eg:

  204K (0.3%)  3  3  "{"fields": [{"type": ["boolean", "null"], "name": "bool_col1"}, {"type": ["int", "null"], "name": "tinyint_col1"}, {"type": ...[length 52429]"

(in the development catalog there are multiple tables with the same Avro
schema)

2) Partition location suffixes, eg:

  144K (0.2%)    1,234  1,234  "many_blocks_num_blocks_per_partition_1"
  17K (< 0.1%)   230    230    "year=2009/month=2"
  17K (< 0.1%)   230    230    "year=2009/month=3"
  17K (< 0.1%)   230    230    "year=2009/month=1"

(in the development catalog lots of tables have the same partitioning
layout)

3) Unsure (jxray isn't reporting the reference chain, but seems likely to
be partition values):

  49K (< 0.1%)   1,058  1,058  "2010"
  28K (< 0.1%)   612    612    "2009"
  27K (< 0.1%)   585    585    "0"
  22K (< 0.1%)   71     899    ""

Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Reviewed-on: http://gerrit.cloudera.org:8080/11158
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Intern common strings in catalog
> --------------------------------
>
>                 Key: IMPALA-7540
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7540
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 3.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>
> Using jxray shows that there are many common duplicate strings in the
> catalog. For example, each table repeats the database name, and metadata like
> the HMS parameter maps reuse a lot of common strings like "EXTERNAL" or
> "transient_lastDdlTime". We should intern these to save memory.
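
For readers unfamiliar with the technique, here is a minimal sketch of what interning catalog strings might look like. It is not the code from the commit: the class name InternSketch, the COMMON_VALUES set, and the internParams helper are all hypothetical, and the pool here is a plain ConcurrentHashMap that holds strong references. The idea is only that equal strings (such as a repeated "EXTERNAL" key) end up sharing one object instead of one copy per table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; not Impala's actual interning code. */
public class InternSketch {
  // Canonical instance for each distinct string value seen so far.
  // Assumption: a strong-reference pool, which never shrinks.
  private static final ConcurrentHashMap<String, String> POOL =
      new ConcurrentHashMap<>();

  // Assumed set of "common" parameter values worth interning.
  private static final Set<String> COMMON_VALUES =
      new HashSet<>(Arrays.asList("true", "false", "TRUE", "FALSE"));

  /** Returns the canonical instance equal to s; null passes through. */
  public static String intern(String s) {
    if (s == null) return null;
    String prev = POOL.putIfAbsent(s, s);
    return prev == null ? s : prev;
  }

  /** Copies a parameter map, interning all keys and only common values. */
  public static Map<String, String> internParams(Map<String, String> params) {
    Map<String, String> out = new HashMap<>();
    for (Map.Entry<String, String> e : params.entrySet()) {
      String v = e.getValue();
      if (v != null && COMMON_VALUES.contains(v)) v = intern(v);
      out.put(intern(e.getKey()), v);
    }
    return out;
  }

  public static void main(String[] args) {
    String a = new String("transient_lastDdlTime");
    String b = new String("transient_lastDdlTime");
    System.out.println(a == b);                  // false: two separate copies
    System.out.println(intern(a) == intern(b));  // true: one shared instance
  }
}
```

Because this pool keeps every string alive forever, a long-running process would more plausibly use a weak interner (for example Guava's Interners.newWeakInterner()) so that strings no longer referenced by any catalog object can be garbage collected.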