Paul Rogers has uploaded a new patch set (#4) to the change originally created by Todd Lipcon. ( http://gerrit.cloudera.org:8080/11158 )
Change subject: IMPALA-7540. Intern most repetitive strings and network addresses in catalog
......................................................................

IMPALA-7540. Intern most repetitive strings and network addresses in catalog

This adds interning to a bunch of repeated strings in catalog objects,
including:
- table name
- DB name
- owner
- column names
- input/output formats
- parameter keys
- common parameter values ("true", "false", etc)
- HBase column family names

Additionally, it interns TNetworkAddresses, so that each datanode host
is only stored once rather than having its own copy in each table.

I verified this patch using jxray on the development catalogd and
impalad. The following lines are removed entirely from the "duplicate
strings" report:

  Overhead       # char[]s  # objects  Value
  164K (0.3%)    2,635      2,635      "127.0.0.1"
  97K (0.2%)     1,038      1,038      "__HIVE_DEFAULT_PARTITION__"
  95K (0.2%)     1,111      1,111      "transient_lastDdlTime"
  92K (0.1%)     1,975      1,975      "d"
  70K (0.1%)     997        997        "EXTERNAL_TABLE"
  56K (< 0.1%)   1,201      1,201      "todd"
  54K (< 0.1%)   998        998        "EXTERNAL"
  46K (< 0.1%)   998        998        "TRUE"
  44K (< 0.1%)   567        567        "numFilesErasureCoded"
  38K (< 0.1%)   612        612        "totalSize"
  30K (< 0.1%)   567        567        "numFiles"

The following are reduced substantially:

  Before: 72K (0.1%)     1,543      1,543      "1"
  After:  47K (< 0.1%)   1,009      1,009      "1"

A few large strings remain in the report that may be worth addressing,
depending on whether we think production catalogs exhibit the same
repetitions:

1) Avro schemas, eg:

  204K (0.3%)    3          3          "{"fields": [{"type": ["boolean", "null"], "name": "bool_col1"}, {"type": ["int", "null"], "name": "tinyint_col1"}, {"type": ...[length 52429]"

  (in the development catalog there are multiple tables with the same
  Avro schema)

2) Partition location suffixes, eg:

  144K (0.2%)    1,234      1,234      "many_blocks_num_blocks_per_partition_1"
  17K (< 0.1%)   230        230        "year=2009/month=2"
  17K (< 0.1%)   230        230        "year=2009/month=3"
  17K (< 0.1%)   230        230        "year=2009/month=1"

  (in the development catalog lots of tables have the same partitioning
  layout)

3) Unsure (jxray isn't reporting the reference chain, but seems likely
   to be partition values):

  49K (< 0.1%)   1,058      1,058      "2010"
  28K (< 0.1%)   612        612        "2009"
  27K (< 0.1%)   585        585        "0"
  22K (< 0.1%)   71         899        ""

Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
---
A fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java
M fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
5 files changed, 250 insertions(+), 6 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/11158/4
--
To view, visit http://gerrit.cloudera.org:8080/11158
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Gerrit-Change-Number: 11158
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com>
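
For readers unfamiliar with the technique named in the change subject, here is a minimal sketch of what interning means: equal values are collapsed onto one canonical instance so that duplicates can be garbage collected. This is not the code in CatalogInterners.java, whose contents are not shown in this message. The names InternSketch, Pool, and HostPort are hypothetical, and HostPort is only a stand-in for TNetworkAddress. The sketch also assumes a strong-reference pool for brevity; a long-lived catalog would probably prefer weak references so that unused entries can be reclaimed.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; not the CatalogInterners class from the patch.
final class InternSketch {

  // Maps each distinct value to its canonical instance.
  // Assumption: strong references, so entries are never evicted.
  static final class Pool<T> {
    private final ConcurrentMap<T, T> canonical = new ConcurrentHashMap<>();

    T intern(T value) {
      if (value == null) return null;
      // putIfAbsent returns the existing canonical instance, or null if
      // this value was just registered as the canonical one.
      T previous = canonical.putIfAbsent(value, value);
      return previous != null ? previous : value;
    }

    int size() {
      return canonical.size();
    }
  }

  // Hypothetical stand-in for a network address; immutable, with value equality.
  static final class HostPort {
    final String host;
    final int port;

    HostPort(String host, int port) {
      this.host = host;
      this.port = port;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof HostPort)) return false;
      HostPort other = (HostPort) o;
      return port == other.port && host.equals(other.host);
    }

    @Override
    public int hashCode() {
      return Objects.hash(host, port);
    }
  }

  public static void main(String[] args) {
    Pool<String> strings = new Pool<>();
    // Two equal strings that start out as separate heap objects.
    String a = new String("transient_lastDdlTime");
    String b = new String("transient_lastDdlTime");
    System.out.println(a == b);                                  // false
    System.out.println(strings.intern(a) == strings.intern(b));  // true

    Pool<HostPort> addresses = new Pool<>();
    HostPort x = addresses.intern(new HostPort("127.0.0.1", 50010));
    HostPort y = addresses.intern(new HostPort("127.0.0.1", 50010));
    System.out.println(x == y);                                  // true
  }
}
```

Interned values have to be treated as immutable, since every holder shares the same instance. Whether and how the patch enforces that for TNetworkAddress is not stated in the commit message.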