Quanlong Huang created IMPALA-13818:
---------------------------------------

             Summary: Wasting space due to duplicated TColumnType between THdfsPartitions
                 Key: IMPALA-13818
                 URL: https://issues.apache.org/jira/browse/IMPALA-13818
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Frontend
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang
         Attachments: TDescriptorTable.png, THdfsPartition_mem_footprint-1.png

THdfsPartition is used in at least two places:
 * In the legacy catalog mode, catalogd uses it to propagate metadata to coordinators.
 * In TQueryExecRequest, it's used in the tableDescriptors sent to the backend as part of query plans and fragments.

It has a list of expressions to represent the partition values:
{code:java}
struct THdfsPartition {
  // These are Literal expressions
  7: list<Exprs.TExpr> partitionKeyExprs
  ...
}
{code}
E.g. here is a partition (year=2009, month=1) of the functional.alltypes table:
{noformat}
THdfsPartition {
  07: partitionKeyExprs (list) = list<struct>[2] {
    [0] = TExpr {
      01: nodes (list) = list<struct>[1] {
        [0] = TExprNode {
          01: node_type (i32) = 2,
          02: type (struct) = TColumnType {
            01: types (list) = list<struct>[1] {
              [0] = TTypeNode {
                01: type (i32) = 0,
                02: scalar_type (struct) = TScalarType {
                  01: type (i32) = 5,
                },
              },
            },
          },
          03: num_children (i32) = 0,
          04: is_constant (bool) = true,
          11: int_literal (struct) = TIntLiteral {
            01: value (i64) = 2009,
          },
          23: is_codegen_disabled (bool) = false,
        },
      },
    },
    [1] = TExpr {
      01: nodes (list) = list<struct>[1] {
        [0] = TExprNode {
          01: node_type (i32) = 2,
          02: type (struct) = TColumnType {
            01: types (list) = list<struct>[1] {
              [0] = TTypeNode {
                01: type (i32) = 0,
                02: scalar_type (struct) = TScalarType {
                  01: type (i32) = 5,
                },
              },
            },
          },
          03: num_children (i32) = 0,
          04: is_constant (bool) = true,
          11: int_literal (struct) = TIntLiteral {
            01: value (i64) = 1,
          },
          23: is_codegen_disabled (bool) = false,
        },
      },
    },
  },
  09: file_desc (list) = list<struct>[1] {
    [0] = THdfsFileDesc {
      01: file_desc_data (string) = "\x18\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0e\x00 \x00\x1c\x00\x10\x00\x00\x00\b\x00\x04\x00\x0e\x00\x00\x00\x1c\x00\x00\x00\xce\x80\x17\xf3\x93\x01\x00\x00\xd1O\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\f\x00\x00\x00\x01\x00\x00\x00 \x00\x00\x00\n\x00\x00\x00090101.txt\x00\x00\f\x00\x14\x00\x00\x00\f\x00\b\x00\x04\x00\f\x00\x00\x00\x10\x00\x00\x00\x18\x00\x00\x00\xd1O\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x01\x00\x02\x00\x00\x00\x03\x00\x00\x00\x02\x00\x01\x00\x00\x00\x00\x00",
    },
  },
  10: location (struct) = THdfsPartitionLocation {
    01: prefix_index (i32) = 0,
    02: suffix (string) = "year=2009/month=1",
  },
  11: access_level (i32) = 1,
  12: stats (struct) = TTableStats {
    01: num_rows (i64) = 310,
  },
  13: is_marked_cached (bool) = false,
  14: id (i64) = 1,
  15: hms_parameters (map) = map<string,string>[8] {
    "STATS_GENERATED" -> "TASK",
    "impala.events.catalogServiceId" -> "3378db64f1864d2a:afdb738fe029883c",
    "impala.events.catalogVersion" -> "12277",
    "numFiles" -> "1",
    "numFilesErasureCoded" -> "0",
    "numRows" -> "310",
    "totalSize" -> "20433",
    "transient_lastDdlTime" -> "1734950224",
  },
  16: num_blocks (i64) = 1,
  17: total_file_size_bytes (i64) = 20433,
  19: has_incremental_stats (bool) = false,
  25: partition_name (string) = "year=2009/month=1",
  26: prev_id (i64) = -1,
  27: hdfs_storage_descriptor (struct) = THdfsStorageDescriptor {
    01: lineDelim (byte) = 0x0a,
    02: fieldDelim (byte) = 0x2c,
    03: collectionDelim (byte) = 0x2c,
    04: mapKeyDelim (byte) = 0x2c,
    05: escapeChar (byte) = 0x5c,
    06: quoteChar (byte) = 0x2c,
    07: fileFormat (i32) = 0,
    08: blockSize (i32) = 0,
  },
}
{noformat}
Each TExprNode in partitionKeyExprs carries its own TColumnType, so the partition column types are repeated in every partition. This wastes memory and network bandwidth.

Here is an example from a heap dump:

!THdfsPartition_mem_footprint.png|width=655,height=447!

The query reads a single table of 6M partitions and 6M files.
It hits an OutOfMemoryError because serializing the DescriptorTable exceeds the 2GB JVM byte array limit:
{noformat}
java.lang.Thread @ 0x7fdf6d220768 : Thread-12
  at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
  at java.util.Arrays.copyOf([BI)[B (Arrays.java:3236)
  at java.io.ByteArrayOutputStream.toByteArray()[B (ByteArrayOutputStream.java:191)
  at org.apache.thrift.TSerializer.serialize(Lorg/apache/thrift/TBase;)[B (TSerializer.java:85)
  at org.apache.impala.common.JniUtil.serializeToThrift(Lorg/apache/thrift/TBase;)[B (JniUtil.java:109)
  at org.apache.impala.analysis.DescriptorTable.toSerializedThrift()Lorg/apache/impala/thrift/TDescriptorTableSerialized; (DescriptorTable.java:260)
  at org.apache.impala.service.Frontend.getPlannedExecRequest(Lorg/apache/impala/service/Frontend$PlanCtx;Lorg/apache/impala/analysis/AnalysisContext$AnalysisResult;Lorg/apache/impala/util/EventSequence;)Lorg/apache/impala/thrift/TQueryExecRequest; (Frontend.java:3138)
  at org.apache.impala.service.Frontend.doCreateExecRequest(Lorg/apache/impala/service/Frontend$PlanCtx;Ljava/util/List;Lorg/apache/impala/util/EventSequence;)Lorg/apache/impala/thrift/TExecRequest; (Frontend.java:2893)
  at org.apache.impala.service.Frontend.getTExecRequest(Lorg/apache/impala/service/Frontend$PlanCtx;Lorg/apache/impala/util/EventSequence;)Lorg/apache/impala/thrift/TExecRequest; (Frontend.java:2410)
  at org.apache.impala.service.Frontend.createExecRequest(Lorg/apache/impala/service/Frontend$PlanCtx;)Lorg/apache/impala/thrift/TExecRequest; (Frontend.java:2036)
  at org.apache.impala.service.JniFrontend.createExecRequest([B)[B (JniFrontend.java:175)
{noformat}
The TDescriptorTable objects occupy 9.67GB of memory. The duplicated partition column types waste around 6.7GB (6M * 2 * 600 bytes) of it.

!TDescriptorTable.png|width=960,height=605!
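
For reference, the estimate of the duplicated TColumnType footprint can be reproduced with a few lines of Java. This is only a sketch of the arithmetic: the class and method names are hypothetical, the 600 bytes per duplicated column type is taken from the heap dump figure quoted in this report, and "GB" is assumed to mean 2^30 bytes. With 6M partitions and 2 partition columns the product is 7,200,000,000 bytes, i.e. roughly 6.7GB.

```java
// Hypothetical helper, not part of Impala: reproduces the back-of-the-envelope
// estimate of memory spent on per-partition copies of the partition column types.
final class PartitionTypeOverhead {
    private PartitionTypeOverhead() {}

    // Bytes used by duplicated column types:
    // partitions * partition columns * bytes per duplicated TColumnType.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long wastedBytes(long partitions, int partitionColumns, long bytesPerColumnType) {
        long typeCopies = Math.multiplyExact(partitions, (long) partitionColumns);
        return Math.multiplyExact(typeCopies, bytesPerColumnType);
    }

    // Converts bytes to binary gigabytes (assumption: GB here means 2^30 bytes).
    static double toGiB(long bytes) {
        return bytes / (double) (1L << 30);
    }

    public static void main(String[] args) {
        // Inputs from the report: 6M partitions, 2 partition columns (year, month),
        // about 600 bytes per duplicated column type.
        long wasted = wastedBytes(6_000_000L, 2, 600L);
        System.out.println("wasted bytes: " + wasted);
        System.out.println("wasted GiB:   " + toGiB(wasted));
    }
}
```

The same helper can be used to gauge other tables by plugging in their partition count and number of partition columns; the per-type size would need to be measured again if the column types differ from the INT columns in this example.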