zhangqianqiong created IMPALA-13385:
---------------------------------------
Summary: The response of DDL exceeds VM limit when serializing the catalog
Key: IMPALA-13385
URL: https://issues.apache.org/jira/browse/IMPALA-13385
Project: IMPALA
Issue Type: Bug
Components: Catalog
Affects Versions: Impala 4.4.1, Impala 4.4.0, Impala 4.3.0, Impala 4.1.2,
Impala 4.1.1, Impala 4.2.0, Impala 4.1.0, Impala 4.0.0
Reporter: zhangqianqiong
Attachments: 企业微信截图_3c4cd519-c64b-45d1-b0f1-889fff752f62.png
At present, when catalogd responds to DDL operations, it sends back the
entire table object. This can lead to a massive transfer of table metadata when
dealing with partitioned Hive tables. In one of our customer's clusters,
there is a partitioned Hive table with over 4,000 columns, more than 20,000
partitions, and over 10 million HDFS files. When executing an `ALTER
TABLE ADD PARTITION` operation on this table, the serialized table metadata
exceeds the Java array size limit, resulting in the following
exception: `java.lang.OutOfMemoryError: Requested array size exceeds VM limit`.
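For context, catalogd materializes the whole Thrift response into a single
Java byte[], and Java arrays are capped slightly below Integer.MAX_VALUE
elements (about 2 GiB), which is the limit the error above refers to. Below is
a minimal sketch of that serialization path, assuming a generic Thrift
response object (the method name is hypothetical, not Impala's actual API):

```java
import org.apache.thrift.TBase;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class DdlResponseSerialization {
  // The entire response becomes one byte[]. Once the serialized table
  // metadata approaches the JVM array size limit, allocation fails with
  // "OutOfMemoryError: Requested array size exceeds VM limit".
  static byte[] serializeResponse(TBase<?, ?> ddlResponse) throws TException {
    return new TSerializer(new TBinaryProtocol.Factory()).serialize(ddlResponse);
  }
}
```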
To alleviate the issue, we can use TCompactProtocol instead of
TBinaryProtocol for Thrift serialization. In an experiment with a Hive table
containing 160 partitions, I observed that using TCompactProtocol reduces
the serialized data size by 34.4% compared to TBinaryProtocol.
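As an illustration, a small harness like the following can reproduce that kind
of size comparison on any generated Thrift struct; it is a sketch, not Impala
code:

```java
import org.apache.thrift.TBase;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.protocol.TProtocolFactory;

public class ProtocolSizeComparison {
  // Serialize the same object with a given protocol and return the payload size.
  static int serializedSize(TBase<?, ?> obj, TProtocolFactory factory)
      throws TException {
    return new TSerializer(factory).serialize(obj).length;
  }

  // Compare TBinaryProtocol against TCompactProtocol for one object,
  // e.g. the Thrift representation of a partitioned table.
  static void compare(TBase<?, ?> obj) throws TException {
    int binary = serializedSize(obj, new TBinaryProtocol.Factory());
    int compact = serializedSize(obj, new TCompactProtocol.Factory());
    System.out.printf("binary=%d bytes, compact=%d bytes (%.1f%% smaller)%n",
        binary, compact, 100.0 * (binary - compact) / binary);
  }
}
```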
Here are potential solutions for addressing this issue:
1. DDL operations only: Use TCompactProtocol for serializing the table catalog
during ExecDdl operations. This would involve fewer changes but requires
adjustments to JniUtil; see the sketch after this list.
2. Global replacement with TCompactProtocol: Replace all serialization
operations within Impala with TCompactProtocol. Although this is a larger
change, the overall code becomes cleaner. In 329 internal benchmark tests, I
found no significant performance degradation compared to the previous
implementation, and memory usage was reduced.
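To make option 1 concrete, the change could amount to letting the
serialization helper accept a protocol factory, so DDL responses opt into
TCompactProtocol while other call sites keep the current wire format; option 2
would then just switch the default. This is a sketch under those assumptions
(the helper mirrors the role of JniUtil, but the signatures here are
hypothetical, not Impala's actual API):

```java
import org.apache.thrift.TBase;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.protocol.TProtocolFactory;

public final class ThriftSerializationHelper {
  // Option 2 (global replacement) would amount to switching this single
  // default to new TCompactProtocol.Factory().
  private static final TProtocolFactory DEFAULT_FACTORY =
      new TBinaryProtocol.Factory();

  // Existing call sites keep the old wire format.
  public static byte[] serializeToThrift(TBase<?, ?> obj) throws TException {
    return serializeToThrift(obj, DEFAULT_FACTORY);
  }

  // New overload: DDL response paths can pass new TCompactProtocol.Factory()
  // to shrink the payload (option 1).
  public static byte[] serializeToThrift(TBase<?, ?> obj,
      TProtocolFactory factory) throws TException {
    return new TSerializer(factory).serialize(obj);
  }
}
```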
I would like to get some feedback from you.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)