[ 
https://issues.apache.org/jira/browse/IMPALA-13385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883144#comment-17883144
 ] 

zhangqianqiong commented on IMPALA-13385:
-----------------------------------------

       Hello [~rizaon], thank you for your reply.

       Indeed, we are not using the local catalog mode. Our internal branch is 
based on version 4.0.0, and at that time the local catalog feature was not yet 
fully mature, so we didn't adopt it. Cherry-picking the patches from version 
4.1.0 would require a significant amount of work, not to mention the testing 
needed for the local catalog. Our customer can't wait for this feature to 
fully mature, so I am submitting a patch for the `DDL operations only` 
approach. Could you please help review it first? 

       As for Iceberg tables, our latest product version is developed based on 
them. However, even with Iceberg tables, in full catalog mode the DDL response 
would still serialize the whole table's catalog, right?

       cc: [~stigahuang], [~boroknagyz] 

 

> The response of ddl exceeds VM limit when serializing the catalog
> -----------------------------------------------------------------
>
>                 Key: IMPALA-13385
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13385
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 4.0.0, Impala 4.1.0, Impala 4.2.0, Impala 4.1.1, 
> Impala 4.1.2, Impala 4.3.0, Impala 4.4.0, Impala 4.4.1
>            Reporter: zhangqianqiong
>            Priority: Major
>         Attachments: 企业微信截图_3c4cd519-c64b-45d1-b0f1-889fff752f62.png
>
>
>          At present, when catalogd responds to DDL operations, it sends the 
> entire table object. This can lead to a massive transfer of table catalog 
> when dealing with Hive partitioned tables. In one of our customer's 
> clusters, there is a Hive partitioned table with over 4,000 columns, more 
> than 20,000 partitions, and over 10 million HDFS files. When 
> executing an `ALTER TABLE ADD PARTITION` operation on this table, the 
> serialized catalog for the table exceeds the Java array size limit, resulting 
> in the following exception: `java.lang.OutOfMemoryError: Requested array size 
> exceeds VM limit`.
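>         For context on the error itself: the JVM caps the length of any single 
> array at roughly Integer.MAX_VALUE, so once the serialized table no longer 
> fits in one byte[] the allocation fails regardless of heap size. A tiny 
> standalone illustration (not Impala code), on a typical HotSpot JVM:
>
>     // Requesting an array at the JVM's length ceiling reproduces the same
>     // error message that the oversized DDL response hits.
>     public class ArrayLimitDemo {
>       public static void main(String[] args) {
>         // On HotSpot the usable limit is slightly below Integer.MAX_VALUE, so
>         // this throws: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>         byte[] buffer = new byte[Integer.MAX_VALUE];
>         System.out.println(buffer.length);
>       }
>     }
>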
>         To alleviate the issue, we can use TCompactProtocol instead of 
> TBinaryProtocol for thrift serialization. In an experiment with a Hive table 
> containing 160 partitions, I observed that TCompactProtocol reduces the 
> serialized data size by 34.4% compared to TBinaryProtocol.
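>         A minimal sketch of what the substitution looks like (not the actual 
> patch; `TCatalogObject` stands in for whatever thrift object the DDL response 
> carries, and in Impala the serialization itself goes through helpers such as 
> JniUtil, as noted in the options below):
>
>     import org.apache.thrift.TException;
>     import org.apache.thrift.TSerializer;
>     import org.apache.thrift.protocol.TBinaryProtocol;
>     import org.apache.thrift.protocol.TCompactProtocol;
>     import org.apache.impala.thrift.TCatalogObject;
>
>     // Sketch only: serialize the same catalog object with both protocols and
>     // compare the resulting sizes. The 34.4% figure above came from a table
>     // with 160 partitions; wide, heavily partitioned tables should benefit more.
>     public final class SerializationSizeCompare {
>       public static byte[] serializeBinary(TCatalogObject obj) throws TException {
>         return new TSerializer(new TBinaryProtocol.Factory()).serialize(obj);
>       }
>
>       public static byte[] serializeCompact(TCatalogObject obj) throws TException {
>         return new TSerializer(new TCompactProtocol.Factory()).serialize(obj);
>       }
>
>       public static void report(TCatalogObject obj) throws TException {
>         long binary = serializeBinary(obj).length;
>         long compact = serializeCompact(obj).length;
>         System.out.printf("binary=%d bytes, compact=%d bytes (%.1f%% smaller)%n",
>             binary, compact, 100.0 * (binary - compact) / binary);
>       }
>     }
>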
>         Here are potential solutions for addressing this issue:
>         DDL operations only: Use TCompactProtocol for serializing the table 
> catalog during ExecDdl operations. This would involve fewer changes but 
> requires adjustments to JniUtil.
>         Global replacement with TCompactProtocol: Replace all thrift 
> serialization within Impala with TCompactProtocol. Although this is a larger 
> change, the overall code becomes cleaner. In 329 internal benchmark tests, I 
> found no significant performance degradation compared to the previous 
> implementation, and memory usage was reduced.
>        Looking forward to any feedback.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
