[Impala-ASF-CR] IMPALA-13385: Compact the response of ddl operations

Csaba Ringhofer (Code Review) Mon, 14 Oct 2024 07:43:22 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21845 )


Change subject: IMPALA-13385: Compact the response of ddl operations
......................................................................


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@16
PS6, Line 16: the following exception: `java.long.OutOfMemoryError: Requested 
array size
            : exceeds VM limit`.
Does the change solve the problem completely (when compact protocol is enough 
to move the array below 2GB)? My concern is that other parts of the code may 
also have problems with >2GB arrays:
1. cpp Thrift serialization
2. serializing the table to JVM array in the coordinator before updating 
Frontend's catalog

I am mainly worried about 2, as we don't seem to use compact protocol there:
https://github.com/apache/impala/blob/2535e79491078a0353dbeed1a094e91366906149/be/src/catalog/catalog-util.h#L82

serializer_(false) means that the protocol is not compact


http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@19
PS6, Line 19: To alleviate the issue, we can use TCompactProtocol instead of
            : TBinaryProtocol for thrift serialization.
This is only done when returning the response object from Java to C++, and 
protocol of the actual data sent through tcp doesn't change, right? This could 
be made clearer in the commit message.


http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@23
PS6, Line 23:    1. I built the 1k_col_tbl table using the scale_test_metadata 
script.
Did you also look at the CPU effect of this? My assumption is that compact 
protocol is a bit slower than binary, otherwise it would be the default.

If this makes things slower, then this side effect could be avoided by 
serializing with binary first, and only use compact if the first attempt threw 
an OutOfMemoryError. (not sure if this worth the additional complexity).

Also, if it turns out compact protocol is not really slower, then it could be 
used everywhere in execAndSerialize().


http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@26
PS6, Line 26:  9.5%
An idea to help in this situation is to compress the data before writing it to 
a byte array in Java. This shouldn't be hard to implement, TSerializer simply 
writes the bytes to a stream, which could be a GZIPOutputStream backed by a 
ByteArrayOutputStream
https://github.com/apache/thrift/blob/a4d458fdf3668acb6823d42d97dcba62b48bd6af/lib/javame/src/org/apache/thrift/TSerializer.java#L79

The cpp side could decompress it to a native array before deserialization.

The issue with this compression is that it would probably have a major 
performance impact.


http://gerrit.cloudera.org:8080/#/c/21845/6/fe/src/main/java/org/apache/impala/service/JniCatalog.java
File fe/src/main/java/org/apache/impala/service/JniCatalog.java:

http://gerrit.cloudera.org:8080/#/c/21845/6/fe/src/main/java/org/apache/impala/service/JniCatalog.java@263
PS6, Line 263: execAndSerializeWithCompaction
WithCompactProtocol would make the intentation of this function clearer than 
WithCompaction.



--
To view, visit http://gerrit.cloudera.org:8080/21845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idea9313c7f1f1596f3620e60b08a99efc7fa0466
Gerrit-Change-Number: 21845
Gerrit-PatchSet: 6
Gerrit-Owner: zhangqianqiong <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: zhangqianqiong <[email protected]>
Gerrit-Comment-Date: Mon, 14 Oct 2024 14:39:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-13385: Compact the response of ddl operations

Reply via email to