Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/21845 )
Change subject: IMPALA-13385: Compact the response of ddl operations ...................................................................... Patch Set 6: (5 comments) http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@16 PS6, Line 16: the following exception: `java.long.OutOfMemoryError: Requested array size : exceeds VM limit`. Does the change solve the problem completely (when compact protocol is enough to move the array below 2GB)? My concern is that other parts of the code may also have problems with >2GB arrays: 1. cpp Thrift serialization 2. serializing the table to JVM array in the coordinator before updating Frontend's catalog I am mainly worried about 2, as we don't seem to use compact protocol there: https://github.com/apache/impala/blob/2535e79491078a0353dbeed1a094e91366906149/be/src/catalog/catalog-util.h#L82 serializer_(false) means that the protocol is not compact http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@19 PS6, Line 19: To alleviate the issue, we can use TCompactProtocol instead of : TBinaryProtocol for thrift serialization. This is only done when returning the response object from Java to C++, and protocol of the actual data sent through tcp doesn't change, right? This could be made clearer in the commit message. http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@23 PS6, Line 23: 1. I built the 1k_col_tbl table using the scale_test_metadata script. Did you also look at the CPU effect of this? My assumption is that compact protocol is a bit slower than binary, otherwise it would be the default. If this makes things slower, then this side effect could be avoided by serializing with binary first, and only use compact if the first attempt threw an OutOfMemoryError. (not sure if this worth the additional complexity). Also, if it turns out compact protocol is not really slower, then it could be used everywhere in execAndSerialize(). http://gerrit.cloudera.org:8080/#/c/21845/6//COMMIT_MSG@26 PS6, Line 26: 9.5% An idea to help in this situation is to compress the data before writing it to a byte array in Java. This shouldn't be hard to implement, TSerializer simply writes the bytes to a stream, which could be a GZIPOutputStream backed by a ByteArrayOutputStream https://github.com/apache/thrift/blob/a4d458fdf3668acb6823d42d97dcba62b48bd6af/lib/javame/src/org/apache/thrift/TSerializer.java#L79 The cpp side could decompress it to a native array before deserialization. The issue with this compression is that it would probably have a major performance impact. http://gerrit.cloudera.org:8080/#/c/21845/6/fe/src/main/java/org/apache/impala/service/JniCatalog.java File fe/src/main/java/org/apache/impala/service/JniCatalog.java: http://gerrit.cloudera.org:8080/#/c/21845/6/fe/src/main/java/org/apache/impala/service/JniCatalog.java@263 PS6, Line 263: execAndSerializeWithCompaction WithCompactProtocol would make the intentation of this function clearer than WithCompaction. -- To view, visit http://gerrit.cloudera.org:8080/21845 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idea9313c7f1f1596f3620e60b08a99efc7fa0466 Gerrit-Change-Number: 21845 Gerrit-PatchSet: 6 Gerrit-Owner: zhangqianqiong <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: zhangqianqiong <[email protected]> Gerrit-Comment-Date: Mon, 14 Oct 2024 14:39:07 +0000 Gerrit-HasComments: Yes
