Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8825 )
Change subject: IMPALA-5990: End-to-end compression of metadata ...................................................................... Patch Set 2: (10 comments) Hi, I've not done a deep review of the protocol changes, but took a look through to understand new things for myself. I think it would be useful to write the boring test for the C++ CompressCatalogObject() and DecompressCatalogObject() functions. We could check that things like the empty string go through this just fine. (I have no reason to believe they wouldn't.) We should also decide on our approach to Thrift structs. My biases are to keep them compatible (i.e., don't re-use numbers, keep around deprecated fields forever), but we should decide explicitly, and perhaps update the style guide. http://gerrit.cloudera.org:8080/#/c/8825/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8825/2//COMMIT_MSG@17 PS2, Line 17: - A single catalog object cannot be larger than ~2GB on openjdk 8. It This is probably just "jdk 8" (rather than Open). A lot of people run Impala using the Oracle JDK, so it'd be good to know that Oracle and OpenJDK agree here. http://gerrit.cloudera.org:8080/#/c/8825/2//COMMIT_MSG@23 PS2, Line 23: Testing: Ran existing tests. Manually tested with a 1.95GB catalog object Was this painful to produce? Could we check in a test that creates a huge object and tries to force this to happen, or is it prohibitively expensive? http://gerrit.cloudera.org:8080/#/c/8825/2/be/src/catalog/catalog-server.cc File be/src/catalog/catalog-server.cc: http://gerrit.cloudera.org:8080/#/c/8825/2/be/src/catalog/catalog-server.cc@384 PS2, Line 384: VLOG(1) << "Publishing " << (deleted ? "deletion " : "update ") << ": " << item.key; Would logging size (perhaps both compressed and uncompressed) here be useful? The old log also had a catalog version; I think that's useful to keep. http://gerrit.cloudera.org:8080/#/c/8825/2/be/src/catalog/catalog-util.h File be/src/catalog/catalog-util.h: http://gerrit.cloudera.org:8080/#/c/8825/2/be/src/catalog/catalog-util.h@41 PS2, Line 41: /// 'catalog_object'. Stores the size of the uncopressed catalog object in the first spelling: uncompressed http://gerrit.cloudera.org:8080/#/c/8825/2/be/src/catalog/catalog-util.cc File be/src/catalog/catalog-util.cc: http://gerrit.cloudera.org:8080/#/c/8825/2/be/src/catalog/catalog-util.cc@139 PS2, Line 139: Status CompressCatalogObject(const uint8_t* buf, uint32_t size, string* output) { Does the asymmetric of using string* for output here, and vector<uint8_t>* output in decompression bother us? http://gerrit.cloudera.org:8080/#/c/8825/2/common/thrift/CatalogInternalService.thrift File common/thrift/CatalogInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/8825/2/common/thrift/CatalogInternalService.thrift@28 PS2, Line 28: 1: required i64 callback_ctx I think this should be "2", to preserve the ordering. We don't explicitly support multiple versions of these things working together, but it's good to honor Thrift's goals here. http://gerrit.cloudera.org:8080/#/c/8825/2/common/thrift/CatalogInternalService.thrift@39 PS2, Line 39: 1: required i64 max_catalog_version Technically we should leave 2 and 3 around as optional, renaming them to _deprecated or some such. http://gerrit.cloudera.org:8080/#/c/8825/2/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: http://gerrit.cloudera.org:8080/#/c/8825/2/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@24 PS2, Line 24: import java.util.*; I think we should discourage wildcard imports. (You can configure your IDE to not do this.) We don't formally adopt Google's Java style guide, but https://google.github.io/styleguide/javaguide.html#s3.3.1-wildcard-imports is the reference. It's especially bad in line 46, where nobody knows what might be in org.apache.impala.common.*. http://gerrit.cloudera.org:8080/#/c/8825/2/fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java File fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java: http://gerrit.cloudera.org:8080/#/c/8825/2/fe/src/main/java/org/apache/impala/catalog/ImpaladCatalog.java@162 PS2, Line 162: LOG.trace((update.first ? "Deleting " : "Adding ") + "item: " + key + " version: " I don't know if this is performance sensitive, but slf4j has some mechanism to say: if (LOG.isEnabled(TRACE)) { ... } That would avoiding doing the string interpolation if it's not going to just throw it away. (I didn't look up the syntax for slf4j here.) http://gerrit.cloudera.org:8080/#/c/8825/2/fe/src/main/java/org/apache/impala/util/TByteBuffer.java File fe/src/main/java/org/apache/impala/util/TByteBuffer.java: http://gerrit.cloudera.org:8080/#/c/8825/2/fe/src/main/java/org/apache/impala/util/TByteBuffer.java@26 PS2, Line 26: * ByteBuffer-backed implementation of TTransport. https://github.com/apache/thrift/blame/master/lib/java/src/org/apache/thrift/transport/TByteBuffer.java https://issues.apache.org/jira/browse/THRIFT-3432 Looks like Thrift has added this. We've got other work to upgrade to a more recent version. The implementations are slightly different; let's make sure ours makes sense? -- To view, visit http://gerrit.cloudera.org:8080/8825 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3a8819cad734b3a416eef6c954e55b73cc6023ae Gerrit-Change-Number: 8825 Gerrit-PatchSet: 2 Gerrit-Owner: Tianyi Wang <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Reviewer: Tianyi Wang <[email protected]> Gerrit-Comment-Date: Thu, 14 Dec 2017 21:52:37 +0000 Gerrit-HasComments: Yes
