Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8825 )
Change subject: IMPALA-5990: End-to-end compression of metadata ...................................................................... Patch Set 4: (34 comments) General approach and flow looks good to me. I'm still thinking about details and testing. http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/CMakeLists.txt File be/src/catalog/CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/CMakeLists.txt@29 PS4, Line 29: ADD_BE_TEST(catalog-util-test) I think you forgot to add the .cc file http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-server.h File be/src/catalog/catalog-server.h: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-server.h@78 PS4, Line 78: void AddTopicUpdate(std::string key, const uint8_t* buf, uint32_t size, bool deleted); * AddPendingTopicItem() (and corresponding change to comment, see a comment elsewhere explaining our use of "topic" and "item") * Pass the key as reference? Seems clearer what is happening. Also comment that the function takes ownership of the key memory * 'buf' needs a better name http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-server.cc File be/src/catalog/catalog-server.cc: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-server.cc@245 PS4, Line 245: update.topic_entries.emplace_back(); Why not this instead of the loop: update.topic_entries = std::move(pending_topic_updates_); http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-util.h File be/src/catalog/catalog-util.h: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-util.h@44 PS4, Line 44: Status CompressCatalogObject(const uint8_t* buf, uint32_t size, Use src/dst as arg names. Same for decompress below. http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-util.cc File be/src/catalog/catalog-util.cc: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-util.cc@145 PS4, Line 145: output->resize(static_cast<size_t>(output_buffer_len)); why static_cast? http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog-util.cc@162 PS4, Line 162: output->resize(static_cast<size_t>(decompressed_len)); why static_cast? http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog.h File be/src/catalog/catalog.h: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/catalog/catalog.h@64 PS4, Line 64: Status GetCatalogDelta(void* server_ptr, int64_t from_version, "CatalogServer* caller" instead of "void* server_ptr", seems clearer, you can forward declare class CatalogServer to avoid the include http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/fe-support.cc File be/src/service/fe-support.cc: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/fe-support.cc@429 PS4, Line 429: jclass caller_class, jlong callback_ctx, jbyteArray key, Currently, we get the key bytes via String.getBytes() which I believe must create a new byte[]. If we pass the key as a jstring we can avoid a copy in most cases. http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/fe-support.cc@432 PS4, Line 432: signed char* key_buf = env->GetByteArrayElements(key, nullptr); I think we should consider using the "critical" version of this function. See JniScopedArrayCritical from: https://gerrit.cloudera.org/#/c/8150/6/be/src/util/jni-util.h That patch never made it in, you could just copy the code. http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/fe-support.cc@446 PS4, Line 446: Java_org_apache_impala_service_FeSupport_NativeGetCatalogUpdate(JNIEnv* env, NativeGetNextCatalogTopicItem() or similar the name should convey that this is a "get next" type of call and what exactly is being iterated over http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/fe-support.cc@454 PS4, Line 454: jclass caller_class, jbyteArray hdfs_location) { why not "jstring hdfs_location" http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/fe-support.cc@465 PS4, Line 465: jclass caller_class, jbyteArray hdfs_lib_file) { why not "jstring hdfs_lib_file" http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/fe-support.cc@551 PS4, Line 551: (char*)"NativeAddCatalogUpdate", (char*)"(J[B[BZ)V", make indentation consistent http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/frontend.h File be/src/service/frontend.h: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/frontend.h@41 PS4, Line 41: /// Request to update the Impalad catalog cache. The req argument contains a vector of Update comment on the 'req' arg http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/frontend.h@214 PS4, Line 214: // A helper class used to pass catalog object updates to the frontend. /// comment style here and below http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/frontend.h@217 PS4, Line 217: // Return the next catalog object update. The return type is Pair<Boolean, ByteBuffer>. Return the next catalog object from a catalog update. http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/frontend.h@220 PS4, Line 220: // return value is invalided on the next call. Comment on error handling http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/frontend.cc File be/src/service/frontend.cc: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/frontend.cc@274 PS4, Line 274: jclass pair_class = env->FindClass("Lorg/apache/impala/common/Pair;"); Better to get the class and method ids once during a static initialization phase. You can keep them as static members like is done in other JNI-based helpers, e.g., hbase-table.h http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/impala-server.cc@1319 PS4, Line 1319: class TopicItemIterator : public JniCatalogCacheUpdateIterator { Let's move these and its subclasses into catalog-util.h/cc or frontend.h/cc I could not find any guidance on this in the Google style guide. Personally, I find such inline class declarations obscure the rest of the function code and we generally don't have this pattern in the codebase. We should also explain the motivation behind using these iterators as opposed to a simpler approach where we populate two ArrayLists in one JNI call. The benefit of the iterators is that they reduce the intermediate memory consumption, right? http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/impala-server.cc@1337 PS4, Line 1337: LOG(ERROR) << "Error decompressing catalog object: " << s.GetDetail(); I think we could deal with errors in two ways: 1. We could propagate them as exceptions in env which are picked up on the Java side. That makes it clear that the update process failed and that the catalog is in an inconsistent state. 2. Instead of returning on the first error, we could attempt to process as many objects as we can and then LOG(ERROR) a report of failed updates. I have a slight preference for (2), but would also accept (1). In the current code, we don't have a good way to determine whether a null return value means we have successfully process all objects or not. It's also not obvious from the log message that the entire update process is aborted. http://gerrit.cloudera.org:8080/#/c/8825/4/be/src/service/impala-server.cc@1483 PS4, Line 1483: LOG(ERROR) << "Error serializing catalog object: " << s.GetDetail(); Same comments about error handling as above http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@395 PS4, Line 395: long callbackContext_; Maybe call this nativeCatalogServerPtr or similar to make it clear that this is a raw pointer to a native object. http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@397 PS4, Line 397: long fromVersion_; We only use the "_" suffix for private and protected members. I think it make sense to omit them here. http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@409 PS4, Line 409: void addTopic(TCatalogObject obj, boolean delete) throws TException { We typically use the term "topic" in the statestore topic sense, i.e. there is a metadata topic or a cluster membership topic. A clearer function name would be addTopicItem() or addItem() or addCatalogObject() or something like that. http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@418 PS4, Line 418: byte[] data = new TSerializer(new TBinaryProtocol.Factory()).serialize(obj); Do we need to create a new protocol and serializer each time this function is called? I think we can probably have the TSerializer as a member of this class. http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java File fe/src/main/java/org/apache/impala/service/FeSupport.java: http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@25 PS4, Line 25: import io.netty.buffer.ByteBuf; remove? http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@26 PS4, Line 26: import javolution.io.Struct; remove? http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@86 PS4, Line 86: // Add a topic update to the backend pending topic updates. 'serializationBuffer' is a Adds a topic item to the backend's pending metadata-topic update. http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@91 PS4, Line 91: // Get a topic update from the backend. // Get a catalog topic item from the backend, http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@93 PS4, Line 93: public native static Pair<Boolean, ByteBuffer> NativeGetCatalogUpdate( NativeGetCatalogTopicItem() http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@94 PS4, Line 94: long callbackContext); callbackContext sounds mysterious, how about nativeIteratorPtr or another name that makes it clear that this is a native pointer to an object (or more specifically an iterator) http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@96 PS4, Line 96: public native static void NativeLibCacheSetNeedsRefresh(byte[] hdfs_location); * camel case for args here and below * why is the location not a String (jstring in JNI)? Properly typing the arg makes the function easier to read/understand http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@97 PS4, Line 97: remove blank line http://gerrit.cloudera.org:8080/#/c/8825/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@98 PS4, Line 98: public native static void NativeLibCacheRemoveEntry(byte[] hdfs_lib_file); * camel case for arg * why is the location not a String (jstring in JNI)? Properly typing the arg makes the function easier to read/understand -- To view, visit http://gerrit.cloudera.org:8080/8825 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3a8819cad734b3a416eef6c954e55b73cc6023ae Gerrit-Change-Number: 8825 Gerrit-PatchSet: 4 Gerrit-Owner: Tianyi Wang <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Reviewer: Tianyi Wang <[email protected]> Gerrit-Comment-Date: Wed, 10 Jan 2018 05:33:25 +0000 Gerrit-HasComments: Yes
