Henry Robinson has posted comments on this change. Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems ......................................................................
Patch Set 5: (30 comments) http://gerrit.cloudera.org:8080/#/c/2574/5//COMMIT_MSG Commit Message: Line 9: Previously Impala disallowed LOAD DATA and INSERT on S3. This patch : functionally enables LOAD DATA and INSERT on S3 without making major : changes for the sake of improving performance over S3. This patch also : enables both INSERT and LOAD DATA between file systems. Say something about the approach here. What had to change? http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 47: ( remove http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache; is this per partition or per filesystem? Not clear here what role this plays in the method. Line 772: over in or on, not over (you partition over a partition function). Line 772: allows to "allow us to" or better "allows writes to tables ..." Line 773: So we need to open connections to different filesystems as necessary this sentence should be where you explain the "one cnxn per fs" requirement. Line 774: the connections "one connection per filesystem" Line 772: InsertStmt allows to insert in tables that have partitions over multiple filesystems. : // So we need to open connections to different filesystems as necessary. : // We use a local connection cache and populate it with the connections to the necessary : // filesystems before we do the actual operations (copy, move, delete, etc.). This is : // because the operations are done in parallel and we don't want to open multiple : // connections to the same filesystem. Doing so could cause inconsistencies in the : // filesystem and return with errors. This comment as a whole would be better before line 754. Line 776: we don't want to open multiple : // connections to the same filesystem. Doing so could cause inconsistencies in the : // filesystem and return with errors this is vague - what could go wrong? Line 779: partition_connection partition_fs_connection (or just fs_connection) Line 780: RETURN_IF_ERROR does this leave behind any cleanup that needs to be done? Line 798: this blank line can go Line 799: bool is_s3_path = false; : if (IsS3APath(part_path.c_str())) is_s3_path = true; write this as: bool is_s3_path = IsS3APath(part_path.c_str()); Line 823: indentation (not your change) Line 846: !is_s3_path move this into the if on line 843, save a level of nesting Line 864: if (!is_s3_path again, combine with above line http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.cc File be/src/util/hdfs-bulk-ops.cc: Line 55: Status::OK(); combine declaration with line 57 Line 60: if (connection_status.ok()) { instead of this indentation, make a method call "AddError(const string& error_msg)" which does lines 102 -> 126 and call it here and return (after calling MarkOneOpDone()) http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.h File be/src/util/hdfs-bulk-ops.h: Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap* local_connection_cache) { : DCHECK(local_connection_cache != NULL); : local_connection_cache_ = local_connection_cache; : } how about just making this a parameter to Execute() - is it needed anywhere else? Line 137: local what does it mean to be 'local'? Who owns this object? http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.cc File be/src/util/hdfs-util.cc: Line 76: const char* src, const char* to_check does this ever get called with char*s that aren't derived from string objects? Line 77: scheme of our filesystem support grammar Line 79: strncmp this needs fixing. You need to extract the scheme (by searching for "://" I think) and do this properly. Line 82: bool SpansMultipleFilesystems(const char* pathA, const char* pathB) { : return !IsPathOnFilesystem(pathA, pathB); : } remove this method. Have the caller just call IsPathOnFilesystem(). http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.h File be/src/util/hdfs-util.h: Line 51: IsPathOnFilesystem All paths are on some filesystem. Call this "FilesystemsMatch" or something. What should this return if either path doesn't have a scheme component? Line 51: to_check 'to_check' doesn't really tell me what this is (and the type doesn't help). Obviously by context it's a path, but still better to show this in the variable name. Also, since this method is symmetric (i.e. doesn't matter what order the arguments are given), better just to name them symmetrically, e.g. 'path_a', 'path_b'. http://gerrit.cloudera.org:8080/#/c/2574/5/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: Line 390: // Full path to the base directory for this partition. be explicit about what this includes. Does it have the scheme? Is it a path or a URL? http://gerrit.cloudera.org:8080/#/c/2574/5/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java: Line 173: destFs.rename(sourceFile, destFile); for simplicity, return here and then save the else { } block below Line 175: isDestDfs && sameFileSystem what you actually mean here is if (!arePathsInSameEncryptionZone(...)) which suggests it's worthwhile to have a separate variable to track that: boolean sameEncryptionZone = arePaths...( ); boolean doRename = (isDestDfs && sameFileSystem) ? sameEncryptionZone : sameFileSystem; Line 187: true, true, CONF); can you get more of this on the previous line? -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Michael Brown <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-HasComments: Yes
