Henry Robinson has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between 
filesystems
......................................................................


Patch Set 5:

(30 comments)

http://gerrit.cloudera.org:8080/#/c/2574/5//COMMIT_MSG
Commit Message:

Line 9: Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
      : functionally enables LOAD DATA and INSERT on S3 without making major
      : changes for the sake of improving performance over S3. This patch also
      : enables both INSERT and LOAD DATA between file systems.
Say something about the approach here. What had to change?


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 47: (
remove


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache;
is this per partition or per filesystem? Not clear here what role this plays in 
the method.


Line 772: over
in or on, not over (you partition over a partition function).


Line 772: allows to
"allow us to" or better "allows writes to tables ..."


Line 773: So we need to open connections to different filesystems as necessary
this sentence should be where you explain the "one cnxn per fs" requirement.


Line 774: the connections
"one connection per filesystem"


Line 772: InsertStmt allows to insert in tables that have partitions over 
multiple filesystems.
        :     // So we need to open connections to different filesystems as 
necessary.
        :     // We use a local connection cache and populate it with the 
connections to the necessary
        :     // filesystems before we do the actual operations (copy, move, 
delete, etc.). This is
        :     // because the operations are done in parallel and we don't want 
to open multiple
        :     // connections to the same filesystem. Doing so could cause 
inconsistencies in the
        :     // filesystem and return with errors.
This comment as a whole would be better before line 754.


Line 776: we don't want to open multiple
        :     // connections to the same filesystem. Doing so could cause 
inconsistencies in the
        :     // filesystem and return with errors
this is vague - what could go wrong?


Line 779: partition_connection
partition_fs_connection (or just fs_connection)


Line 780: RETURN_IF_ERROR
does this leave behind any cleanup that needs to be done?


Line 798: 
this blank line can go


Line 799: bool is_s3_path = false;
        :     if (IsS3APath(part_path.c_str())) is_s3_path = true;
write this as:

  bool is_s3_path = IsS3APath(part_path.c_str());


Line 823:  
indentation (not your change)


Line 846: !is_s3_path
move this into the if on line 843, save a level of nesting


Line 864: if (!is_s3_path
again, combine with above line


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 55: Status::OK();
combine declaration with line 57


Line 60: if (connection_status.ok()) {
instead of this indentation, make a method call "AddError(const string& 
error_msg)" which does lines 102 -> 126 and call it here and return (after 
calling MarkOneOpDone())


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap* 
local_connection_cache) {
        :     DCHECK(local_connection_cache != NULL);
        :     local_connection_cache_ = local_connection_cache;
        :   }
how about just making this a parameter to Execute() - is it needed anywhere 
else?


Line 137: local
what does it mean to be 'local'? Who owns this object?


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 76: const char* src, const char* to_check
does this ever get called with char*s that aren't derived from string objects?


Line 77: scheme of our filesystem support
grammar


Line 79: strncmp
this needs fixing. You need to extract the scheme (by searching for "://" I 
think) and do this properly.


Line 82: bool SpansMultipleFilesystems(const char* pathA, const char* pathB) {
       :   return !IsPathOnFilesystem(pathA, pathB);
       : }
remove this method. Have the caller just call IsPathOnFilesystem().


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:

Line 51: IsPathOnFilesystem
All paths are on some filesystem. Call this "FilesystemsMatch" or something.

What should this return if either path doesn't have a scheme component?


Line 51: to_check
'to_check' doesn't really tell me what this is (and the type doesn't help). 
Obviously by context it's a path, but still better to show this in the variable 
name.

Also, since this method is symmetric (i.e. doesn't matter what order the 
arguments are given), better just to name them symmetrically, e.g. 'path_a', 
'path_b'.


http://gerrit.cloudera.org:8080/#/c/2574/5/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

Line 390:   // Full path to the base directory for this partition.
be explicit about what this includes. Does it have the scheme? Is it a path or 
a URL?


http://gerrit.cloudera.org:8080/#/c/2574/5/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 173:       destFs.rename(sourceFile, destFile);
for simplicity, return here and then save the else { } block below


Line 175: isDestDfs && sameFileSystem
what you actually mean here is if (!arePathsInSameEncryptionZone(...)) which 
suggests it's worthwhile to have a separate variable to track that:

  boolean sameEncryptionZone = arePaths...( );
  boolean doRename = (isDestDfs && sameFileSystem) ? sameEncryptionZone : 
sameFileSystem;


Line 187:           true, true, CONF);
can you get more of this on the previous line?


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Michael Brown <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-HasComments: Yes

Reply via email to