Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-2904: Support INSERT and LOAD DATA on S3 and between 
filesystems
......................................................................


Patch Set 5:

(41 comments)

Thanks Michael and Taras. I've made the changes.

http://gerrit.cloudera.org:8080/#/c/2574/5//COMMIT_MSG
Commit Message:

Line 19: 'generic_client'. test_load.py is refactored to use this generic
> Could we name this something that better indicates what it does? I'm up for
I used filesystem_client. Seems like the simplest pick.


http://gerrit.cloudera.org:8080/#/c/2574/5/infra/python/deps/requirements.txt
File infra/python/deps/requirements.txt:

Line 48: boto3 == 1.2.3
> Please continue sort top-level packages (sans ipython) alphabetically.
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

Line 117:     cls.s3_client = cls.create_s3_client()
> Do we need to create both clients in L117-118?
We always need a HDFS client currently. But I've moved S3 creation into the 
IS_S3 condition.


Line 120: generic_client
> Maybe call this cls.filesystem_client?
We need hdfs_client as some pure HDFS tests use that. Do you think we should 
move everything to filesystem_client? I've removed s3_client.


Line 156:   def create_s3_client(cls):
> It seems like indirection to have this method. Could this just be inlined i
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_ddl.py
File tests/metadata/test_ddl.py:

Line 149:       "location '{1}/{0}/t3/'".format(DDL_TEST_DB, WAREHOUSE))
> Nit: could you change the ordering of the format params (and format string)
The ordering does correspond. We use DDL_TEST_DB first:
"create external table {0}.t3(i int)"


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_explain.py
File tests/metadata/test_explain.py:

Line 106:     self.client.execute("CREATE TABLE %s.empty_partition (col int) " \
        :       "partitioned by (p int)" % self.TEST_DB_NAME);
> 1. backslash is not needed
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_load.py
File tests/metadata/test_load.py:

Line 62:           "{0}/{1}/100101.txt".format(STAGING_PATH, i))
> This is another wrong alignment. See my comment in tests/metadata/test_expl
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_recover_partitions.py
File tests/metadata/test_recover_partitions.py:

Line 117:       self.BASE_DIR + null_dir + file_path, null_inserted_value)
> needs another indent (2 spaces)
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 48:     self.client.execute(("INSERT OVERWRITE functional.%s"
       :         " SELECT int_col FROM functional.tinyinttable" % TBL_NAME))
> L50-51 from before were correctly formatted. This adds extraneous parenthes
Done


Line 70:     for file_ in hidden_file_locations:
       :       if not self.generic_client.exists(table_dir + file_):
       :         err_msg = "Hidden file '%s' was unexpectedly deleted by INSERT 
OVERWRITE"
       :         pytest.fail(err_msg % (table_dir + file_))
       : 
       :     for dir_ in dir_locations:
       :       if not self.generic_client.exists(table_dir + file_):
       :         err_msg = "Directory '%s' was unexpectedly deleted by INSERT 
OVERWRITE"
       :         pytest.fail(err_msg % (table_dir + dir_))
> Instead of if blocks and explicit pytest.fail() calls, why not just asserts
Done


Line 89:     self.generic_client.delete_file_dir(PART_DIR, recursive=True)
> Is PART_DIR even in scope? It looks to be part_dir in L84.
Looks like PART_DIR was changed to part_dir in a later patch. I missed that in 
the rebase. Done.


Line 360:       does_exist = self.generic_client.exists(path)
        :       if does_exist and not should_exist:
        :         pytest.fail("file/dir '%s' unexpectedly exists" % path)
        :       if should_exist and not does_exist:
        :         pytest.fail("file/dir '%s' does not exist" % path)
> if -> fail vs. assert question applies here, too.
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/util/filesystem_base.py
File tests/util/filesystem_base.py:

Line 20: class BaseFilesystem(object):
> Please add some vertical white space to make this look a little less dense.
Done


Line 21: =
> There should be spaces around =
Done


Line 25:     Create a file in 'path' and populate with 'file_data'. If 
overwrite is true,
> s/true/True/ ?
Done


Line 31:     """Create a directory in 'path' with permissions 'permission'"""
> What's permission? A umask string, or umask octal, or something else?
Done


Line 38:   def create_file(self, src, dst):
> There are two create_file methods.
Done


Line 47:     """Check if a particular path exists"""
> ...and does what?
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

Line 60: """Checks if a particular path exists""
> Is this comment correct?
Done


Line 65:     return super(PyWebHdfsClientWithChmod, self).make_dir(path, 
permission=permission)
> I don't think it's necessary to override make_dir (and other similar method
Done


Line 91: file_infos
> instead of declaring a variable here, why not just return directly?
Done


Line 91: super(PyWebHdfsClientWithChmod, self)
> Why not just self.list_dir()?
Done


Line 112:   def list_dir(self, path):
> It looks like you are changing the behavior of list_dir compared to PyWebHd
I thought we wouldn't need list_dir() after that, but I changed the name to 
'ls' in any case.


Line 119:   def exists(self, path):
> I think this can be omitted because all you're doing is calling the parent'
This function needs to be there because PyWebHdfsClient implements exists() 
with the exact same signature (no variable arguments).

If I don't overload it here, I get an error saying:
"Can't instantiate abstract class PyWebHdfsClientWithChmod with abstract method 
'exists'".

It's okay for the other functions because they have variable arguments and 
don't need to exactly match the function in filesystem_base.py


http://gerrit.cloudera.org:8080/#/c/2574/5/tests/util/s3_util.py
File tests/util/s3_util.py:

Line 30:     if overwrite is False:
> if not overwrite
Done


Line 34:   
> 4 space indent here
Done


Line 40:     return
> pass?
Done


Line 43:     self.s3client.copy_object(Bucket=self.bucketname,
       :         CopySource={'Bucket':self.bucketname, 'Key':src}, Key=dst)
> L44 is misaligned
Done


Line 49:     response = 
self.s3client.list_objects(Bucket=self.bucketname,Prefix=path)
> space after "bucketname,"
Done


Line 51: keys =
> You can put a return here
Done


Line 58:     if not path[len(path) - 1] == '/':
       :       path = path + '/'
> See str.endswith() . I've also seen patterns like this, which I can never d
I've accommodated endswith().
path.strip('/') would remove all the forward slashes from the path, so we can't 
go for that.


Line 61:       Bucket=self.bucketname,Prefix=path,Delimiter='/')
> space after "bucketname,". Also, the line needs another indent level
Done


Line 63:     if response.get('CommonPrefixes'):
> Some inline comments explaining what the relevant intermediate data structu
Done


Line 63: response.get('CommonPrefixes'):
> I think it's better to write:
Done


Line 64: response.get('CommonPrefixes')
> It's more Pythonic to write response['CommonPrefixes']. In general, it's no
Done


Line 71: len(tmp) - 2
> This can be replaced with tmp[-2]. In python tmp[len(tmp) - a] is equivalen
I didn't know that. Thanks! Made the change.


Line 82:     if contents is None:
       :       return False
       :     return True
> return contents is not None
Done


Line 90:     if recursive is True:
> if recursive:
Done


Line 93:         objects += [{'Key':k}]
> objects.append({'Key': k})
Done


Line 96:     
self.s3client.delete_objects(Bucket=self.bucketname,Delete={'Objects':objects})
> space after "bucketname,"
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Michael Brown <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-HasComments: Yes

Reply via email to