Sailesh Mukil has posted comments on this change. Change subject: IMPALA-2904: Support INSERT and LOAD DATA on S3 and between filesystems ......................................................................
Patch Set 5: (41 comments) Thanks Michael and Taras. I've made the changes. http://gerrit.cloudera.org:8080/#/c/2574/5//COMMIT_MSG Commit Message: Line 19: 'generic_client'. test_load.py is refactored to use this generic > Could we name this something that better indicates what it does? I'm up for I used filesystem_client. Seems like the simplest pick. http://gerrit.cloudera.org:8080/#/c/2574/5/infra/python/deps/requirements.txt File infra/python/deps/requirements.txt: Line 48: boto3 == 1.2.3 > Please continue sort top-level packages (sans ipython) alphabetically. Done http://gerrit.cloudera.org:8080/#/c/2574/5/tests/common/impala_test_suite.py File tests/common/impala_test_suite.py: Line 117: cls.s3_client = cls.create_s3_client() > Do we need to create both clients in L117-118? We always need a HDFS client currently. But I've moved S3 creation into the IS_S3 condition. Line 120: generic_client > Maybe call this cls.filesystem_client? We need hdfs_client as some pure HDFS tests use that. Do you think we should move everything to filesystem_client? I've removed s3_client. Line 156: def create_s3_client(cls): > It seems like indirection to have this method. Could this just be inlined i Done http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_ddl.py File tests/metadata/test_ddl.py: Line 149: "location '{1}/{0}/t3/'".format(DDL_TEST_DB, WAREHOUSE)) > Nit: could you change the ordering of the format params (and format string) The ordering does correspond. We use DDL_TEST_DB first: "create external table {0}.t3(i int)" http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_explain.py File tests/metadata/test_explain.py: Line 106: self.client.execute("CREATE TABLE %s.empty_partition (col int) " \ : "partitioned by (p int)" % self.TEST_DB_NAME); > 1. backslash is not needed Done http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_load.py File tests/metadata/test_load.py: Line 62: "{0}/{1}/100101.txt".format(STAGING_PATH, i)) > This is another wrong alignment. See my comment in tests/metadata/test_expl Done http://gerrit.cloudera.org:8080/#/c/2574/5/tests/metadata/test_recover_partitions.py File tests/metadata/test_recover_partitions.py: Line 117: self.BASE_DIR + null_dir + file_path, null_inserted_value) > needs another indent (2 spaces) Done http://gerrit.cloudera.org:8080/#/c/2574/5/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 48: self.client.execute(("INSERT OVERWRITE functional.%s" : " SELECT int_col FROM functional.tinyinttable" % TBL_NAME)) > L50-51 from before were correctly formatted. This adds extraneous parenthes Done Line 70: for file_ in hidden_file_locations: : if not self.generic_client.exists(table_dir + file_): : err_msg = "Hidden file '%s' was unexpectedly deleted by INSERT OVERWRITE" : pytest.fail(err_msg % (table_dir + file_)) : : for dir_ in dir_locations: : if not self.generic_client.exists(table_dir + file_): : err_msg = "Directory '%s' was unexpectedly deleted by INSERT OVERWRITE" : pytest.fail(err_msg % (table_dir + dir_)) > Instead of if blocks and explicit pytest.fail() calls, why not just asserts Done Line 89: self.generic_client.delete_file_dir(PART_DIR, recursive=True) > Is PART_DIR even in scope? It looks to be part_dir in L84. Looks like PART_DIR was changed to part_dir in a later patch. I missed that in the rebase. Done. Line 360: does_exist = self.generic_client.exists(path) : if does_exist and not should_exist: : pytest.fail("file/dir '%s' unexpectedly exists" % path) : if should_exist and not does_exist: : pytest.fail("file/dir '%s' does not exist" % path) > if -> fail vs. assert question applies here, too. Done http://gerrit.cloudera.org:8080/#/c/2574/5/tests/util/filesystem_base.py File tests/util/filesystem_base.py: Line 20: class BaseFilesystem(object): > Please add some vertical white space to make this look a little less dense. Done Line 21: = > There should be spaces around = Done Line 25: Create a file in 'path' and populate with 'file_data'. If overwrite is true, > s/true/True/ ? Done Line 31: """Create a directory in 'path' with permissions 'permission'""" > What's permission? A umask string, or umask octal, or something else? Done Line 38: def create_file(self, src, dst): > There are two create_file methods. Done Line 47: """Check if a particular path exists""" > ...and does what? Done http://gerrit.cloudera.org:8080/#/c/2574/5/tests/util/hdfs_util.py File tests/util/hdfs_util.py: Line 60: """Checks if a particular path exists"" > Is this comment correct? Done Line 65: return super(PyWebHdfsClientWithChmod, self).make_dir(path, permission=permission) > I don't think it's necessary to override make_dir (and other similar method Done Line 91: file_infos > instead of declaring a variable here, why not just return directly? Done Line 91: super(PyWebHdfsClientWithChmod, self) > Why not just self.list_dir()? Done Line 112: def list_dir(self, path): > It looks like you are changing the behavior of list_dir compared to PyWebHd I thought we wouldn't need list_dir() after that, but I changed the name to 'ls' in any case. Line 119: def exists(self, path): > I think this can be omitted because all you're doing is calling the parent' This function needs to be there because PyWebHdfsClient implements exists() with the exact same signature (no variable arguments). If I don't overload it here, I get an error saying: "Can't instantiate abstract class PyWebHdfsClientWithChmod with abstract method 'exists'". It's okay for the other functions because they have variable arguments and don't need to exactly match the function in filesystem_base.py http://gerrit.cloudera.org:8080/#/c/2574/5/tests/util/s3_util.py File tests/util/s3_util.py: Line 30: if overwrite is False: > if not overwrite Done Line 34: > 4 space indent here Done Line 40: return > pass? Done Line 43: self.s3client.copy_object(Bucket=self.bucketname, : CopySource={'Bucket':self.bucketname, 'Key':src}, Key=dst) > L44 is misaligned Done Line 49: response = self.s3client.list_objects(Bucket=self.bucketname,Prefix=path) > space after "bucketname," Done Line 51: keys = > You can put a return here Done Line 58: if not path[len(path) - 1] == '/': : path = path + '/' > See str.endswith() . I've also seen patterns like this, which I can never d I've accommodated endswith(). path.strip('/') would remove all the forward slashes from the path, so we can't go for that. Line 61: Bucket=self.bucketname,Prefix=path,Delimiter='/') > space after "bucketname,". Also, the line needs another indent level Done Line 63: if response.get('CommonPrefixes'): > Some inline comments explaining what the relevant intermediate data structu Done Line 63: response.get('CommonPrefixes'): > I think it's better to write: Done Line 64: response.get('CommonPrefixes') > It's more Pythonic to write response['CommonPrefixes']. In general, it's no Done Line 71: len(tmp) - 2 > This can be replaced with tmp[-2]. In python tmp[len(tmp) - a] is equivalen I didn't know that. Thanks! Made the change. Line 82: if contents is None: : return False : return True > return contents is not None Done Line 90: if recursive is True: > if recursive: Done Line 93: objects += [{'Key':k}] > objects.append({'Key': k}) Done Line 96: self.s3client.delete_objects(Bucket=self.bucketname,Delete={'Objects':objects}) > space after "bucketname," Done -- To view, visit http://gerrit.cloudera.org:8080/2574 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Michael Brown <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-HasComments: Yes
