Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8894 )
Change subject: IMPALA-6372: Go parallel for Hive dataload ...................................................................... Patch Set 7: (16 comments) I did one pass through. This looks like a nice improvement! http://gerrit.cloudera.org:8080/#/c/8894/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8894/7//COMMIT_MSG@34 PS7, Line 34: This saves about 10-15 minutes on dataload (including Nice! http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py File bin/load-data.py: http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@39 PS7, Line 39: from multiprocessing import Pool If you're using multiprocessing to offload work that doesn't happen in Python, you're better off with threadpool. Mostly, it doesn't kill your exception traces, which you really want when things go wrong. https://bugs.python.org/issue13831Ups was the bug. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@132 PS7, Line 132: if (out_file): nit: you don't need the parens. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@148 PS7, Line 148: if not os.path.exists(file_name): return False Is this interesting that we should print "File not found" or something? http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@174 PS7, Line 174: if (is_success): nit: no parens http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@190 PS7, Line 190: if not os.path.exists(file_name): return False same as above; log the filename? http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@212 PS7, Line 212: if (is_success): nit: no parens http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@283 PS7, Line 283: def exec_query_files_parallel(thread_pool, query_files, execution_type): pydoc? http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@290 PS7, Line 290: assert(False), "execution_type must be 'hive' or 'impala'" I don't think this can happen given line 284. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@292 PS7, Line 292: for result in thread_pool.imap_unordered(execution_function, query_files): Given that you're using a thread pool, do you have issues with the print statements interleaving? If you switch to python logging, it has a lock to make the output not interleaved. It looks like this was already similar previously, so it may not have be worse. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@301 PS7, Line 301: LOG.debug(' '.join(sys.argv)) Is this printed? I don't see where logging is configured... http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@341 PS7, Line 341: impala_create_files = [f for f in dataset_dir_contents if create_filename in f] It took me a second to realize that "create_filename in f" is a substring search of create-...-impala-generated in some filename. .endswith() or .startswith() would be more obvious, but may not be true. Consider ditching the list comprehensions and just doing for f in dataset_dir_contents: if create_filename in f: impala_create_files.append(f) .... It's more lines, but I think it's more obvious. I'm flexible here. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@333 PS7, Line 333: copy_avro_schemas_to_hdfs(os.path.join(sql_dir, AVRO_SCHEMA_DIR)) : dataset_dir_contents = [os.path.join(sql_dir, f) for f in os.listdir(sql_dir)] : load_file_substr = "%s-%s" % (workload, options.exploration_strategy) : # Data loading for both Impala and Hive is done in parallel, each file format has a : # separate query file. : create_filename = 'create-%s-impala-generated' % load_file_substr : hive_load_filename = '%s-hive-generated' % load_file_substr : load_filename = 'load-%s-impala-generated' % load_file_substr : impala_create_files = [f for f in dataset_dir_contents if create_filename in f] : hive_load_files = [f for f in dataset_dir_contents if hive_load_filename in f] : impala_load_files = [f for f in dataset_dir_contents if load_filename in f] : : # Execute the data Since it's probably in your head, could you take a stab at documenting what this script expects in terms of files that exist? There's a directory structure here and a file name scheme here that's pretty opaque. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@375 PS7, Line 375: exec_query_files_parallel(thread_pool, impala_load_files, 'impala') It doesn't matter that this isn't in the if in line 373 because doing an empty list of parallel work is nothing? A comment may help. http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@460 PS7, Line 460: return bool(self.create) or bool(self.load) or bool(self.load_base) I think this is: return any(self.create, self.load, self.load_base) and also return self.create or self.load or self.load_base Or am I missing something? http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@641 PS7, Line 641: if (output == hive_output): nit: no parens -- To view, visit http://gerrit.cloudera.org:8080/8894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd Gerrit-Change-Number: 8894 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com> Gerrit-Comment-Date: Fri, 19 Jan 2018 05:54:52 +0000 Gerrit-HasComments: Yes