Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8894 )
Change subject: IMPALA-6372: Go parallel for Hive dataload ...................................................................... Patch Set 7: (16 comments) http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py File bin/load-data.py: http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@39 PS7, Line 39: from multiprocessing import Pool > If you're using multiprocessing to offload work that doesn't happen in Pyth Switched to multiprocessing.pool.ThreadPool. Verified that an exception in one of the threads produces a usable stack trace. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@41 PS7, Line 41: logging.basicConfig() : LOG = logging.getLogger('load-data.py') : LOG.setLevel(logging.DEBUG) This configures Python logging. I modified it so that messages are prefixed by the time. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@132 PS7, Line 132: if (out_file): > nit: you don't need the parens. Done http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@148 PS7, Line 148: if not os.path.exists(file_name): return False > Is this interesting that we should print "File not found" or something? Good point. Added a messege. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@174 PS7, Line 174: if (is_success): > nit: no parens Done http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@190 PS7, Line 190: if not os.path.exists(file_name): return False > same as above; log the filename? Done http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@212 PS7, Line 212: if (is_success): > nit: no parens Done http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@283 PS7, Line 283: def exec_query_files_parallel(thread_pool, query_files, execution_type): > pydoc? Added http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@290 PS7, Line 290: assert(False), "execution_type must be 'hive' or 'impala'" > I don't think this can happen given line 284. Done http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@292 PS7, Line 292: for result in thread_pool.imap_unordered(execution_function, query_files): > Given that you're using a thread pool, do you have issues with the print st This changes the logging so that each SQL file puts most of its logging into its own log file (i.e. a.sql -> a.sql.log) . The actual output back to stdout is just one print when starting to execute a SQL file and one print when the SQL file is done. The rate should be low. Switched most print statements to python logging. Most are at the INFO level. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@301 PS7, Line 301: LOG.debug(' '.join(sys.argv)) > Is this printed? I don't see where logging is configured... Looks like it is: DEBUG:load-data.py:/home/joe/view3/Impala/bin/load-data.py --workloads tpch -e core --impalad localhost:21000 --hive_hs2_hostport localhost:11050 --hdfs_namenode localhost:20500 http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@341 PS7, Line 341: impala_create_files = [f for f in dataset_dir_contents if create_filename in f] > It took me a second to realize that "create_filename in f" is a substring s Restructured these to be a bit clearer. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@333 PS7, Line 333: copy_avro_schemas_to_hdfs(os.path.join(sql_dir, AVRO_SCHEMA_DIR)) : dataset_dir_contents = [os.path.join(sql_dir, f) for f in os.listdir(sql_dir)] : load_file_substr = "%s-%s" % (workload, options.exploration_strategy) : # Data loading for both Impala and Hive is done in parallel, each file format has a : # separate query file. : create_filename = 'create-%s-impala-generated' % load_file_substr : hive_load_filename = '%s-hive-generated' % load_file_substr : load_filename = 'load-%s-impala-generated' % load_file_substr : impala_create_files = [f for f in dataset_dir_contents if create_filename in f] : hive_load_files = [f for f in dataset_dir_contents if hive_load_filename in f] : impala_load_files = [f for f in dataset_dir_contents if load_filename in f] : : # Execute the data > Since it's probably in your head, could you take a stab at documenting what Restructured this to add some comments and make things clearer. I also changed generate-schema-statements.py to get it up to date. http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@375 PS7, Line 375: exec_query_files_parallel(thread_pool, impala_load_files, 'impala') > It doesn't matter that this isn't in the if in line 373 because doing an em Moved this inside the if. http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@460 PS7, Line 460: return bool(self.create) or bool(self.load) or bool(self.load_base) > I think this is: I'm not sure how important the datatypes are, but or of lists is not a bool. e.g. a = ['a'] b = ['b'] a or b -> 'a' My guess is that this would still work, but it is a bit weird. Changed this to cleaner version: bool(self.create or self.load or self.load_base) http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@641 PS7, Line 641: if (output == hive_output): > nit: no parens Done -- To view, visit http://gerrit.cloudera.org:8080/8894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd Gerrit-Change-Number: 8894 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell <[email protected]> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Comment-Date: Tue, 27 Feb 2018 00:18:53 +0000 Gerrit-HasComments: Yes
