Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8894 )
Change subject: IMPALA-6372: Go parallel for Hive dataload
......................................................................


Patch Set 9:

(14 comments)

I'm still reading, but figured I'd push send on the comments so far.

Overall, I'm very much looking forward to this one! Thanks for tackling it!

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py
File bin/load-data.py:

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@41
PS9, Line 41: logging.basicConfig(format='%(asctime)s %(message)s', datefmt='%H:%M:%S')
Please move this and line 43 into main() at line 299. (Perhaps add a main()
method.) It's largely cosmetic, but not doing so prevents others from
importing this module.

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@129
PS9, Line 129: def exec_cmd(cmd, error_msg=None, exit_on_error=True, out_file=None):
Add pydoc?

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@137
PS9, Line 137: except Exception as e:
Since you're using subprocess.call (and not check_call), the only exception
possible here is failing to open the file? Should we just propagate it and not
have special handling? Regardless, LOG.exception() is handy here. I think
exceptions here are pretty unexpected.

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@153
PS9, Line 153: LOG.info("Beginning execution of hive SQL: {0}".format(base_file_name))
Consider logging file_name here rather than base_file_name. If I have the
misfortune of looking at this log line, I'll want the whole path somewhere.

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@214
PS9, Line 214: traceback.print_exc(100, out_file)
I think you can just do:

  traceback.print_exc(file=out_file)

and get whatever the default is for "max depth". Obviously ignore if it
doesn't work.

  In [4]: try:
     ...:     raise Exception("hey")
     ...: except:
     ...:     import traceback
     ...:     import sys
     ...:     traceback.print_exc(file=sys.stderr)
     ...:
  Traceback (most recent call last):
    File "<ipython-input-4-4d46ab4becd0>", line 2, in <module>
      raise Exception("hey")
  Exception: hey

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@288
PS9, Line 288: assert(execution_type == 'impala' or execution_type == 'hive')
             : if execution_type == 'impala':
             :   execution_function = exec_impala_query_from_file
             : elif execution_type == 'hive':
             :   execution_function = exec_hive_query_from_file_beeline
Consider making exec_query_files_parallel "private" and creating two methods,
"impala_exec_query_files_parallel" and "hive_exec_query_files_parallel", that
use this mechanism. It'll save you some string literals later in the code.
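Roughly what I have in mind is the sketch below; I'm guessing at the argument
list of exec_query_files_parallel, so treat the names as placeholders:

  # Sketch only: keep the existing parallel machinery in a "private" helper
  # and dispatch via small wrappers instead of the execution_type string.
  def _exec_query_files_parallel(query_files, execution_function):
    # ... existing body of exec_query_files_parallel, minus the string dispatch ...
    pass

  def impala_exec_query_files_parallel(query_files):
    _exec_query_files_parallel(query_files, exec_impala_query_from_file)

  def hive_exec_query_files_parallel(query_files):
    _exec_query_files_parallel(query_files, exec_hive_query_from_file_beeline)

Callers then say which engine they want by calling the right wrapper rather
than passing 'impala' or 'hive' strings around.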
http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@318
PS9, Line 318: LOG.info('Starting data load for the following workloads: ' + ', '.join(workloads))
Consider logging options.num_processes at this point too.

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@326
PS9, Line 326: for workload in workloads:
Consider moving the inside of this loop into a method. Also, does this get
invoked with multiple workloads? If it does, do we want to handle those in
parallel, since they don't depend on each other? (Or do they?)

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@345
PS9, Line 345: # A. Impala table creation scripts run in Impala to create tables, partitions,
Do we want to measure the time consumed by things A-F? i.e., is there more
time to be gained by understanding more of these dependencies?

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@371
PS9, Line 371: # TODO: This is insane. It used to be even more insane.
Maybe it shouldn't be insane? :)

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@419
PS9, Line 419: invalidate_files.append(filename)
Could there be files "left over" that shouldn't be there? i.e., elif: assert False, "..."?

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@421
PS9, Line 421: # Execute the data loading scripts.
This seems like as good a place as any to log the contents of the 6 lists you
created.

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@437
PS9, Line 437: hive_load_text_file = next((q for q in hive_load_files if 'text-none-none' in q), None)
Do you need to assert that there's only one match? Would it be simpler to do

  hive_load_text_files = [q for q in hive_load_files if 'text-none-none' in q]
  hive_load_nontext_files = [q for q in hive_load_files if 'text-none-none' not in q]
  exec_query_files_parallel(hive_load_text_files)
  exec_query_files_parallel(hive_load_nontext_files)

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@441
PS9, Line 441: if len(hive_load_files) != 0:
Could we make it so that exec_query_files_parallel handles a zero-length list
gracefully?
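Something along these lines, purely as a sketch (I'm guessing at the current
argument names):

  def exec_query_files_parallel(query_files, execution_type):
    # An empty list is a no-op, so callers don't need to guard the call
    # with their own len() checks.
    if not query_files:
      return
    # ... existing parallel execution logic ...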
--
To view, visit http://gerrit.cloudera.org:8080/8894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Gerrit-Change-Number: 8894
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Comment-Date: Mon, 02 Apr 2018 23:36:11 +0000
Gerrit-HasComments: Yes