Philip Zeyliger has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8894 )

Change subject: IMPALA-6372: Go parallel for Hive dataload
......................................................................


Patch Set 9:

(14 comments)

I'm still reading, but figured I'd push send on the comments so far.

Overall, I'm very much looking forward to this one! Thanks for tackling it!

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py
File bin/load-data.py:

http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@41
PS9, Line 41: logging.basicConfig(format='%(asctime)s %(message)s', 
datefmt='%H:%M:%S')
Please move this and line 43 into main() at line 299. (Perhaps add a main() 
method.)

It's largely cosmetic, but not doing so prevents others from importing this.


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@129
PS9, Line 129: def exec_cmd(cmd, error_msg=None, exit_on_error=True, 
out_file=None):
add pydoc?


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@137
PS9, Line 137:   except Exception as e:
Since you're using subprocess.call (and not check_call), the only exception 
possible here is failing to open the file? Should we just propagate it and not 
have have special handling?

Regardless, LOG.exception() is handy here. I think exceptions here are pretty 
unexpected.


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@153
PS9, Line 153:   LOG.info("Beginning execution of hive SQL: 
{0}".format(base_file_name))
Consider doing file_name here and not base_file_name. If I have the misfortune 
of looking at this log line, I'll want the whole thing somewhere.


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@214
PS9, Line 214:       traceback.print_exc(100, out_file)
I think you can just do:

traceback.print_exc(file=out_file)

and get whatever the default is for "max depth".

Obviously ignore if it doesn't work.

In [4]: try:
    raise Exception("hey")
except:
    import traceback
    import sys
    traceback.print_exc(file=sys.stderr)
   ...:
Traceback (most recent call last):
  File "<ipython-input-4-4d46ab4becd0>", line 2, in <module>
    raise Exception("hey")
Exception: hey


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@288
PS9, Line 288:  assert(execution_type == 'impala' or execution_type == 'hive')
             :   if execution_type == 'impala':
             :     execution_function = exec_impala_query_from_file
             :   elif execution_type == 'hive':
             :     execution_function = exec_hive_query_from_file_beeline
Consider making exec_query_files_paralllel "private" and create two methods, 
"impala_exec_query_files_parallel" and "hive_exec_query_files_parallel" that 
use this mechanism. It'll save you some string literalls later in the code.


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@318
PS9, Line 318:   LOG.info('Starting data load for the following workloads: ' + 
', '.join(workloads))
Consider logging options.num_processes at this point too.


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@326
PS9, Line 326:   for workload in workloads:
Consider moving the inside of this into a method.

Also, does this get invoked with multiple workloads? If it does, do we want to 
handle those in parallel, since they don't depend on each other? (Or do they?)


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@345
PS9, Line 345:     #    A. Impala table creation scripts run in Impala to 
create tables, partitions,
Do we want to measure the time-consumed by things A-F? i.e., is there more time 
to be gained by understanding more of these dependencies?


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@371
PS9, Line 371:     # TODO: This is insane. It used to be even more insane. 
Maybe it shouldn't be insane?
:)


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@419
PS9, Line 419:         invalidate_files.append(filename)
Could there be files "left over" that shouldn't be there?

elif:
  assert False, "..."?


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@421
PS9, Line 421:     # Execute the data loading scripts.
This seems like as good a place as any to log the contents of the 6 lists you 
created.


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@437
PS9, Line 437:     hive_load_text_file = next((q for q in hive_load_files if 
'text-none-none' in q), None)
Do you need to assert that there's only one match? Would it be simpler to do

hive_load_text_files = [ q for q in hive_load_files if 'text-none-none' in q ]
hive_load_nontext_files = [q for q in hive_load_files if not 'text-none-none' 
in q]
exec_query_files_parallel(hive_load_text_files)
exec_query_files_parallel(hiev_load_nontext_files)


http://gerrit.cloudera.org:8080/#/c/8894/9/bin/load-data.py@441
PS9, Line 441:     if len(hive_load_files) != 0:
Could we make it so that exec_query_files_parallell handles a zero-length list 
gracefully?



--
To view, visit http://gerrit.cloudera.org:8080/8894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Gerrit-Change-Number: 8894
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Comment-Date: Mon, 02 Apr 2018 23:36:11 +0000
Gerrit-HasComments: Yes

Reply via email to