Philip Zeyliger has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8894 )

Change subject: IMPALA-6372: Go parallel for Hive dataload
......................................................................


Patch Set 7:

(16 comments)

I did one pass through. This looks like a nice improvement!

http://gerrit.cloudera.org:8080/#/c/8894/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8894/7//COMMIT_MSG@34
PS7, Line 34: This saves about 10-15 minutes on dataload (including
Nice!


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py
File bin/load-data.py:

http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@39
PS7, Line 39: from multiprocessing import Pool
If you're using multiprocessing to offload work that doesn't happen in Python, 
you're better off with threadpool. Mostly, it doesn't kill your exception 
traces, which you really want when things go wrong. 
https://bugs.python.org/issue13831Ups was the bug.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@132
PS7, Line 132:     if (out_file):
nit: you don't need the parens.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@148
PS7, Line 148:   if not os.path.exists(file_name): return False
Is this interesting that we should print "File not found" or something?


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@174
PS7, Line 174:   if (is_success):
nit: no parens


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@190
PS7, Line 190:   if not os.path.exists(file_name): return False
same as above; log the filename?


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@212
PS7, Line 212:   if (is_success):
nit: no parens


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@283
PS7, Line 283: def exec_query_files_parallel(thread_pool, query_files, 
execution_type):
pydoc?


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@290
PS7, Line 290:     assert(False), "execution_type must be 'hive' or 'impala'"
I don't think this can happen given line 284.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@292
PS7, Line 292:   for result in thread_pool.imap_unordered(execution_function, 
query_files):
Given that you're using a thread pool, do you have issues with the print 
statements interleaving? If you switch to python logging, it has a lock to make 
the output not interleaved.

It looks like this was already similar previously, so it may not have be worse.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@301
PS7, Line 301:   LOG.debug(' '.join(sys.argv))
Is this printed? I don't see where logging is configured...


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@341
PS7, Line 341:     impala_create_files = [f for f in dataset_dir_contents if 
create_filename in f]
It took me a second to realize that "create_filename in f" is a substring 
search of create-...-impala-generated in some filename. .endswith() or 
.startswith() would be more obvious, but may not be true.

Consider ditching the list comprehensions and just doing

for f in dataset_dir_contents:
  if create_filename in f:
    impala_create_files.append(f)
   ....

It's more lines, but I think it's more obvious.

I'm flexible here.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@333
PS7, Line 333:     copy_avro_schemas_to_hdfs(os.path.join(sql_dir, 
AVRO_SCHEMA_DIR))
             :     dataset_dir_contents = [os.path.join(sql_dir, f) for f in 
os.listdir(sql_dir)]
             :     load_file_substr = "%s-%s" % (workload, 
options.exploration_strategy)
             :     # Data loading for both Impala and Hive is done in parallel, 
each file format has a
             :     # separate query file.
             :     create_filename = 'create-%s-impala-generated' % 
load_file_substr
             :     hive_load_filename = '%s-hive-generated' % load_file_substr
             :     load_filename = 'load-%s-impala-generated' % load_file_substr
             :     impala_create_files = [f for f in dataset_dir_contents if 
create_filename in f]
             :     hive_load_files = [f for f in dataset_dir_contents if 
hive_load_filename in f]
             :     impala_load_files = [f for f in dataset_dir_contents if 
load_filename in f]
             :
             :     # Execute the data
Since it's probably in your head, could you take a stab at documenting what 
this script expects in terms of files that exist?  There's a directory 
structure here and a file name scheme here that's pretty opaque.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@375
PS7, Line 375:     exec_query_files_parallel(thread_pool, impala_load_files, 
'impala')
It doesn't matter that this isn't in the if in line 373 because doing an empty 
list of parallel work is nothing? A comment may help.


http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@460
PS7, Line 460:     return bool(self.create) or bool(self.load) or 
bool(self.load_base)
I think this is:

return any(self.create, self.load, self.load_base)

and also

return self.create or self.load or self.load_base

Or am I missing something?


http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@641
PS7, Line 641:       if (output == hive_output):
nit: no parens



--
To view, visit http://gerrit.cloudera.org:8080/8894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Gerrit-Change-Number: 8894
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 Jan 2018 05:54:52 +0000
Gerrit-HasComments: Yes

Reply via email to