[Impala-ASF-CR] IMPALA-6372: Go parallel for Hive dataload

Joe McDonnell (Code Review) Mon, 26 Feb 2018 16:19:22 -0800

Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8894 )


Change subject: IMPALA-6372: Go parallel for Hive dataload
......................................................................


Patch Set 7:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py
File bin/load-data.py:

http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@39
PS7, Line 39: from multiprocessing import Pool
> If you're using multiprocessing to offload work that doesn't happen in Pyth
Switched to multiprocessing.pool.ThreadPool. Verified that an exception in one 
of the threads produces a usable stack trace.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@41
PS7, Line 41: logging.basicConfig()
            : LOG = logging.getLogger('load-data.py')
            : LOG.setLevel(logging.DEBUG)
This configures Python logging. I modified it so that messages are prefixed by 
the time.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@132
PS7, Line 132:     if (out_file):
> nit: you don't need the parens.
Done


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@148
PS7, Line 148:   if not os.path.exists(file_name): return False
> Is this interesting that we should print "File not found" or something?
Good point. Added a messege.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@174
PS7, Line 174:   if (is_success):
> nit: no parens
Done


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@190
PS7, Line 190:   if not os.path.exists(file_name): return False
> same as above; log the filename?
Done


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@212
PS7, Line 212:   if (is_success):
> nit: no parens
Done


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@283
PS7, Line 283: def exec_query_files_parallel(thread_pool, query_files, 
execution_type):
> pydoc?
Added


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@290
PS7, Line 290:     assert(False), "execution_type must be 'hive' or 'impala'"
> I don't think this can happen given line 284.
Done


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@292
PS7, Line 292:   for result in thread_pool.imap_unordered(execution_function, 
query_files):
> Given that you're using a thread pool, do you have issues with the print st
This changes the logging so that each SQL file puts most of its logging into 
its own log file (i.e. a.sql -> a.sql.log) . The actual output back to stdout 
is just one print when starting to execute a SQL file and one print when the 
SQL file is done. The rate should be low.

Switched most print statements to python logging. Most are at the INFO level.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@301
PS7, Line 301:   LOG.debug(' '.join(sys.argv))
> Is this printed? I don't see where logging is configured...
Looks like it is:
DEBUG:load-data.py:/home/joe/view3/Impala/bin/load-data.py --workloads tpch -e 
core --impalad localhost:21000 --hive_hs2_hostport localhost:11050 
--hdfs_namenode localhost:20500


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@341
PS7, Line 341:     impala_create_files = [f for f in dataset_dir_contents if 
create_filename in f]
> It took me a second to realize that "create_filename in f" is a substring s
Restructured these to be a bit clearer.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@333
PS7, Line 333:     copy_avro_schemas_to_hdfs(os.path.join(sql_dir, 
AVRO_SCHEMA_DIR))
             :     dataset_dir_contents = [os.path.join(sql_dir, f) for f in 
os.listdir(sql_dir)]
             :     load_file_substr = "%s-%s" % (workload, 
options.exploration_strategy)
             :     # Data loading for both Impala and Hive is done in parallel, 
each file format has a
             :     # separate query file.
             :     create_filename = 'create-%s-impala-generated' % 
load_file_substr
             :     hive_load_filename = '%s-hive-generated' % load_file_substr
             :     load_filename = 'load-%s-impala-generated' % load_file_substr
             :     impala_create_files = [f for f in dataset_dir_contents if 
create_filename in f]
             :     hive_load_files = [f for f in dataset_dir_contents if 
hive_load_filename in f]
             :     impala_load_files = [f for f in dataset_dir_contents if 
load_filename in f]
             :
             :     # Execute the data
> Since it's probably in your head, could you take a stab at documenting what
Restructured this to add some comments and make things clearer. I also changed 
generate-schema-statements.py to get it up to date.


http://gerrit.cloudera.org:8080/#/c/8894/7/bin/load-data.py@375
PS7, Line 375:     exec_query_files_parallel(thread_pool, impala_load_files, 
'impala')
> It doesn't matter that this isn't in the if in line 373 because doing an em
Moved this inside the if.


http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@460
PS7, Line 460:     return bool(self.create) or bool(self.load) or 
bool(self.load_base)
> I think this is:
I'm not sure how important the datatypes are, but or of lists is not a bool.
e.g.
a = ['a']
b = ['b']
a or b -> 'a'
My guess is that this would still work, but it is a bit weird.

Changed this to cleaner version:
bool(self.create or self.load or self.load_base)


http://gerrit.cloudera.org:8080/#/c/8894/7/testdata/bin/generate-schema-statements.py@641
PS7, Line 641:       if (output == hive_output):
> nit: no parens
Done



--
To view, visit http://gerrit.cloudera.org:8080/8894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Gerrit-Change-Number: 8894
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Comment-Date: Tue, 27 Feb 2018 00:18:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6372: Go parallel for Hive dataload

Reply via email to