Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8894 )

Change subject: PREVIEW: IMPALA-6372: Go parallel for Hive dataload
......................................................................


Patch Set 6:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py
File bin/load-data.py:

http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@a114
PS6, Line 114:
             :
             :
             :
             :
             :
             :
             :
             :
             :
This comment is still needed with Impyla HS2, add to exec_hive_query_from_file.


http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@42
PS6, Line 42: num_processes = multiprocessing.cpu_count()
It might make sense for this to be a command line argument.


http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@151
PS6, Line 151:       with connect(host=hs2_host, port=hs2_port, 
auth_mechanism="PLAIN", \
             :                    user=getpass.getuser(), database="default") 
as conn:
Currently only works for non-secure cluster. This a regression compared to the 
Beeline execution, but even the current Beeline command doesn't work if the 
cluster has SSL. We never use this script on a secure cluster now, but we will 
want to in future.

Revert to Beeline or add long comment about making this work with secure.


http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@163
PS6, Line 163: out_file.write("ERROR: {0}\n{1}\n".format(query, str(e)))
Might be nice to rename the file to be distinctive in the case of an error. 
This would make debugging an error easier.


http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@275
PS6, Line 275: print "Execution of {0} SQL file {1} 
failed.".format(execution_type, query_file)
This should include information about the log file to look at.


http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@280
PS6, Line 280: def get_files_by_pattern(directory_contents, file_pattern):
             :   return [f for f in directory_contents if file_pattern in f]
Unused. Remove.


http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@315
PS6, Line 315:     #os.chdir(sql_dir)
Remove and clarify the need for absolute paths rather than relative paths.


http://gerrit.cloudera.org:8080/#/c/8894/6/bin/load-data.py@336
PS6, Line 336:     # Use maximum parallelism for this part, as these are almost 
entirely metadata
             :     # operations.
Comment out of date. Remove


http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/bin/generate-schema-statements.py@627
PS6, Line 627: invalidate_table_stmt = "INVALIDATE METADATA 
{0}.{1};\n".format(db, table_name)
             :       impala_invalidate.create.append(invalidate_table_stmt)
Some tables should be able to use refresh instead. I think only Hive-created 
tables need invalidate.


http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/8894/6/testdata/datasets/functional/functional_schema_template.sql@330
PS6, Line 330: USE {db_name}{db_suffix};
Not needed, remove.



--
To view, visit http://gerrit.cloudera.org:8080/8894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Gerrit-Change-Number: 8894
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Comment-Date: Sat, 06 Jan 2018 02:15:57 +0000
Gerrit-HasComments: Yes

Reply via email to