Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10120 )
Change subject: IMPALA-6899: Optimize the HDFS commands used in dataload ...................................................................... IMPALA-6899: Optimize the HDFS commands used in dataload HDFS commandline calls can be expensive due to JVM startup and other costs. Since most HDFS commandline calls can take multiple paths, one way to reduce execution time is to consolidate multiple HDFS commands into a single HDFS call. Since HDFS put commands will follow symbolic links and can copy recursively, this can allow for further consolidation by creating the full directory structure and copying it in a single HDFS call. This does several of these optimizations throughout the dataload codepath. It saves a few seconds here and there: Loading Hive Builtins: 1:10 -> 0:30 Loading custom schemas: 0:35 -> 0:20 Loading Hive UDFs: 0:45 -> 0:25 Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282 Reviewed-on: http://gerrit.cloudera.org:8080/10120 Reviewed-by: Philip Zeyliger <[email protected]> Tested-by: Impala Public Jenkins <[email protected]> --- M testdata/bin/copy-udfs-udas.sh M testdata/bin/create-load-data.sh M testdata/bin/load-hive-builtins.sh 3 files changed, 131 insertions(+), 122 deletions(-) Approvals: Philip Zeyliger: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/10120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282 Gerrit-Change-Number: 10120 Gerrit-PatchSet: 5 Gerrit-Owner: Joe McDonnell <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
