[jira] [Commented] (IMPALA-7088) Parallel data load breaks load-data.py if loading data on a real cluster

ASF subversion and git services (JIRA) Thu, 31 May 2018 22:59:17 -0700


    [ https://issues.apache.org/jira/browse/IMPALA-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497611#comment-16497611 ]


ASF subversion and git services commented on IMPALA-7088:
---------------------------------------------------------

Commit 2eaf74a0535528c93e0794a524271d6ce6b79372 in impala's branch 
refs/heads/2.x from [~joemcdonnell]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=2eaf74a ]

IMPALA-7088: Fix uninitialized variable in cluster dataload

bin/load-data.py uses a unique directory for local Hive
execution to avoid a race condition when executing multiple
Hive commands at once. This unique directory is not needed
when loading on a real cluster. However, the code to remove
the unique directory at the end does not handle this
correctly.

This skips the code to remove the unique directory when
it is uninitialized.

Change-Id: I5581a45460dc341842d77eaa09647e50f35be6c7
Reviewed-on: http://gerrit.cloudera.org:8080/10526
Reviewed-by: Joe McDonnell <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
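
The fix described in the commit message above can be sketched roughly as
follows. This is a minimal illustration, not the actual patch: the function
names make_unique_dir and cleanup_unique_dir and the is_local_cluster flag
are hypothetical and do not come from bin/load-data.py. It assumes the
unique directory is simply left as None when loading on a real cluster.

```python
import os
import shutil
import tempfile


def make_unique_dir(is_local_cluster):
    """Return a fresh scratch directory for local Hive runs, else None.

    Hypothetical helper: on a real cluster no unique directory is needed,
    so the value stays uninitialized (None).
    """
    if not is_local_cluster:
        return None
    return tempfile.mkdtemp(prefix="hive-local-")


def cleanup_unique_dir(unique_dir):
    """Remove the scratch directory only if one was actually created.

    Returns True if a directory was removed, False if cleanup was skipped.
    """
    if unique_dir is None:
        # Nothing was created for a real-cluster load, so skip removal
        # instead of passing None to a filesystem call.
        return False
    shutil.rmtree(unique_dir)
    return True
```

With this guard, a real-cluster run (unique directory of None) skips the
removal step entirely, while a local run still cleans up after itself.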


> Parallel data load breaks load-data.py if loading data on a real cluster
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-7088
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7088
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.0
>            Reporter: David Knupp
>            Assignee: Joe McDonnell
>            Priority: Blocker
>
> {{Impala/bin/load-data.py}} is most commonly used to load test data onto a 
> simulated standalone cluster running on the local host. However, with the 
> correct inputs, it can also be used to load data onto an actual cluster 
> running on remote hosts.
> A recent enhancement in the load-data.py script to parallelize parts of the 
> data loading process -- https://github.com/apache/impala/commit/d481cd48 -- 
> has introduced a regression in the latter use case:
> From {{$IMPALA_HOME/logs/data_loading/data-load-functional-exhaustive.log}}:
> {noformat}
> Created table functional_hbase.widetable_1000_cols
> Took 0.7121 seconds
> 09:48:01 Beginning execution of hive SQL: /home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/logs/data_loading/sql/functional/load-functional-query-exhaustive-hive-generated-text-none-none.sql
> Traceback (most recent call last):
>   File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 494, in <module>
>     if __name__ == "__main__": main()
>   File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 468, in main
>     hive_exec_query_files_parallel(thread_pool, hive_load_text_files)
>   File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 299, in hive_exec_query_files_parallel
>     exec_query_files_parallel(thread_pool, query_files, 'hive')
>   File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 290, in exec_query_files_parallel
>     for result in thread_pool.imap_unordered(execution_function, query_files):
>   File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
>     raise value
> TypeError: coercing to Unicode: need string or buffer, NoneType found
> {noformat}
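
The traceback above ends inside the pool's iterator rather than in the
worker, because a thread pool re-raises a worker's exception in the caller
when the result is consumed. The sketch below reproduces that pattern under
stated assumptions: exec_query_file and run_all are hypothetical names, and
os.path.join is only a stand-in for whichever call received None in
load-data.py (the log does not show it). On Python 3 the TypeError message
differs from the Python 2.7 wording in the log.

```python
import os
from multiprocessing.pool import ThreadPool


def exec_query_file(args):
    """Hypothetical worker: build a path under the scratch directory."""
    scratch_dir, query_file = args
    # os.path.join raises TypeError when scratch_dir is None.
    return os.path.join(scratch_dir, os.path.basename(query_file))


def run_all(scratch_dir, query_files):
    """Run the worker over all files; worker errors surface here."""
    pool = ThreadPool(processes=2)
    try:
        work = [(scratch_dir, f) for f in query_files]
        # Iterating imap_unordered re-raises any exception from a worker,
        # which is why the traceback points at the pool's next().
        return sorted(pool.imap_unordered(exec_query_file, work))
    finally:
        pool.close()
        pool.join()
```

Calling run_all(None, ["a.sql"]) raises TypeError from the iteration in the
caller, which matches the shape of the failure reported in the log.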



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

