[jira] [Updated] (IMPALA-7088) Parallel data load breaks load-data.py if loading data on a real cluster

David Knupp (JIRA) Tue, 29 May 2018 10:13:08 -0700


     [ https://issues.apache.org/jira/browse/IMPALA-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


David Knupp updated IMPALA-7088:
--------------------------------
    Description: 
{{Impala/bin/load-data.py}} is most commonly used to load test data onto a 
simulated standalone cluster running on the local host. However, with the 
correct inputs, it can also be used to load data onto an actual remote cluster.

A recent enhancement in the load-data.py script to parallelize parts of the 
data loading process -- https://github.com/apache/impala/commit/d481cd48 -- has 
introduced a regression in the latter use case:

From {{$IMPALA_HOME/logs/data_loading/data-load-functional-exhaustive.log}}:
{noformat}
Created table functional_hbase.widetable_1000_cols
Took 0.7121 seconds
09:48:01 Beginning execution of hive SQL: /home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/logs/data_loading/sql/functional/load-functional-query-exhaustive-hive-generated-text-none-none.sql
Traceback (most recent call last):
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 494, in <module>
    if __name__ == "__main__": main()
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 468, in main
    hive_exec_query_files_parallel(thread_pool, hive_load_text_files)
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 299, in hive_exec_query_files_parallel
    exec_query_files_parallel(thread_pool, query_files, 'hive')
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 290, in exec_query_files_parallel
    for result in thread_pool.imap_unordered(execution_function, query_files):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
    raise value
TypeError: coercing to Unicode: need string or buffer, NoneType found
{noformat}
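
The traceback above ends in multiprocessing/pool.py at "raise value", which is where the pool re-raises in the parent an exception that occurred in a worker, so the worker frame that actually failed is not shown. The following is a minimal Python 3 sketch of one way to surface that frame, assuming the pool is a multiprocessing.pool.ThreadPool. The names exec_query_file, traced, and run_all are hypothetical and are not taken from load-data.py, and passing None as a file path is only one plausible trigger for this kind of TypeError, not a confirmed root cause.

```python
import traceback
from multiprocessing.pool import ThreadPool


def exec_query_file(query_file):
    """Hypothetical worker standing in for the per-file execution function."""
    # If query_file is None, open() raises TypeError inside the worker thread.
    with open(query_file) as f:
        return f.read()


def traced(func):
    """Wrap a worker so a failure is returned as text instead of being raised.

    Each call returns a tuple (arg, result, traceback_text); exactly one of
    result and traceback_text is None.
    """
    def wrapper(arg):
        try:
            return arg, func(arg), None
        except Exception:
            # format_exc() captures the worker-side frames that the pool
            # would otherwise drop when it re-raises in the parent.
            return arg, None, traceback.format_exc()
    return wrapper


def run_all(query_files, num_workers=4):
    """Run every query file through the pool and collect all outcomes."""
    pool = ThreadPool(processes=num_workers)
    try:
        return list(pool.imap_unordered(traced(exec_query_file), query_files))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    for arg, result, tb in run_all([None]):
        if tb is not None:
            print("worker failed for input %r:\n%s" % (arg, tb))
```

Returning the formatted traceback rather than raising also lets the loop report every failing input instead of stopping at the first one.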


  was:
Impala/bin/load-data.py is most commonly used to load test data onto a 
simulated standalone cluster running on the local host. However, with the 
correct inputs, it can also be used to load data onto an actual remote cluster.

A recent enhancement in the load-data.py script to parallelize parts of the 
data loading process -- https://github.com/apache/impala/commit/d481cd48 -- has 
introduced a regression in the latter use case:

From *$IMPALA_HOME/logs/data_loading/data-load-functional-exhaustive.log*:
{noformat}
Created table functional_hbase.widetable_1000_cols
Took 0.7121 seconds
09:48:01 Beginning execution of hive SQL: /home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/logs/data_loading/sql/functional/load-functional-query-exhaustive-hive-generated-text-none-none.sql
Traceback (most recent call last):
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 494, in <module>
    if __name__ == "__main__": main()
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 468, in main
    hive_exec_query_files_parallel(thread_pool, hive_load_text_files)
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 299, in hive_exec_query_files_parallel
    exec_query_files_parallel(thread_pool, query_files, 'hive')
  File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 290, in exec_query_files_parallel
    for result in thread_pool.imap_unordered(execution_function, query_files):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
    raise value
TypeError: coercing to Unicode: need string or buffer, NoneType found
{noformat}



> Parallel data load breaks load-data.py if loading data on a real cluster
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-7088
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7088
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.0
>            Reporter: David Knupp
>            Priority: Blocker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
