[ 
https://issues.apache.org/jira/browse/IMPALA-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587876#comment-16587876
 ] 

Pooja Nilangekar edited comment on IMPALA-7418 at 8/22/18 12:32 AM:
--------------------------------------------------------------------

Looking at the latest repro of this issue, it differs from IMPALA-7335 in that 
the query here is aborted instead of completing successfully. Also, the race 
here is across fragment instances, while in IMPALA-7335 the race is between 
multiple scanner threads of the same fragment instance. In IMPALA-7335, an 
arbitrary scanner thread sets the done_ flag to true instead of the thread 
that ran into the error. In this case, the coordinator receives a status 
update that a backend has completed, but the status_ at the coordinator is 
set to "Cancelled" instead of the error status.

Here are the relevant logs at the coordinator:
{code:java}
 I0817 02:38:23.368098 30799 coordinator.cc:685] Backend completed: 
host=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22002 
remaining=3 query_id=c8478c1fd17dd0a0:66545b4000000000
 I0817 02:38:23.368122 30799 coordinator-backend-state.cc:254] 
query_id=c8478c1fd17dd0a0:66545b4000000000: first in-progress backend: 
shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22000
 I0817 02:38:23.368155 30799 coordinator.cc:498] ExecState: query 
id=c8478c1fd17dd0a0:66545b4000000000 
finstance=c8478c1fd17dd0a0:66545b4000000001 on 
host=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22002 
(EXECUTING -> ERROR) status=Cancelled
 I0817 02:38:23.368180 30799 coordinator-backend-state.cc:377] sending 
CancelQueryFInstances rpc for query_id=c8478c1fd17dd0a0:66545b4000000000 
backend=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22000
{code}
 

Here are the corresponding logs from the executor:
{code:java}
I0817 02:38:23.339236 12265 hdfs-scan-node.cc:423] Error preparing scanner for 
scan range 
hdfs://localhost:20500/test-warehouse/alltypes/year=2010/month=5/100501.txt(0:20853).
I0817 02:38:23.365371 12265 hdfs-scan-node.cc:336] Non-ok status returned by 
ProcessSplit = BadExpr2 prepare error for Scan node (id=0, status_ = ok, done_ 
= 0)
I0817 02:38:23.365622 12272 hdfs-scan-node.cc:336] Non-ok status returned by 
ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, 
done_ = 1)
I0817 02:38:23.365826 12254 query-exec-mgr.cc:155] ReleaseQueryState(): 
query_id=c8478c1fd17dd0a0:66545b4000000000 refcnt=2
I0817 02:38:23.368677 12270 hdfs-scan-node.cc:336] Non-ok status returned by 
ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, 
done_ = 1)
I0817 02:38:23.370692 12276 hdfs-scan-node.cc:336] Non-ok status returned by 
ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, 
done_ = 1)
I0817 02:38:23.371484 12257 query-state.cc:332] Cancelling fragment instances 
as directed by the coordinator. Returned status: Cancelled
{code}
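
The executor logs above show the shape of the race: thread 12265 hits the real error (BadExpr2 prepare error) while status_ is still ok, but by the time the update reaches the coordinator, cancellation has already won. As a rough illustration only (this is a simplified toy model, not Impala code; the class and names are hypothetical), the first-writer-wins behavior can be sketched as:

```python
import threading

class ToyScanNode:
    """Hypothetical model of the status race: only the first non-ok
    status to acquire the lock is recorded; later statuses are dropped."""
    def __init__(self):
        self._lock = threading.Lock()
        self.status = "OK"
        self.done = False

    def set_done(self, status):
        # Mimics the first thread to mark the node done winning the race.
        with self._lock:
            if not self.done:
                self.done = True
                self.status = status

node = ToyScanNode()
# Force the losing interleaving deterministically: cancellation lands first.
t_cancel = threading.Thread(target=node.set_done, args=("Cancelled",))
t_error = threading.Thread(target=node.set_done, args=("BadExpr2 prepare error",))
t_cancel.start(); t_cancel.join()
t_error.start(); t_error.join()
print(node.status)  # -> Cancelled: the real error status is lost
```

If the cancelling thread reaches set_done first, the genuine error status never propagates, which matches the coordinator reporting status=Cancelled rather than the BadExpr2 prepare error.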



> test_udf_errors - returns Cancelled instead of actual error
> -----------------------------------------------------------
>
>                 Key: IMPALA-7418
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7418
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.1.0
>            Reporter: Tim Armstrong
>            Assignee: Pooja Nilangekar
>            Priority: Blocker
>              Labels: broken-build
>
> {noformat}
> query_test.test_udfs.TestUdfExecution.test_udf_errors[exec_option: 
> {'disable_codegen_rows_threshold': 0, 'disable_codegen': True, 
> 'exec_single_node_rows_threshold': 0, 'enable_expr_rewrites': True} | 
> table_format: text/none] (from pytest)
> Failing for the past 1 build (Since Failed#2925 )
> Took 19 sec.
> add description
> Error Message
> query_test/test_udfs.py:415: in test_udf_errors     
> self.run_test_case('QueryTest/udf-errors', vector, use_db=unique_database) 
> common/impala_test_suite.py:412: in run_test_case     
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db) 
> common/impala_test_suite.py:290: in __verify_exceptions     (expected_str, 
> actual_str) E   AssertionError: Unexpected exception string. Expected: 
> BadExpr2 prepare error E   Not found in actual: ImpalaBeeswaxException: Query 
> aborted:Cancelled
> Stacktrace
> query_test/test_udfs.py:415: in test_udf_errors
>     self.run_test_case('QueryTest/udf-errors', vector, use_db=unique_database)
> common/impala_test_suite.py:412: in run_test_case
>     self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:290: in __verify_exceptions
>     (expected_str, actual_str)
> E   AssertionError: Unexpected exception string. Expected: BadExpr2 prepare 
> error
> E   Not found in actual: ImpalaBeeswaxException: Query aborted:Cancelled
> Standard Error
> SET sync_ddl=False;
> -- executing against localhost:21000
> DROP DATABASE IF EXISTS `test_udf_errors_be4e0293` CASCADE;
> MainThread: Started query bd4790b45c20640d:9c62ffba00000000
> SET sync_ddl=False;
> -- executing against localhost:21000
> CREATE DATABASE `test_udf_errors_be4e0293`;
> MainThread: Started query 474595a3ecba67bd:7a14c8400000000
> MainThread: Created database "test_udf_errors_be4e0293" for test ID 
> "query_test/test_udfs.py::TestUdfExecution::()::test_udf_errors[exec_option: 
> {'disable_codegen_rows_threshold': 0, 'disable_codegen': True, 
> 'exec_single_node_rows_threshold': 0, 'enable_expr_rewrites': True} | 
> table_format: text/none]"
> -- executing against localhost:21000
> use test_udf_errors_be4e0293;
> MainThread: Started query 264b0cd09d289c09:cc5dafed00000000
> SET disable_codegen_rows_threshold=0;
> SET disable_codegen=True;
> SET exec_single_node_rows_threshold=0;
> SET enable_expr_rewrites=True;
> -- executing against localhost:21000
> create function if not exists hive_pi() returns double
> location '/test-warehouse/hive-exec.jar'
> symbol='org.apache.hadoop.hive.ql.udf.UDFPI';
> MainThread: Started query ba41ccb6f020becd:db23209f00000000
> -- executing against localhost:21000
> create function if not exists foo() returns double
> location '/test-warehouse/not-a-real-file.so'
> symbol='FnDoesNotExist';
> -- executing against localhost:21000
> create function if not exists foo() returns double
> location '/test-warehouse/not-a-real-file.so'
> symbol='FnDoesNotExist';
> -- executing against localhost:21000
> create function if not exists foo (string, string) returns string location
> '/test-warehouse/test_udf_errors_be4e0293_bad_udf.ll' symbol='MyAwesomeUdf';
> -- executing against localhost:21000
> create function if not exists twenty_args(int, int, int, int, int, int,
>     int, int, int, int, int, int, int, int, int, int, int, int, int, int) 
> returns int
> location '/test-warehouse/libTestUdfs.so'
> symbol='TwentyArgs';
> MainThread: Started query 6b4dc82f22e2f0f6:9d28ab0300000000
> -- executing against localhost:21000
> select twenty_args(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
> MainThread: Started query d40ea0f1effacd1:22e5c31f00000000
> -- executing against localhost:21000
> create function if not exists twenty_one_args(int, int, int, int, int, int,
>     int, int, int, int, int, int, int, int, int, int, int, int, int, int, 
> int) returns int
> location '/test-warehouse/libTestUdfs.so'
> symbol='TwentyOneArgs';
> MainThread: Started query 12453a7e4b13fa4d:d163be3300000000
> -- executing against localhost:21000
> select twenty_one_args(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21);
> MainThread: Started query 26461e2ce5ce3adf:3544166a00000000
> -- executing against localhost:21000
> create function if not exists nine_args_ir(int, int, int, int, int, int,
>     int, int, int) returns int
> location '/test-warehouse/test-udfs.ll'
> symbol='NineArgs';
> MainThread: Started query 5b4599e5b1728cf3:47d44f2100000000
> -- executing against localhost:21000
> select nine_args_ir(1,2,3,4,5,6,7,8,9);
> MainThread: Started query a64b4c1f588c7201:3ecfc39600000000
> -- executing against localhost:21000
> create function if not exists bad_expr(double) returns boolean
> location '/test-warehouse/libTestUdfs.so'
> symbol='BadExpr' prepare_fn='BadExprPrepare' close_fn='BadExprClose';
> MainThread: Started query 9649b1c86826ffba:ef97b6bb00000000
> -- executing against localhost:21000
> create function if not exists bad_expr2(double) returns boolean
> location '/test-warehouse/libTestUdfs.so'
> symbol='BadExpr' prepare_fn='BadExpr2Prepare' close_fn='BadExprClose';
> MainThread: Started query ec42f91cdc2b0a0e:53c863d800000000
> -- executing against localhost:21000
> select count(t1.int_col) from functional.alltypes t1 join functional.alltypes 
> t2
> on (bad_expr(rand()) = (t2.bool_col && t1.bool_col));
> MainThread: Started query 9b4a86470ad7a5ba:f3c62e6000000000
> -- executing against localhost:21000
> select count(t1.int_col) from functional.alltypes t1 join functional.alltypes 
> t2
> on (bad_expr2(rand()) = (t2.bool_col && t1.bool_col));
> MainThread: Started query e24faf2aa198849f:bccd6d6000000000
> -- executing against localhost:21000
> select count(int_col) from functional.alltypes where bad_expr(rand());
> MainThread: Started query f6400e152de0995a:a2fc1d4800000000
> -- executing against localhost:21000
> select count(int_col) from functional.alltypes where bad_expr2(rand());
> MainThread: Started query 4f4c1edb15bde2c3:9e247d1400000000
> 18/08/09 03:15:22 INFO fs.TrashPolicyDefault: Moved: 
> 'hdfs://localhost:20500/test-warehouse/test_udf_errors_be4e0293_bad_udf.ll' 
> to trash at: 
> hdfs://localhost:20500/user/ubuntu/.Trash/Current/test-warehouse/test_udf_errors_be4e0293_bad_udf.ll
> {noformat}
> This started failing on Ubuntu 16.04 GVOs. E.g. 
> https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/2925 
> https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/2923/
> I can reproduce locally after looping for ~4-5 iterations on average with 
> {noformat}
> while impala-py.test tests/query_test/test_udfs.py -k errors -n4 --verbose; 
> do date; done
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
