[ https://issues.apache.org/jira/browse/IMPALA-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587876#comment-16587876 ]
Pooja Nilangekar edited comment on IMPALA-7418 at 8/22/18 12:32 AM: -------------------------------------------------------------------- Looking at the latest repro of this issues, It differs from IMPALA-7335 in that the the query here is aborted instead of completing successfully. Also, the race here is across fragment instances while in case of IMPALA-7335, the race is between multiple scanner threads of the same fragment instance. In case of IMPALA-7335, the an arbitrary scanner thread sets the done_ flag to true instead of the thread which ran into the error. In this case, the coordinator receives the status update that a backend has completed, however the status_ at the coordinator is set to "Cancel" instead of the error status. Here are the relevant logs at the coordinator: {code:java} I0817 02:38:23.368098 30799 coordinator.cc:685] Backend completed: host=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22002 remaining=3 query_id=c8478c1fd17dd0a0:66545b4000000000 I0817 02:38:23.368122 30799 coordinator-backend-state.cc:254] query_id=c8478c1fd17dd0a0:66545b4000000000: first in-progress backend: shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22000 I0817 02:38:23.368155 30799 coordinator.cc:498] ExecState: query id=c8478c1fd17dd0a0:66545b4000000000 finstance=c8478c1fd17dd0a0:66545b4000000001 on host=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22002 (EXECUTING -> ERROR) status=Cancelled I0817 02:38:23.368180 30799 coordinator-backend-state.cc:377] sending CancelQueryFInstances rpc for query_id=c8478c1fd17dd0a0:66545b4000000000 backend=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22000 {code} Here are the logs which correspond to the executor: {code:java} I0817 02:38:23.339236 12265 hdfs-scan-node.cc:423] Error preparing scanner for scan range hdfs://localhost:20500/test-warehouse/alltypes/year=2010/month=5/100501.txt(0:20853). I0817 02:38:23.365371 12265 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = BadExpr2 prepare error for Scan node (id=0, status_ = ok, done_ = 0) I0817 02:38:23.365622 12272 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, done_ = 1) I0817 02:38:23.365826 12254 query-exec-mgr.cc:155] ReleaseQueryState(): query_id=c8478c1fd17dd0a0:66545b4000000000 refcnt=2 I0817 02:38:23.368677 12270 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, done_ = 1) I0817 02:38:23.370692 12276 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, done_ = 1) I0817 02:38:23.371484 12257 query-state.cc:332] Cancelling fragment instances as directed by the coordinator. Returned status: Cancelled {code} was (Author: poojanilangekar): Looking at the latest repro of this issues, It differs from IMPALA-7335 in that the the query here is aborted instead of completing successfully. Also, the race here is across fragment instances while in case of IMPALA-7335, the race is between multiple scanner threads of the same fragment instance. In case of IMPALA-7335, the an arbitrary scanner thread sets the done_ flag to true instead of the thread which ran into the error. In this case, the coordinator receives the status update that a backend has completed, however the status_ at the coordinator is set to "Cancel" instead of the error status. Here are the relevant logs at the coordinator: {code:java} I0817 02:38:23.368098 30799 coordinator.cc:685] Backend completed: host=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22002 remaining=3 query_id=c8478c1fd17dd0a0:66545b4000000000 I0817 02:38:23.368122 30799 coordinator-backend-state.cc:254] query_id=c8478c1fd17dd0a0:66545b4000000000: first in-progress backend: shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22000 I0817 02:38:23.368155 30799 coordinator.cc:498] ExecState: query id=c8478c1fd17dd0a0:66545b4000000000 finstance=c8478c1fd17dd0a0:66545b4000000001 on host=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22002 (EXECUTING -> ERROR) status=Cancelled I0817 02:38:23.368180 30799 coordinator-backend-state.cc:377] sending CancelQueryFInstances rpc for query_id=c8478c1fd17dd0a0:66545b4000000000 backend=shared-centos64-ec2-m2-4xlarge-ondemand-0c04.vpc.cloudera.com:22000 {code} Here are the logs which correspond to executor: {code:java} I0817 02:38:23.339236 12265 hdfs-scan-node.cc:423] Error preparing scanner for scan range hdfs://localhost:20500/test-warehouse/alltypes/year=2010/month=5/100501.txt(0:20853). I0817 02:38:23.365371 12265 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = BadExpr2 prepare error for Scan node (id=0, status_ = ok, done_ = 0) I0817 02:38:23.365622 12272 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, done_ = 1) I0817 02:38:23.365826 12254 query-exec-mgr.cc:155] ReleaseQueryState(): query_id=c8478c1fd17dd0a0:66545b4000000000 refcnt=2 I0817 02:38:23.368677 12270 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, done_ = 1) I0817 02:38:23.370692 12276 hdfs-scan-node.cc:336] Non-ok status returned by ProcessSplit = Cancelled for Scan node (id=0, status_ = BadExpr2 prepare error, done_ = 1) I0817 02:38:23.371484 12257 query-state.cc:332] Cancelling fragment instances as directed by the coordinator. Returned status: Cancelled {code} > test_udf_errors - returns Cancelled instead of actual error > ----------------------------------------------------------- > > Key: IMPALA-7418 > URL: https://issues.apache.org/jira/browse/IMPALA-7418 > Project: IMPALA > Issue Type: Bug > Components: Backend > Affects Versions: Impala 3.1.0 > Reporter: Tim Armstrong > Assignee: Pooja Nilangekar > Priority: Blocker > Labels: broken-build > > {noformat} > query_test.test_udfs.TestUdfExecution.test_udf_errors[exec_option: > {'disable_codegen_rows_threshold': 0, 'disable_codegen': True, > 'exec_single_node_rows_threshold': 0, 'enable_expr_rewrites': True} | > table_format: text/none] (from pytest) > Failing for the past 1 build (Since Failed#2925 ) > Took 19 sec. > add description > Error Message > query_test/test_udfs.py:415: in test_udf_errors > self.run_test_case('QueryTest/udf-errors', vector, use_db=unique_database) > common/impala_test_suite.py:412: in run_test_case > self.__verify_exceptions(test_section['CATCH'], str(e), use_db) > common/impala_test_suite.py:290: in __verify_exceptions (expected_str, > actual_str) E AssertionError: Unexpected exception string. Expected: > BadExpr2 prepare error E Not found in actual: ImpalaBeeswaxException: Query > aborted:Cancelled > Stacktrace > query_test/test_udfs.py:415: in test_udf_errors > self.run_test_case('QueryTest/udf-errors', vector, use_db=unique_database) > common/impala_test_suite.py:412: in run_test_case > self.__verify_exceptions(test_section['CATCH'], str(e), use_db) > common/impala_test_suite.py:290: in __verify_exceptions > (expected_str, actual_str) > E AssertionError: Unexpected exception string. Expected: BadExpr2 prepare > error > E Not found in actual: ImpalaBeeswaxException: Query aborted:Cancelled > Standard Error > SET sync_ddl=False; > -- executing against localhost:21000 > DROP DATABASE IF EXISTS `test_udf_errors_be4e0293` CASCADE; > MainThread: Started query bd4790b45c20640d:9c62ffba00000000 > SET sync_ddl=False; > -- executing against localhost:21000 > CREATE DATABASE `test_udf_errors_be4e0293`; > MainThread: Started query 474595a3ecba67bd:7a14c8400000000 > MainThread: Created database "test_udf_errors_be4e0293" for test ID > "query_test/test_udfs.py::TestUdfExecution::()::test_udf_errors[exec_option: > {'disable_codegen_rows_threshold': 0, 'disable_codegen': True, > 'exec_single_node_rows_threshold': 0, 'enable_expr_rewrites': True} | > table_format: text/none]" > -- executing against localhost:21000 > use test_udf_errors_be4e0293; > MainThread: Started query 264b0cd09d289c09:cc5dafed00000000 > SET disable_codegen_rows_threshold=0; > SET disable_codegen=True; > SET exec_single_node_rows_threshold=0; > SET enable_expr_rewrites=True; > -- executing against localhost:21000 > create function if not exists hive_pi() returns double > location '/test-warehouse/hive-exec.jar' > symbol='org.apache.hadoop.hive.ql.udf.UDFPI'; > MainThread: Started query ba41ccb6f020becd:db23209f00000000 > -- executing against localhost:21000 > create function if not exists foo() returns double > location '/test-warehouse/not-a-real-file.so' > symbol='FnDoesNotExist'; > -- executing against localhost:21000 > create function if not exists foo() returns double > location '/test-warehouse/not-a-real-file.so' > symbol='FnDoesNotExist'; > -- executing against localhost:21000 > create function if not exists foo (string, string) returns string location > '/test-warehouse/test_udf_errors_be4e0293_bad_udf.ll' symbol='MyAwesomeUdf'; > -- executing against localhost:21000 > create function if not exists twenty_args(int, int, int, int, int, int, > int, int, int, int, int, int, int, int, int, int, int, int, int, int) > returns int > location '/test-warehouse/libTestUdfs.so' > symbol='TwentyArgs'; > MainThread: Started query 6b4dc82f22e2f0f6:9d28ab0300000000 > -- executing against localhost:21000 > select twenty_args(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20); > MainThread: Started query d40ea0f1effacd1:22e5c31f00000000 > -- executing against localhost:21000 > create function if not exists twenty_one_args(int, int, int, int, int, int, > int, int, int, int, int, int, int, int, int, int, int, int, int, int, > int) returns int > location '/test-warehouse/libTestUdfs.so' > symbol='TwentyOneArgs'; > MainThread: Started query 12453a7e4b13fa4d:d163be3300000000 > -- executing against localhost:21000 > select twenty_one_args(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21); > MainThread: Started query 26461e2ce5ce3adf:3544166a00000000 > -- executing against localhost:21000 > create function if not exists nine_args_ir(int, int, int, int, int, int, > int, int, int) returns int > location '/test-warehouse/test-udfs.ll' > symbol='NineArgs'; > MainThread: Started query 5b4599e5b1728cf3:47d44f2100000000 > -- executing against localhost:21000 > select nine_args_ir(1,2,3,4,5,6,7,8,9); > MainThread: Started query a64b4c1f588c7201:3ecfc39600000000 > -- executing against localhost:21000 > create function if not exists bad_expr(double) returns boolean > location '/test-warehouse/libTestUdfs.so' > symbol='BadExpr' prepare_fn='BadExprPrepare' close_fn='BadExprClose'; > MainThread: Started query 9649b1c86826ffba:ef97b6bb00000000 > -- executing against localhost:21000 > create function if not exists bad_expr2(double) returns boolean > location '/test-warehouse/libTestUdfs.so' > symbol='BadExpr' prepare_fn='BadExpr2Prepare' close_fn='BadExprClose'; > MainThread: Started query ec42f91cdc2b0a0e:53c863d800000000 > -- executing against localhost:21000 > select count(t1.int_col) from functional.alltypes t1 join functional.alltypes > t2 > on (bad_expr(rand()) = (t2.bool_col && t1.bool_col)); > MainThread: Started query 9b4a86470ad7a5ba:f3c62e6000000000 > -- executing against localhost:21000 > select count(t1.int_col) from functional.alltypes t1 join functional.alltypes > t2 > on (bad_expr2(rand()) = (t2.bool_col && t1.bool_col)); > MainThread: Started query e24faf2aa198849f:bccd6d6000000000 > -- executing against localhost:21000 > select count(int_col) from functional.alltypes where bad_expr(rand()); > MainThread: Started query f6400e152de0995a:a2fc1d4800000000 > -- executing against localhost:21000 > select count(int_col) from functional.alltypes where bad_expr2(rand()); > MainThread: Started query 4f4c1edb15bde2c3:9e247d1400000000 > 18/08/09 03:15:22 INFO fs.TrashPolicyDefault: Moved: > 'hdfs://localhost:20500/test-warehouse/test_udf_errors_be4e0293_bad_udf.ll' > to trash at: > hdfs://localhost:20500/user/ubuntu/.Trash/Current/test-warehouse/test_udf_errors_be4e0293_bad_udf.ll > {noformat} > This started failing on Ubuntu 16.04 GVOs. E.g. > https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/2925 > https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/2923/ > I can reproduce locally after looping for ~4-5 iterations on average with > {noformat} > while impala-py.test tests/query_test/test_udfs.py -k errors -n4 --verbose; > do date; done > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org