[ 
https://issues.apache.org/jira/browse/IMPALA-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926348#comment-16926348
 ] 

ASF subversion and git services commented on IMPALA-8904:
---------------------------------------------------------

Commit 19cb8dc1c1c2247e91adc4bf62cab27a7c1e4381 in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=19cb8dc ]

IMPALA-8904: retry statestore RegisterSubscriber() RPC

Previously connection failures triggered a retry, but
failures on the actual RPC did not trigger a retry. This
change moves the retry loop to DoRpcWithRetry(), instead
of relying on the ClientCache to retry the connection.

Note that DoRpcWithRetry() for thrift was dead code since
most backend RPCs were ported to KRPC, but should still work.
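
For illustration only, here is a minimal sketch of the general pattern the
commit describes: a retry loop that wraps the whole RPC call, so a failure
of the RPC itself is retried and not only a failed connection. This is not
Impala's DoRpcWithRetry(). The names RpcStatus, RetryRpc and FlakyRegister,
and the fixed backoff, are all hypothetical. FlakyRegister stands in for an
RPC that fails a set number of times first, loosely mirroring the debug
action mentioned under "Testing".

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical minimal status type; this is not Impala's Status class.
struct RpcStatus {
  bool ok;
  std::string msg;
};

// Calls rpc() up to max_attempts times, sleeping for `backoff` between
// failed attempts. Returns the first successful status, or the last
// failure if every attempt failed. If attempts_made is non-null, it
// receives the number of calls actually made.
RpcStatus RetryRpc(const std::function<RpcStatus()>& rpc, int max_attempts,
                   std::chrono::milliseconds backoff,
                   int* attempts_made = nullptr) {
  RpcStatus status{false, "no attempts made"};
  int attempts = 0;
  while (attempts < max_attempts) {
    ++attempts;
    status = rpc();
    if (status.ok) break;
    // Only sleep if another attempt will follow.
    if (attempts < max_attempts) std::this_thread::sleep_for(backoff);
  }
  if (attempts_made != nullptr) *attempts_made = attempts;
  return status;
}

// Hypothetical stand-in for a registration RPC that fails a fixed number
// of times before succeeding.
struct FlakyRegister {
  int failures_left;
  RpcStatus operator()() {
    if (failures_left > 0) {
      --failures_left;
      return {false, "injected failure"};
    }
    return {true, ""};
  }
};
```

With this sketch, RetryRpc(FlakyRegister{1}, 3, std::chrono::milliseconds(100))
succeeds on the second attempt. A call whose failures outnumber
max_attempts returns the last error, so the caller can still exit when
there genuinely is no statestore available.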

Testing:
Added targeted test with debug action to inject error on first
subscribe RPC.

Change-Id: I5d4e6283b5ec83170a1d1d03075b3384a9f108b5
Reviewed-on: http://gerrit.cloudera.org:8080/14198
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Daemons fails fast when statestore has not started up
> -----------------------------------------------------
>
>                 Key: IMPALA-8904
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8904
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.1.0, Impala 3.2.0, Impala 3.3.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>
> If you start the statestored and the other services at the same time, there 
> is a race between the statestore starting and the other services trying to 
> register with it. If the other services "win" the race, they abort startup 
> because they can't register with the statestore.
> The log looks like this:
> {noformat}
> I0828 00:19:10.460000     1 statestore-subscriber.cc:219] Starting statestore subscriber
> I0828 00:19:10.461310     1 thrift-server.cc:451] ThriftServer 'StatestoreSubscriber' started on port: 23000
> I0828 00:19:10.461320     1 statestore-subscriber.cc:247] Registering with statestore
> I0828 00:19:10.461309   299 TAcceptQueueServer.cpp:314] connection_setup_thread_pool_size is set to 2
> I0828 00:19:10.462744     1 statestore-subscriber.cc:253] statestore registration unsuccessful: RPC Error: Client for statestored:24000 hit an unexpected exception: No more data to read., type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala27TRegisterSubscriberResponseE, send: done
> E0828 00:19:10.462818     1 impalad-main.cc:90] Impalad services did not start correctly, exiting.  Error: RPC Error: Client for statestored:24000 hit an unexpected exception: No more data to read., type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala27TRegisterSubscriberResponseE, send: done
> Statestore subscriber did not start up.
> {noformat}
> Most management systems will automatically restart failed processes, so 
> typically the impalads will come back up and find the statestore, but the 
> crash loop is unnecessary.
> I propose that the services should retry for a while before giving up (we 
> still want the services to fail when there genuinely isn't a statestore 
> available).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
