Hi Canh Van,

Please check my comments as [AVM].

-AVM

On 5/10/2017 12:06 PM, Canh Van Truong wrote:
>
> Hi Mahesh,
>
> Thanks for your update.
>
> Please check my comments.
>
> Regards
>
> Canh
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Wednesday, May 10, 2017 6:21 AM
> To: Canh Van Truong <[email protected]>; 
> [email protected]; [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH 1/1] log: fix agent gets TRY_AGAIN instead TIMEOUT 
> during failover [#2411]
>
> Hi Canh Van,
>
> The update one:
>
> On 5/5/2017 6:37 PM, Canh Van Truong wrote:
>
> > During testing, e.g: use saflogger in loop to send a log record to 
> log service
>
> > during failover, we frequently encounter saLogStreamClose() or 
> saLogFinalize()
>
> > get SA_AIS_ERR_TIMEOUT as active LOG service is shutdown while agent 
> requests
>
> > still remain in mailbox.
>
> > The close request has come to active LOG, means it is put to the 
> mailbox but not
>
> > yet pick up out for processing yet as LOGsv is just getting TERM 
> signal, then short
>
> >   time later, it does kill itself.
>
> > The LOG agent did not get the ACK response for sync close request, 
> and after 10s expired,
>
> > MDS returns TIMEOUT back to the log agent.
>
> > The solution for this is that, before calling daemon_exit, iterate 
> all items in its mailbox,
>
> > if that is agent request, send response with try again error to 
> agent before going to shutdown
>
> >Application interprets both differently  , it is true in both cases
>
> >call  is NOT succeeded,our current
>
> >understanding is SA_AIS_ERR_TRY_AGAIN is to be send in very shorter
>
> >delay of service some millisecond,
>
> [Canh] The TRY_AGAIN means that the service cannot provide at this 
> time. The role is in transition at this time.
>
> The log service of active node are going to terminating. The 
> application can send request again later in delay of time when standby 
> node
>
> up to ACTIVE.
>
[AVM] NO,  fail-over is NOT role change of original destination , new 
destination is taking over the role  ( time is unpredictable)
              and let us take condition  of double fault , application 
will in continues TRY_AGAIN.

> >if SA_AIS_ERR_TIMEOUT   comes mean destination is not reachable ( and we
>
> >don't have any concept
>
> >of short time non reachable fail-over takes it time) this means
>
> >application should not retry in millisecond, it should
>
> >evaluate event that occurred in cluster based not that application may
>
> >try or may not try again so let the application handle
>
> >it differently.
>
> [Canh]  The TIMEOUT means that the requests may be succeeded or 
> whether it did not. In this case, we know that the request is not
>
> succeeded to be processed. It’s in mailbox of logs of the node are 
> going to terminating and it will be killed itself when the log service 
> is terminated.
>
> Because the FD_TERM signal event is processed before the FD_MBX event.
>
> The mds send the return  TIMEOUT to log agent after timeout of request.
>
>  So I fix to return TRY_AGAIN instead of TIMEOUT here
>
>  Do I have any wrong things or misunderstand here?
>

[AVM]  yes,  you re trying  merge/optimize the TRY_AGAIN and TIMEOUT in 
to same batch
TRY_AGAIN is used for very shorter delay of service some millisecond,  
and in fail-over  case delay of service unpredictable
if double fault  occurs , application will in continues TRY_AGAIN and 
clueless and not recoverable .

So let use us not confuse the application ,  and allow Application 
interprets both differently .

> >=====================================================================================================
>
> >SA_AIS_ERR_TIMEOUT - An implementation-dependent timeout occurred before
>
> >the call could complete. It is unspecified whether the call succeeded or
>
> >whether it did
>
> >not.
>
> >SA_AIS_ERR_TRY_AGAIN - The service cannot be provided at this time. The
>
> >process
>
> >may retry later.
>
> >=====================================================================================================
>
> -AVM
>
> > ---
>
> >   src/log/logd/lgs_cb.h    |  1 +
>
> >   src/log/logd/lgs_evt.cc  | 25 ++++++++++++++-----------
>
> >   src/log/logd/lgs_main.cc |  8 ++++++++
>
> >   3 files changed, 23 insertions(+), 11 deletions(-)
>
> >
>
> > diff --git a/src/log/logd/lgs_cb.h b/src/log/logd/lgs_cb.h
>
> > index bfd2822cc..b2a29bad9 100644
>
> > --- a/src/log/logd/lgs_cb.h
>
> > +++ b/src/log/logd/lgs_cb.h
>
> > @@ -81,6 +81,7 @@ typedef struct lgs_cb {
>
> >     SaInvocationT
>
> >         amf_invocation_id; /* AMF InvocationID - needed to handle 
> Quiesed state */
>
> >     bool is_quiesced_set;
>
> > +  bool is_terminating_set;  /* Flag for osaflogd are terminating */
>
> >     SaImmOiHandleT immOiHandle; /* IMM OI 
> handle                           */
>
> >     SaSelectionObjectT
>
> >         immSelectionObject; /* Selection Object to wait for IMM 
> events */
>
> > diff --git a/src/log/logd/lgs_evt.cc b/src/log/logd/lgs_evt.cc
>
> > index 98ca5f71c..7a0614b43 100644
>
> > --- a/src/log/logd/lgs_evt.cc
>
> > +++ b/src/log/logd/lgs_evt.cc
>
> > @@ -710,8 +710,8 @@ static uint32_t proc_initialize_msg(lgs_cb_t 
> *cb, lgsv_lgs_evt_t *evt) {
>
> >     TRACE_ENTER2("dest %" PRIx64, evt->fr_dest);
>
> >
>
> >     // Client should try again when role changes is in transition.
>
> > -  if (cb->is_quiesced_set) {
>
> > -    TRACE("Log service is in quiesced state");
>
> > +  if (cb->is_quiesced_set || cb->is_terminating_set) {
>
> > +    TRACE("Log service is in quiesced/terminating");
>
> >       ais_rc = SA_AIS_ERR_TRY_AGAIN;
>
> >       goto snd_rsp;
>
> >     }
>
> > @@ -774,8 +774,8 @@ static uint32_t proc_finalize_msg(lgs_cb_t *cb, 
> lgsv_lgs_evt_t *evt) {
>
> >     TRACE_ENTER2("client_id %u", client_id);
>
> >
>
> >     // Client should try again when role changes is in transition.
>
> > -  if (cb->is_quiesced_set) {
>
> > -    TRACE("Log service is in quiesced state");
>
> > +  if (cb->is_quiesced_set || cb->is_terminating_set) {
>
> > +    TRACE("Log service is in quiesced/terminating");
>
> >       ais_rc = SA_AIS_ERR_TRY_AGAIN;
>
> >       goto snd_rsp;
>
> >     }
>
> > @@ -1041,8 +1041,8 @@ static uint32_t proc_stream_open_msg(lgs_cb_t 
> *cb, lgsv_lgs_evt_t *evt) {
>
> > open_sync_param->client_id);
>
> >
>
> >     // Client should try again when role changes is in transition.
>
> > -  if (cb->is_quiesced_set) {
>
> > -    TRACE("Log service is in quiesced state");
>
> > +  if (cb->is_quiesced_set || cb->is_terminating_set) {
>
> > +    TRACE("Log service is in quiesced/terminating");
>
> >       ais_rv = SA_AIS_ERR_TRY_AGAIN;
>
> >       goto snd_rsp;
>
> >     }
>
> > @@ -1199,8 +1199,8 @@ static uint32_t proc_stream_close_msg(lgs_cb_t 
> *cb, lgsv_lgs_evt_t *evt) {
>
> > close_param->lstr_id);
>
> >
>
> >     // Client should try again when role changes is in transition.
>
> > -  if (cb->is_quiesced_set) {
>
> > -    TRACE("Log service is in quiesced state");
>
> > +  if (cb->is_quiesced_set || cb->is_terminating_set) {
>
> > +    TRACE("Log service is in quiesced/terminating");
>
> >       ais_rc = SA_AIS_ERR_TRY_AGAIN;
>
> >       goto snd_rsp;
>
> >     }
>
> > @@ -1299,8 +1299,8 @@ static uint32_t 
> proc_write_log_async_msg(lgs_cb_t *cb, lgsv_lgs_evt_t *evt) {
>
> >                  param->lstr_id, node_name);
>
> >
>
> >     // Client should try again when role changes is in transition.
>
> > -  if (cb->is_quiesced_set) {
>
> > -    TRACE("Log service is in quiesced state");
>
> > +  if (cb->is_quiesced_set || cb->is_terminating_set) {
>
> > +    TRACE("Log service is in quiesced/terminating");
>
> >       error = SA_AIS_ERR_TRY_AGAIN;
>
> >       goto done;
>
> >     }
>
> > @@ -1547,7 +1547,8 @@ void lgs_process_mbx(SYSF_MBX *mbx) {
>
> >       if (lgs_cb->ha_state == SA_AMF_HA_ACTIVE) {
>
> >         if (msg->evt_type <= LGSV_LGS_EVT_LGA_DOWN) {
>
> > lgs_lgsv_top_level_evt_dispatch_tbl[msg->evt_type](msg);
>
> > -      } else if (msg->evt_type == LGSV_EVT_QUIESCED_ACK) {
>
> > +      } else if (msg->evt_type == LGSV_EVT_QUIESCED_ACK &&
>
> > + lgs_cb->is_terminating_set == false) {
>
> > proc_mds_quiesced_ack_msg(msg);
>
> >         } else if (msg->evt_type == LGSV_EVT_NO_OP) {
>
> >           TRACE("Jolted the main thread so it picks up the new IMM FD");
>
> > @@ -1565,5 +1566,7 @@ void lgs_process_mbx(SYSF_MBX *mbx) {
>
> >       }
>
> >
>
> >       lgs_evt_destroy(msg);
>
> > +  } else if (lgs_cb->is_terminating_set) {
>
> > + lgs_cb->is_terminating_set = false;
>
> >     }
>
> >   }
>
> > diff --git a/src/log/logd/lgs_main.cc b/src/log/logd/lgs_main.cc
>
> > index fe2f9a2b8..fe0d31caf 100644
>
> > --- a/src/log/logd/lgs_main.cc
>
> > +++ b/src/log/logd/lgs_main.cc
>
> > @@ -564,6 +564,14 @@ int main(int argc, char *argv[]) {
>
> >       }
>
> >
>
> >       if (fds[FD_TERM].revents & POLLIN) {
>
> > +      // Process all requests in mailbox
>
> > + lgs_cb->is_terminating_set = true;
>
> > +      if (fds[FD_MBX].revents & POLLIN) {
>
> > +        while (lgs_cb->is_terminating_set) {
>
> > + lgs_process_mbx(&lgs_mbx);
>
> > +        }
>
> > +      }
>
> > +
>
> >         daemon_exit();
>
> >       }
>
> >
>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to