- **Type**: defect --> enhancement


---

**[tickets:#253] mds : 1.5 sec wait added in RSP send causes problems in MDS clients**

**Status:** assigned
**Milestone:** future
**Created:** Thu May 16, 2013 06:54 AM UTC by A V Mahesh (AVM)
**Last Updated:** Thu May 16, 2013 06:56 AM UTC
**Owner:** A V Mahesh (AVM)

from http://devel.opensaf.org/ticket/2825


The single-threaded LOG server stalled waiting for the file system for longer 
than 10 sec, which is the sync timeout in the LOG library. This causes LOG 
clients (e.g. the NTF server) to time out and retry, which creates a backlog of 
outdated messages in the LOG server mailbox. When those are eventually handled, 
the 1.5 sec in MDS is added to each RSP send. Therefore the LOG server never 
catches up with the received messages in the mailbox. 
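
The "never catches up" effect can be illustrated with a rough back-of-the-envelope model. This is only a sketch under stated assumptions: the number of clients, the idea that each client submits one request (new or retried) per 10 sec timeout period, and the function name `backlog_after` are all hypothetical and not taken from OpenSAF code. Only the 1.5 sec per RSP send and the 10 sec sync timeout come from the ticket.

```python
def backlog_after(rounds, clients, rsp_delay=1.5, sync_timeout=10.0,
                  initial_backlog=0):
    """Return the mailbox depth at the end of each timeout period.

    Assumed model (hypothetical): each client submits one request, new or
    retried, per sync_timeout period, and the server spends rsp_delay
    seconds on every response and does nothing else.
    """
    # Number of responses the server can send within one timeout period.
    served_per_round = int(sync_timeout // rsp_delay)
    backlog = initial_backlog
    history = []
    for _ in range(rounds):
        backlog += clients                         # requests arriving this period
        backlog -= min(backlog, served_per_round)  # requests answered this period
        history.append(backlog)
    return history
```

Under this model the server answers at most 6 requests per 10 sec period, so an existing backlog only shrinks while fewer than 6 requests arrive per period. With 6 or more, it stays constant or grows.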


The change made in #2611 introduced an unacceptable hidden delay when sending 
messages, which can have consequences for any client with soft real-time 
requirements, for example AMF HC timeouts.


References:
http://devel.opensaf.org/ticket/2611
http://list.opensaf.org/pipermail/devel/2012-April/022254.html


Workaround:
LOG server throws away "rotten" messages that are older than 10 sec.
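
A minimal sketch of what such a workaround could look like, assuming each mailbox entry carries its receive timestamp and the mailbox is ordered oldest first. The names `drop_rotten` and `STALE_AFTER_SEC` and the tuple layout are hypothetical. The actual LOG server is written in C and may implement this differently.

```python
import time
from collections import deque

STALE_AFTER_SEC = 10.0  # the 10 sec sync timeout mentioned in the ticket


def drop_rotten(mailbox, now=None, max_age=STALE_AFTER_SEC):
    """Drop messages older than max_age from the head of the mailbox.

    mailbox is a deque of (received_at, payload) tuples, oldest first
    (an assumed layout). Returns the number of dropped messages.
    """
    if now is None:
        now = time.monotonic()
    dropped = 0
    # The sender has already timed out on these, so a reply would be wasted.
    while mailbox and now - mailbox[0][0] > max_age:
        mailbox.popleft()
        dropped += 1
    return dropped
```

Calling this before each receive loop iteration means the server skips the RSP send, and any delay attached to it, for requests whose sender has already given up.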


Proposed long term solution:
MDS should buffer incoming data messages until the corresponding SVC up message 
is received and potentially delivered to the client.
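
The buffering idea can be sketched as follows. This is an illustration of the proposal only, not MDS code: the class `SvcUpGate`, its method names, and the `deliver` callback are hypothetical, and the real MDS data structures are not shown in this ticket.

```python
from collections import defaultdict


class SvcUpGate:
    """Hold data messages from a sender until its SVC UP has been seen."""

    def __init__(self, deliver):
        self._deliver = deliver            # callback taking (sender, message)
        self._up = set()                   # senders whose SVC UP was received
        self._pending = defaultdict(list)  # sender -> messages in arrival order

    def on_data(self, sender, message):
        if sender in self._up:
            self._deliver(sender, message)
        else:
            # SVC UP not seen yet: buffer instead of delivering early.
            self._pending[sender].append(message)

    def on_svc_up(self, sender):
        self._up.add(sender)
        # Flush buffered messages in the order they arrived.
        for message in self._pending.pop(sender, []):
            self._deliver(sender, message)
```

With this ordering, the receiver already knows the sender when it handles the request, so the response path would not need to wait for the SVC UP. A real implementation would also need a bound on the buffer and cleanup when a sender goes down, which this sketch omits.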


Replying to hafe:

> The single-threaded LOG server stalled waiting for the file system for longer 
> than 10 sec, which is the sync timeout in the LOG library. This causes LOG 
> clients (e.g. the NTF server) to time out and retry.

The LOG service, or any other service (like dtsv) that does disk I/O, is prone 
to these situations.

> This creates a backlog of outdated messages in the LOG server mailbox. When 
> those are eventually handled, the 1.5 sec in MDS is added to each RSP send. 
> Therefore the LOG server never catches up with the received messages in the 
> mailbox.

This is a case of a "slow receiver". More in the next comment.

> The change made in #2611 introduced an unacceptable hidden delay when sending 
> messages, which can have consequences for any client with soft real-time 
> requirements, for example AMF HC timeouts.

I don't think that change (in MDS) can 'directly and always' result in making 
LOG a 'slow transmitter', because the 1.5 seconds, I believe, applies only when 
the MDS client starts up, such as during a node bootup.

Having said that, services that depend on responses from external resources 
(modules), like disk I/O in this case, should be tuned to have generally larger 
healthcheck timeouts.

Surya, could you please comment on Hans' theory on the 1.5 seconds?





Changed 8 months ago by mathi

I mean, if we try to formulate and understand the problem....


If the problem is healthcheck timeouts, we should do the following:

- increase the timeout for the healthcheck, and
- if necessary, introduce a separate healthcheck thread.

If the problem is about clients receiving retries, then these situations would 
typically occur when the shared filesystem is or was undergoing a role change, 
or is in the process of some heavy sync operation, etc. In such situations, 
returning TRY_AGAIN is a genuine way of handling them (typically these 
situations can occur only during an upgrade kind of scenario that might involve 
a role change, or when there is some fault at the disk level, and not during 
normal lifecycle when the healthchecks.)


If the problem is a timeout caused by slow processing, then we could think of 
introducing some protocol between the LGA and LGS to relieve the congestion. I 
am tending to think along this angle: the end solution may involve LGA, LGS or 
even MDS, but I think the problems being described here would have occurred 
even without 2611, and as such 2611 cannot contribute much to the problem being 
formulated in this ticket.


Having said that, throwing away older messages shouldn't be a problem, but I'm 
trying to understand how that could improve the situation...


Changed 7 months ago by nagendra

- owner changed from surya to nagendra
- status changed from new to accepted

Changed 7 months ago by nagendra

- owner changed from nagendra to surya
- status changed from accepted to assigned

Changed 7 months ago by surya

- status changed from assigned to accepted

Changed 7 months ago by surya

- patch_waiting changed from no to yes

Changed 7 months ago by mahesh

Steps to test:

1) Pause the osaflogd process (# kill -STOP <osaflogd PID>)
2) Write to the system stream using the saflogger tool 
   (# /usr/local/bin/saflogger -y "Out of ourder test")
3) Wait until saflogger reports saLogInitialize FAILED
4) Continue the stopped osaflogd process (# kill -CONT <osaflogd PID>)
5) Observe the mds_mdtm_query_dest_tipc() logs (the current patch does not have 
   a log, so some syslog needs to be added in mds_mdtm_query_dest_tipc())





