- **Type**: defect --> enhancement
---
**[tickets:#253] mds: 1.5 sec wait added in RSP send causes problems in MDS
clients**
**Status:** assigned
**Milestone:** future
**Created:** Thu May 16, 2013 06:54 AM UTC by A V Mahesh (AVM)
**Last Updated:** Thu May 16, 2013 06:56 AM UTC
**Owner:** A V Mahesh (AVM)
from http://devel.opensaf.org/ticket/2825
The single-threaded LOG server stalled waiting for the file system for longer
than 10 sec, which is the sync timeout in the LOG library. This causes LOG
clients (e.g. the NTF server) to time out and retry, which creates a backlog of
outdated messages in the LOG server mailbox. When those are eventually handled,
the 1.5 sec wait in MDS is added to each RSP send. Therefore the LOG server
never catches up with the messages received in its mailbox.
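A rough back-of-the-envelope check of the claim above, assuming (as the report states) that the 1.5 sec is added to every RSP send and that all other processing time is negligible. The function and macro names are made up for this illustration and do not exist in OpenSAF:

```c
#include <assert.h>

/* Both values come from the ticket text; the model around them is an
 * assumption made only for this illustration. */
#define RSP_EXTRA_DELAY_MS 1500   /* wait added to each RSP send */
#define CLIENT_SYNC_TMO_MS 10000  /* sync timeout in the LOG library */

/* Time needed to answer a backlog of n requests when each RSP send costs
 * delay_ms and everything else is treated as free. */
static long drain_time_ms(long n, long delay_ms)
{
    assert(n >= 0 && delay_ms > 0);
    return n * delay_ms;
}

/* Largest backlog whose last request is still answered within timeout_ms. */
static long max_backlog_in_time(long delay_ms, long timeout_ms)
{
    assert(delay_ms > 0 && timeout_ms >= 0);
    return timeout_ms / delay_ms;
}
```

Under these assumptions, max_backlog_in_time(RSP_EXTRA_DELAY_MS, CLIENT_SYNC_TMO_MS) is 6: six queued requests are answered after 9 sec, but the seventh is answered after 10.5 sec, by which time its client has already timed out and queued a retry. If the 1.5 sec applies only in some cases, the numbers change accordingly.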
The change made in #2611 introduced an unacceptable hidden delay when sending
messages, which can have consequences for any client with soft real-time
requirements, for example AMF healthcheck (HC) timeouts.
References:
http://devel.opensaf.org/ticket/2611
http://list.opensaf.org/pipermail/devel/2012-April/022254.html
Workaround:
LOG server throws away "rotten" messages that are older than 10 sec.
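A minimal sketch of what such a workaround could look like, assuming each mailbox entry carries a monotonic receive timestamp. The struct and function names are hypothetical and are not taken from the LOG server sources:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical mailbox entry; the real LOG server message type differs. */
struct lgs_msg_sketch {
    struct timespec received_at;  /* monotonic timestamp taken at enqueue */
    int request_id;
};

#define STALE_AFTER_SEC 10  /* the LOG library sync timeout from the ticket */

/* True if the message has been waiting for more than STALE_AFTER_SEC. */
static bool msg_is_stale(const struct lgs_msg_sketch *msg,
                         const struct timespec *now)
{
    assert(msg != NULL && now != NULL);
    time_t sec = now->tv_sec - msg->received_at.tv_sec;
    long nsec = now->tv_nsec - msg->received_at.tv_nsec;
    if (nsec < 0) {  /* borrow one second to normalise the difference */
        sec -= 1;
        nsec += 1000000000L;
    }
    return sec > STALE_AFTER_SEC || (sec == STALE_AFTER_SEC && nsec > 0);
}

/* Compacts the array in place, keeping only fresh messages in their
 * original order. Returns the number of messages kept. */
static size_t drop_stale(struct lgs_msg_sketch *msgs, size_t n,
                         const struct timespec *now)
{
    assert(msgs != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (!msg_is_stale(&msgs[i], now))
            msgs[kept++] = msgs[i];
    }
    return kept;
}
```

The idea is that a request older than the client's sync timeout has already been abandoned and retried by the client, so answering it only costs time. Whether the actual workaround checks age at dequeue time or in a separate sweep is not stated in the ticket.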
Proposed long-term solution:
MDS should buffer incoming data messages until the corresponding SVC up message
is received and potentially delivered to the client.
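The buffering idea could be sketched roughly as follows. This is only an illustration of the ordering rule (hold data until SVC UP for the sender has been seen, then deliver in arrival order); the types, the fixed-size buffer and the callback are assumptions and do not reflect MDS internals:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PENDING_MAX 8  /* arbitrary cap chosen for the sketch */

/* Hypothetical per-sender state; not an MDS data structure. */
struct peer_sketch {
    bool svc_up_seen;
    int pending[PENDING_MAX];  /* ids of messages held back */
    size_t pending_count;
};

typedef void (*deliver_fn)(int msg_id, void *ctx);

/* Example callback that records what reached the client, in order. */
struct delivery_log {
    int ids[2 * PENDING_MAX];
    size_t count;
};

static void record_delivery(int msg_id, void *ctx)
{
    struct delivery_log *log = ctx;
    if (log->count < 2 * PENDING_MAX)
        log->ids[log->count++] = msg_id;
}

/* Data message arrives: deliver at once if SVC UP was already seen,
 * otherwise hold it back. Returns false if the buffer is full. */
static bool on_data(struct peer_sketch *p, int msg_id,
                    deliver_fn deliver, void *ctx)
{
    assert(p != NULL && deliver != NULL);
    if (p->svc_up_seen) {
        deliver(msg_id, ctx);
        return true;
    }
    if (p->pending_count == PENDING_MAX)
        return false;
    p->pending[p->pending_count++] = msg_id;
    return true;
}

/* SVC UP arrives: mark the sender as known and flush in arrival order. */
static void on_svc_up(struct peer_sketch *p, deliver_fn deliver, void *ctx)
{
    assert(p != NULL && deliver != NULL);
    p->svc_up_seen = true;
    for (size_t i = 0; i < p->pending_count; i++)
        deliver(p->pending[i], ctx);
    p->pending_count = 0;
}
```

With this ordering the client never sees a request from a sender it cannot yet reply to, so the sending path would not need to wait for SVC UP. What should happen when the buffer fills, or when SVC UP never arrives, is left open here, as it is in the ticket.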
Replying to hafe:
> The single-threaded LOG server stalled waiting for the file system for longer
> than 10 sec, which is the sync timeout in the LOG library. This causes LOG
> clients (e.g. the NTF server) to time out and retry.
The LOG service, or any other service (like dtsv) that does disk I/O, is prone
to these situations.
> This creates a backlog of outdated messages in the LOG server mailbox. When
> those are eventually handled, the 1.5 sec wait in MDS is added to each RSP
> send. Therefore the LOG server never catches up with the messages received in
> its mailbox.
This is a case of a "slow receiver". More in the next comment.
> The change made in #2611 introduced an unacceptable hidden delay when sending
> messages, which can have consequences for any client with soft real-time
> requirements, for example AMF healthcheck (HC) timeouts.
I don't think that change (in MDS) can 'directly and always' result in making
LOG a 'slow transmitter', because I believe the 1.5 seconds applies only when
the MDS client starts up, for example during a node bootup.
Having said that, services that depend on responses from external resources
(modules), like disk I/O in this case, should generally be tuned to have larger
healthcheck timeouts.
Surya, could you please comment on Hans' theory on the 1.5 seconds?
> References:
> http://devel.opensaf.org/ticket/2611
> http://list.opensaf.org/pipermail/devel/2012-April/022254.html
> Workaround:
> LOG server throws away "rotten" messages that are older than 10 sec.
> Proposed long-term solution:
> MDS should buffer incoming data messages until the corresponding SVC up
> message is received and potentially delivered to the client.
Changed 8 months ago by mathi
I mean, if we try to formulate and understand the problem:
If the problem is healthcheck timeouts, we should do the following:
• increase the timeout for the healthcheck, and
• if necessary, introduce a separate healthcheck thread.
If the problem is about clients receiving retries, then these situations would
typically occur when the shared filesystem is or was undergoing a role change,
or is in the middle of some heavy sync operation, etc. In such situations,
returning TRY_AGAIN is a genuine way of handling them (typically these
situations can occur only during an upgrade kind of scenario that might involve
a role change, or when there is some fault at the disk level, and not during
normal lifecycle when the healthchecks.)
If the problem is a timeout caused by slow processing, then we could think of
introducing some protocol between the LGA and LGS to relieve the congestion. I
am tending to think along these lines: the end solution may involve LGA, LGS or
even MDS, but I think the problems being described here would have occurred
even without #2611, and as such #2611 cannot contribute much to the problem
being formulated in this ticket.
Having said that, throwing away older messages shouldn't be a problem, but I'm
trying to understand how that could improve the situation...
Changed 7 months ago by nagendra
- owner changed from surya to nagendra
- status changed from new to accepted
Changed 7 months ago by nagendra
- owner changed from nagendra to surya
- status changed from accepted to assigned
Changed 7 months ago by surya
- status changed from assigned to accepted
Changed 7 months ago by surya
- patch_waiting changed from no to yes
Changed 7 months ago by mahesh
Steps to test:
1) Pause the osaflogd process (# kill -STOP <osaflogd PID>)
2) Write to the system stream using the saflogger tool
(# /usr/local/bin/saflogger -y "Out of order test")
3) Wait for saflogger to report saLogInitialize FAILED
4) Continue the stopped osaflogd process (# kill -CONT <osaflogd PID>)
5) Observe the mds_mdtm_query_dest_tipc() logs (the current patch does not have
a log, so some syslog output needs to be added in mds_mdtm_query_dest_tipc())