[
https://issues.apache.org/jira/browse/SENTRY-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vadim Spector updated SENTRY-1866:
----------------------------------
Description:
Motivation: one can think of several, but the immediate ones are:
a) logging Sentry server unavailability on the client side. With multiple active
connections to a Sentry server, logging each failed RPC call (currently at DEBUG
level) to the same Sentry server that went down can be redundant and far too
verbose. It can also be misleading, because there is no guaranteed link between
when a connection was established and when an attempt to use it fails, so we
may end up reporting failures of stale connections.
b) enabling optimization of connection pooling. A ping RPC call would most
likely fail due to server unavailability (crash, restart, etc.), so the server
can be temporarily marked as unavailable, and no new connection attempts would
be made within some configurable time interval (say, 1 second).
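For illustration, a minimal sketch of the back-off idea in b), assuming a
hypothetical per-server tracker; none of these class or method names exist in
Sentry today:
{code:java}
// Hypothetical sketch: track per-server availability based on ping results.
import java.util.concurrent.ConcurrentHashMap;

public class ServerAvailabilityTracker {
  private final long retryIntervalMs; // configurable interval, e.g. 1000 ms
  private final ConcurrentHashMap<String, Long> unavailableSince =
      new ConcurrentHashMap<>();

  public ServerAvailabilityTracker(long retryIntervalMs) {
    this.retryIntervalMs = retryIntervalMs;
  }

  // Called when a ping RPC to the given server fails.
  public void markUnavailable(String server) {
    unavailableSince.put(server, System.currentTimeMillis());
  }

  // The connection pool would consult this before dialing the server again.
  public boolean mayAttemptConnection(String server) {
    Long since = unavailableSince.get(server);
    return since == null
        || System.currentTimeMillis() - since >= retryIntervalMs;
  }
}
{code}
The pool would call markUnavailable() when a ping fails and check
mayAttemptConnection() before opening a new connection to the same server.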
was:
Motivation: one can think of several, but the immediate ones are:
a) logging Sentry server unavailability on the client side. With multiple active
connections to a Sentry server, logging each failed RPC call (currently at DEBUG
level) to the same Sentry server that went down can be redundant and far too
verbose. It can also be misleading, because there is no guaranteed link between
when a connection was established and when an attempt to use it fails, so we
may end up reporting failures of stale connections.
Sentry HA-specific: when the Sentry client fails over from one Sentry server to
another, it does not log that it has done so. Such a client should print a
simple, clear INFO-level message whenever it fails over from one Sentry server
to another.
Design considerations:
"Sentry client" stands for a specific class instance capable of connecting to a
specific Sentry server instance from some app (usually another Hadoop service).
In HA scenario, Sentry client relies on connection pooling (SentryTransportPool
class) to select one of several available configured Sentry server instances.
Whenever connection fails, Sentry client simply asks SentryTransportPool to a)
invalidate this specific connection and b) get another connection instead.
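A simplified sketch of this invalidate-and-retry flow; the pool interface and
its method names below are assumptions made for the sketch, not necessarily the
actual SentryTransportPool API:
{code:java}
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

// Stand-in for the real SentryTransportPool; methods are illustrative only.
interface SentryTransportPool {
  TTransport getTransport() throws TTransportException;
  void invalidateTransport(TTransport transport);
}

interface RpcCall<T> {
  T invoke(TTransport transport) throws TTransportException;
}

class FailoverExample {
  static <T> T callWithFailover(SentryTransportPool pool, RpcCall<T> call)
      throws TTransportException {
    TTransport transport = pool.getTransport(); // pick any configured server
    try {
      return call.invoke(transport);
    } catch (TTransportException e) {
      pool.invalidateTransport(transport);      // a) invalidate this connection
      transport = pool.getTransport();          // b) get another connection
      return call.invoke(transport);            // single retry, for illustration
    }
  }
}
{code}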
There is no monitoring of Sentry server liveness per se. Each Sentry client
finds out about a failure independently, and only at the moment it tries to use
a connection. Thus there may be no particular correlation between the time a
connection failure is discovered and the time the Sentry server actually
becomes unavailable. E.g., a client can discover the failure of an old
connection long after the Sentry server crashed and was restarted (and maybe
restarted more than once!).
Intuitively, one would like to have a single log entry per Sentry server
crash/shutdown; but for the reasons above, it seems difficult, if not
impossible, to group connections by the Sentry server instance(s) that were
running when those connections were initiated. It may therefore be challenging
to say whether multiple connection failures have to do with "the same" Sentry
server instance going down, which makes it difficult to report exactly one
connection failure per Sentry server shutdown/crash event.
Yet the desire to have visibility into such events in the field is
understandable. At the same time, if we simply logged every connection failure,
the logging could be massive - there may be many concurrent connections to
Sentry server(s) from the same app - and therefore less than useful.
The solution will therefore have to rely on somewhat imperfect rules to contain
the number of connection failure logs. The alternative of introducing periodic
pinging of the Sentry server and logging only ping failures would be possible
as well (and if the Sentry server responded to pings with a server-id
initialized to the server's start timestamp, the instance-attribution problem
would be solved entirely - see the sketch below), but it requires more radical
changes.
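To show why a start-timestamp server-id would help, here is a hypothetical
client-side check; no such ping response field exists in the Sentry Thrift API
today. Two pings returning different start timestamps prove the server
restarted in between, so failures of connections predating the current instance
can be grouped or discarded:
{code:java}
// Hypothetical illustration of the server-id-as-start-timestamp idea.
public class ServerInstanceTracker {
  private long knownStartTimestamp = -1; // -1 means "never pinged yet"

  // Returns true if the server has restarted since the previous ping.
  public synchronized boolean recordPing(long serverStartTimestamp) {
    boolean restarted = knownStartTimestamp != -1
        && serverStartTimestamp != knownStartTimestamp;
    knownStartTimestamp = serverStartTimestamp;
    return restarted;
  }
}
{code}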
The simplest solution seems to be as follows: since recovery of a failed
Sentry server is likely to take some time, we do not need to be too clever; it
may be enough to report each connection failure to a given Sentry instance no
more often than once every N (configurable) seconds. If one connection failure
to Sentry server instance X has been reported, another one won't be reported
before N seconds expire. This keeps the number of connection failure messages
at bay (see the sketch below). Such logs may still be confusing if a client
attempts to use an old connection from a previous server instance after some
idle period, long after the problem has been fixed, but this is arguably still
better than nothing.
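A minimal sketch of this once-per-N-seconds rule, with illustrative class and
method names:
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs at most one connection failure per server every intervalMs milliseconds.
public class ThrottledFailureLogger {
  private static final Logger LOG =
      LoggerFactory.getLogger(ThrottledFailureLogger.class);
  private final long intervalMs; // the configurable N, in milliseconds
  private final ConcurrentHashMap<String, AtomicLong> lastLogged =
      new ConcurrentHashMap<>();

  public ThrottledFailureLogger(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  public void logFailure(String server, Throwable cause) {
    long now = System.currentTimeMillis();
    AtomicLong last = lastLogged.computeIfAbsent(server, s -> new AtomicLong(0L));
    long prev = last.get();
    // Log only if N ms have elapsed and we win the CAS among concurrent callers.
    if (now - prev >= intervalMs && last.compareAndSet(prev, now)) {
      LOG.info("Connection to Sentry server {} failed: {}", server,
          cause.toString());
    }
  }
}
{code}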
Alternative suggestions are welcome.
> Add ping Thrift APIs for Sentry services
> ----------------------------------------
>
> Key: SENTRY-1866
> URL: https://issues.apache.org/jira/browse/SENTRY-1866
> Project: Sentry
> Issue Type: Improvement
> Reporter: Vadim Spector
>