[ https://issues.apache.org/jira/browse/SENTRY-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vadim Spector updated SENTRY-1866:
----------------------------------
    Description: 
Motivations: there are several, but the immediate one is:

Sentry HA-specific: when the Sentry client fails over from one Sentry server to 
another, it does not print a message saying that it has done so. Have such a 
client print a simple, clear INFO-level message whenever it fails over from one 
Sentry server to another.

Design considerations:

"Sentry client" stands for a specific class instance capable of connecting to a 
specific Sentry server instance from some app (usually another Hadoop service). 
In HA scenario, Sentry client relies on connection pooling (SentryTransportPool 
class) to select one of several available configured Sentry server instances. 
Whenever connection fails, Sentry client simply asks SentryTransportPool to a) 
invalidate this specific connection and b) get another connection instead. 
There is no monitoring of Sentry server liveliness per se. Each Sentry client 
finds out about a failure independently and only at the time of trying to use 
it. Thus there may be no particular correlation between the time of the 
discovery of connection failure and the time Sentry server actually becomes 
unavailable. E.g. a client can discover a failure of the old connection, long 
after Sentry server crushed and then was restarted (and maybe restarted more 
than once!).
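
For illustration, here is a minimal Java sketch of this failover flow, including the requested INFO message. The pool and transport interfaces and all method names below are assumptions made for this sketch only; they are not the actual SentryTransportPool API.

{code:java}
import java.io.IOException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of the failover flow described above; the pool and
// transport interfaces and their method names are assumptions, not the
// actual SentryTransportPool API.
public class FailoverSketch {
  private static final Logger LOGGER = LoggerFactory.getLogger(FailoverSketch.class);

  /** Stand-in for a pooled connection to one Sentry server instance. */
  interface PooledTransport {
    String serverEndpoint();                     // e.g. "host1:8038" (hypothetical)
    void send(byte[] request) throws IOException;
  }

  /** Stand-in for the connection pool (SentryTransportPool plays this role). */
  interface TransportPool {
    PooledTransport getTransport() throws IOException;
    void invalidateTransport(PooledTransport transport);
  }

  private final TransportPool pool;

  FailoverSketch(TransportPool pool) {
    this.pool = pool;
  }

  /** On failure: a) invalidate the connection, b) get another one and retry.
      Assumes maxAttempts >= 1. */
  void invoke(byte[] request, int maxAttempts) throws IOException {
    IOException lastFailure = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      PooledTransport transport = pool.getTransport();
      try {
        transport.send(request);
        return;                                  // success, nothing to log
      } catch (IOException e) {
        lastFailure = e;
        pool.invalidateTransport(transport);     // a) drop the failed connection
        // The requested INFO message: make the failover visible in the client log.
        LOGGER.info("Failing over from Sentry server {}; retrying on another instance",
            transport.serverEndpoint(), e);
        // b) the next iteration asks the pool for a different connection
      }
    }
    throw lastFailure;
  }
}
{code}

The point is that the pool simply hands out whichever connection it has available; the client learns nothing about the failed server beyond the endpoint it was talking to.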

Intuitively, one would like to have a single log message per Sentry server 
crash/shutdown; but for the reasons above it seems difficult, if not 
impossible, to group connections by the Sentry server instance(s) that were 
live when those connections were initiated. It may therefore be hard to tell 
whether multiple connection failures relate to "the same" Sentry server 
instance going down, which makes it difficult to report exactly one connection 
failure per Sentry server shutdown/crash event.

Yet the desire to have visibility into such events in the field is 
understandable. At the same time, if we simply log every connection failure, 
the logging can be massive - there may be many concurrent connections to 
Sentry server(s) from the same app - and therefore less than useful.

The solution will have to rely on somewhat imperfect rules to keep the number 
of connection failure logs contained. The alternative of introducing periodic 
pinging of the Sentry server and logging only ping failures would be possible 
as well (and it would be ideal if the Sentry server responded to pings with a 
server-id initialized to the server's start timestamp - this would solve the 
problem completely), but it requires more radical changes.
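
For concreteness, here is a rough sketch of what a client-side ping monitor could look like, assuming a hypothetical ping() call that returns a server-id initialized to the server's start timestamp. No such Thrift API exists today; adding one is what this ticket proposes. Only ping failures and detected restarts would be logged.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the ping-based alternative. The ping() call and its
// "server-id = start timestamp" response are hypothetical; no such
// Thrift API exists yet - adding it is what this ticket proposes.
public class SentryPingMonitor {
  private static final Logger LOGGER = LoggerFactory.getLogger(SentryPingMonitor.class);

  /** Hypothetical client handle for a single configured Sentry server instance. */
  interface PingClient {
    String endpoint();
    /** Returns the server id, initialized to the server's start timestamp. */
    long ping() throws Exception;
  }

  private final Map<String, Long> lastSeenServerId = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Ping every configured server periodically; log only failures and restarts. */
  public void start(Iterable<PingClient> servers, long periodSeconds) {
    scheduler.scheduleAtFixedRate(() -> {
      for (PingClient server : servers) {
        try {
          long serverId = server.ping();
          Long previous = lastSeenServerId.put(server.endpoint(), serverId);
          if (previous != null && previous != serverId) {
            // The id changed, so the server must have restarted since the last ping.
            LOGGER.info("Sentry server {} restarted (old id {}, new id {})",
                server.endpoint(), previous, serverId);
          }
        } catch (Exception e) {
          LOGGER.info("Ping to Sentry server {} failed", server.endpoint(), e);
        }
      }
    }, 0, periodSeconds, TimeUnit.SECONDS);
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
{code}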

The simplest solution seems to be the following: since recovery of a failed 
Sentry server is likely to take some time, we do not need to be too clever; it 
may be enough to report a connection failure to a given Sentry instance no more 
often than once every N (configurable) seconds. Once one connection failure to 
Sentry server instance X has been reported, another one won't be reported until 
N seconds expire. This will keep the number of connection failure messages at 
bay. Such logs may still be confusing if a client attempts to use an old 
connection to the old server instance after some idle period, long after the 
problem has been fixed, but this is arguably still better than nothing.
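
A minimal sketch of this throttling rule follows, with illustrative class and method names that do not refer to any existing Sentry code: a connection failure for a given server endpoint is logged at most once every N seconds.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the proposed throttling rule. Class and method names are
// illustrative only and do not refer to existing Sentry code.
public class ThrottledFailureLogger {
  private static final Logger LOGGER = LoggerFactory.getLogger(ThrottledFailureLogger.class);

  private final long minIntervalMillis;                  // the configurable N, in millis
  private final Map<String, Long> lastLoggedAt = new ConcurrentHashMap<>();

  public ThrottledFailureLogger(long minIntervalSeconds) {
    this.minIntervalMillis = minIntervalSeconds * 1000L;
  }

  /** Log the failure only if nothing was logged for this server in the last N seconds. */
  public void onConnectionFailure(String serverEndpoint, Throwable cause) {
    long now = System.currentTimeMillis();
    Long last = lastLoggedAt.get(serverEndpoint);
    // Approximate check: a rare duplicate message under a race is acceptable,
    // since the goal is only to bound log volume, not to produce an exact count.
    if (last == null || now - last >= minIntervalMillis) {
      lastLoggedAt.put(serverEndpoint, now);
      LOGGER.info("Connection to Sentry server {} failed; suppressing further failure "
          + "logs for this server for the next {} ms", serverEndpoint, minIntervalMillis, cause);
    }
  }
}
{code}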

Alternative suggestions are welcome.

> Add ping Thrift APIs for Sentry services
> ----------------------------------------
>
>                 Key: SENTRY-1866
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1866
>             Project: Sentry
>          Issue Type: Improvement
>            Reporter: Vadim Spector
>


