[ 
https://issues.apache.org/jira/browse/SENTRY-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319585#comment-15319585
 ] 

Colin Patrick McCabe edited comment on SENTRY-872 at 6/7/16 10:36 PM:
----------------------------------------------------------------------

Thanks for checking out the design doc!

Just to be clear, HIVE-7973 has been committed to Hive already.  In fact, it is 
already a part of Cloudera's CDH5.8 distribution.  While it's true that there 
are a few open subtasks remaining on the upstream JIRA, the same could be said 
for almost any Hadoop feature.  We always have plans to improve things :)  We 
are planning on using HIVE-7973 for other things besides Sentry HA.  For 
example, it is useful for replicating the Hive database.  That code will 
receive additional testing and attention because of those other uses.  When 
using HIVE-7973, it doesn't matter which HMS process we talk to, because all 
of them have access to the notification log stored in SQL.  This allows us to 
see what is going on in Hive, and exactly what order it occurred in, even when 
there are multiple HMS processes involved, which is something we currently 
cannot do.
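
To make the ordering point concrete, here is a minimal sketch of a shared 
notification log.  It uses an in-memory SQLite database as a stand-in for the 
HMS backing database, and the table name, column names, and helper functions 
are assumptions made up for illustration, not the actual HIVE-7973 schema or 
API.  The point is only that a single monotonically increasing event ID gives 
every reader the same total order, no matter which HMS process wrote the 
event.

```python
import sqlite3


def make_log():
    """Create a hypothetical notification log (illustrative schema only)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE notification_log ("
        " event_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " source TEXT NOT NULL,"
        " message TEXT NOT NULL)"
    )
    return db


def append_event(db, source, message):
    """Append one event; the database assigns the next event_id."""
    with db:  # commits on success
        cur = db.execute(
            "INSERT INTO notification_log (source, message) VALUES (?, ?)",
            (source, message),
        )
    return cur.lastrowid


def read_since(db, last_event_id, limit=100):
    """Return events newer than last_event_id, in event_id order."""
    rows = db.execute(
        "SELECT event_id, source, message FROM notification_log"
        " WHERE event_id > ? ORDER BY event_id LIMIT ?",
        (last_event_id, limit),
    )
    return rows.fetchall()


# Two HMS processes (names are hypothetical) write to the same log.
db = make_log()
append_event(db, "hms-1", "CREATE_TABLE db1.t1")
append_event(db, "hms-2", "ALTER_TABLE db1.t1")
append_event(db, "hms-1", "DROP_TABLE db1.t1")

# A reader polls in batches and remembers the last event ID it saw.
first_batch = read_since(db, 0, limit=2)
last_seen = first_batch[-1][0]
second_batch = read_since(db, last_seen)
```

The reader only needs to remember one number (`last_seen`) to resume where it 
left off.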

With an active/active design, all the Sentry daemons would have to request 
updates (or be sent updates) from the HMS.  This is inefficient because it 
multiplies the RPC load on the HMS service.  It is especially inefficient if we 
have three Sentry daemons (for extra redundancy).  It also opens the door to 
divergence between Sentry daemons, because some of the daemons might receive 
updates from HMS earlier or later due to network conditions.  If we are 
persisting the HMS updates in the Sentry SQL database, we must somehow choose 
which Sentry daemon does the persisting.  They can't all do it, because their 
updates would conflict.  Choosing one Sentry daemon to do the persistence is 
essentially equivalent to choosing a master.
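
As a rough illustration of why "pick one daemon to persist" amounts to 
electing a master, here is a sketch of a lease row in a SQL table.  This is 
not how the design doc proposes to do it; the `writer_lease` table, the 
`try_acquire` helper, and the daemon names are all hypothetical, and SQLite 
stands in for the Sentry database.  Whichever daemon holds the unexpired lease 
is the only one allowed to persist HMS updates.

```python
import sqlite3


def make_db():
    """Create a single-row lease table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE writer_lease ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " holder TEXT,"
        " expires_at REAL NOT NULL)"
    )
    db.execute("INSERT INTO writer_lease VALUES (1, NULL, 0)")
    db.commit()
    return db


def try_acquire(db, daemon, now, ttl=10.0):
    """Take or renew the lease; succeed only if we hold it or it expired."""
    with db:
        cur = db.execute(
            "UPDATE writer_lease SET holder = ?, expires_at = ?"
            " WHERE id = 1 AND (holder = ? OR expires_at <= ?)",
            (daemon, now + ttl, daemon, now),
        )
    # Exactly one row is updated when the lease was granted.
    return cur.rowcount == 1


db = make_db()

# All three daemons race for the lease at the same (simulated) time.
results = [try_acquire(db, d, now=100.0)
           for d in ("sentry-1", "sentry-2", "sentry-3")]

# The holder renews; later it stops renewing and another daemon takes over.
renewed = try_acquire(db, "sentry-1", now=105.0)
takeover = try_acquire(db, "sentry-2", now=120.0)
old_holder_retry = try_acquire(db, "sentry-1", now=121.0)
```

Only the lease holder writes, so the updates cannot conflict, and the lease 
holder is a master in everything but name.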

The update log is useful for more than just implementing HA.  It can be used as 
a generalized mechanism for synchronizing a cache.  For example, the HDFS 
plugin can read the update log and apply its updates to keep the cache 
maintained in the NameNode process in sync with what is going on in Sentry.  
This is better than the current mechanism of buffering "deltas" in memory in 
the Sentry daemon.  The delta mechanism requires lots of heap memory, whereas 
the update log mechanism does not.  Because the update log is stored in the SQL 
database, the HDFS plugin will be able to continue requesting update log 
entries even if the Sentry service is restarted or has a failover.  In 
contrast, the deltas buffered in memory will be lost if either of those events 
occurs.  So in conclusion I would say that we do agree that Sentry should move 
towards becoming stateless, and we view this design as a stepping stone towards 
that.
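
Here is a small sketch of the difference between the two mechanisms.  The 
class names, the `update_log` schema, and the paths and roles are invented for 
illustration, and SQLite stands in for the Sentry database.  The service keeps 
both an in-memory delta buffer and a persistent update log; the plugin-side 
cache tracks only the last sequence number it applied.

```python
import sqlite3


def open_store():
    """Create a hypothetical persistent update log."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE update_log ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " path TEXT NOT NULL,"
        " role TEXT NOT NULL,"
        " granted INTEGER NOT NULL)"
    )
    return db


class Service:
    """Stand-in for a Sentry daemon; state outside the database is volatile."""

    def __init__(self, db):
        self.db = db
        self.deltas = []  # in-memory buffer, lost on restart or failover

    def record(self, path, role, granted):
        with self.db:
            self.db.execute(
                "INSERT INTO update_log (path, role, granted) VALUES (?, ?, ?)",
                (path, role, int(granted)),
            )
        self.deltas.append((path, role, granted))

    def updates_since(self, seq):
        return self.db.execute(
            "SELECT seq, path, role, granted FROM update_log"
            " WHERE seq > ? ORDER BY seq",
            (seq,),
        ).fetchall()


class PluginCache:
    """Stand-in for the cache kept by the HDFS plugin."""

    def __init__(self):
        self.last_seq = 0
        self.grants = set()

    def sync(self, service):
        for seq, path, role, granted in service.updates_since(self.last_seq):
            if granted:
                self.grants.add((path, role))
            else:
                self.grants.discard((path, role))
            self.last_seq = seq


db = open_store()
svc = Service(db)
cache = PluginCache()

svc.record("/warehouse/t1", "analyst", True)
cache.sync(svc)

svc.record("/warehouse/t2", "etl", True)
svc.record("/warehouse/t1", "analyst", False)

# Simulate a restart: the new process has an empty delta buffer,
# but the update log in the database is still there.
svc = Service(db)
deltas_after_restart = list(svc.deltas)
cache.sync(svc)
```

After the simulated restart the delta buffer is empty, but the cache still 
catches up from the log using nothing more than its saved sequence number.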



> Uber jira for HMS HA + Sentry HA redesign
> -----------------------------------------
>
>                 Key: SENTRY-872
>                 URL: https://issues.apache.org/jira/browse/SENTRY-872
>             Project: Sentry
>          Issue Type: Improvement
>          Components: Hdfs Plugin
>    Affects Versions: 1.5.0
>            Reporter: Sravya Tirukkovalur
>            Assignee: Sravya Tirukkovalur
>             Fix For: 1.8.0
>
>         Attachments: SENTRY-872.0.patch, SENTRY-872.pdf, SENTRY-872_design.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
