[ 
https://issues.apache.org/jira/browse/SENTRY-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124312#comment-16124312
 ] 

Na Li commented on SENTRY-1883:
-------------------------------

Comments from Brian

>
>    - Rather than streaming huge snapshots in a single message we should
>    provide streaming protocol with smaller messages and later reassembly on
>    the HDFS side.
>
[bt] If we are going to keep the current flow, then I think this is
going to be one of the better options.  Multiple calls chunking out the
paths to some optimized number per Thrift call (like 1k), followed by a
"that's all folks" call, would allow the structure to be assembled on the
HDFS side.
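
A minimal sketch of the chunked flow described above, in Python for
brevity. The names PathChunk, chunk_paths and reassemble are hypothetical
and not part of Sentry or Thrift; the sequence number and the final-message
flag are assumptions about how a receiver could detect gaps and the end of
the stream.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class PathChunk:
    """One message in the stream (hypothetical structure)."""
    seq: int          # position of this chunk in the stream, starting at 0
    paths: List[str]  # at most chunk_size paths
    last: bool        # True only on the final "that's all folks" message


def chunk_paths(paths: Iterable[str], chunk_size: int = 1000) -> Iterator[PathChunk]:
    """Split a snapshot into chunks of at most chunk_size paths."""
    batch: List[str] = []
    seq = 0
    for path in paths:
        batch.append(path)
        if len(batch) == chunk_size:
            yield PathChunk(seq, batch, last=False)
            seq += 1
            batch = []
    # The final message carries any leftover paths (possibly none)
    # and tells the receiver that the stream is complete.
    yield PathChunk(seq, batch, last=True)


def reassemble(chunks: Iterable[PathChunk]) -> List[str]:
    """Rebuild the full path list on the receiving side."""
    paths: List[str] = []
    expected = 0
    done = False
    for chunk in chunks:
        if done:
            raise ValueError("chunk received after the final message")
        if chunk.seq != expected:
            raise ValueError(f"expected chunk {expected}, got {chunk.seq}")
        paths.extend(chunk.paths)
        expected += 1
        done = chunk.last
    if not done:
        raise ValueError("stream ended without the final message")
    return paths
```

Each PathChunk would map to one Thrift call; the receiver only swaps in the
new snapshot once reassemble returns without error.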

But I feel that we really don't need to send all of the data over to the
HDFS side immediately.  We could make on-demand calls (maybe on a
per-directory basis) from the HDFS client to Sentry, which would then
populate a cache on the HDFS node side. Updates would still be pushed as
they occur, hence a slow loading of the cache. Yes, this would slow down
initial access on the first call, but since this is for direct
HDFS-managed and HDFS-served paths, that initial slowdown would be fairly
negligible, assuming the Sentry turnaround for the call is fairly quick.
This is essentially what Hive does, without the cache and the updates, I
believe.
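
A rough sketch of the on-demand idea, again in Python. LazyPathCache, the
fetch callback and the "authz objects" payload are all hypothetical
stand-ins; the actual Sentry call and the data it would return are not
specified here.

```python
from typing import Callable, Dict, FrozenSet


class LazyPathCache:
    """Per-directory cache filled on first access (hypothetical sketch)."""

    def __init__(self, fetch: Callable[[str], FrozenSet[str]]):
        # fetch stands in for a remote call to Sentry for one directory.
        self._fetch = fetch
        self._entries: Dict[str, FrozenSet[str]] = {}
        self.misses = 0  # number of remote calls made so far

    def lookup(self, directory: str) -> FrozenSet[str]:
        """Return the cached entry, calling Sentry only on the first access."""
        if directory not in self._entries:
            self.misses += 1
            self._entries[directory] = self._fetch(directory)
        return self._entries[directory]

    def apply_update(self, directory: str, authz_objects: FrozenSet[str]) -> None:
        """Apply a pushed update.

        Only directories already in the cache are refreshed; anything else
        is loaded with current data on its first lookup.
        """
        if directory in self._entries:
            self._entries[directory] = authz_objects
```

The first lookup for a directory pays the round trip to Sentry; later
lookups are served locally until a pushed update replaces the entry.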




>
>    - Most of the information passed are long strings with common
>    prefixes. We should be able to apply simple compression techniques (e.g.
>    prefix compression) or even run a full compression on the data before
>    sending.
>
>
I think if we are considering this, we should really look at passing a true
tree structure instead of trying to compress the data outright.  If it's a
tree structure, each part is only listed once, in its place in the tree.
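
To illustrate the point, here is a small Python sketch that folds a flat
list of paths into a tree so that shared prefixes are stored once. Node,
build_tree, flatten and count_nodes are hypothetical names, and the sketch
assumes absolute, non-root paths separated by "/".

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator


@dataclass
class Node:
    """One path component; terminal marks the end of a listed path."""
    children: Dict[str, "Node"] = field(default_factory=dict)
    terminal: bool = False


def build_tree(paths: Iterable[str]) -> Node:
    """Insert each path component by component, reusing shared prefixes."""
    root = Node()
    for path in paths:
        node = root
        for part in (p for p in path.split("/") if p):
            node = node.children.setdefault(part, Node())
        node.terminal = True
    return root


def flatten(node: Node, prefix: str = "") -> Iterator[str]:
    """Recover the original full paths from the tree."""
    for name, child in node.children.items():
        full = f"{prefix}/{name}"
        if child.terminal:
            yield full
        yield from flatten(child, full)


def count_nodes(node: Node) -> int:
    """Number of path components stored in the tree (root excluded)."""
    return sum(1 + count_nodes(child) for child in node.children.values())
```

For three paths under /user/hive/warehouse/db1, the flat list repeats the
common components in every entry, while the tree holds each one only once.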


>
>    - We should consider using non-thrift data structures for passing the
>    info and just use Thrift as a transport mechanism.
>
I'm not sure why we would break protocol compatibility with something
custom.  I feel we can work around this.  I'm not convinced we can, but I
think this should be a last resort.




> Optimizing Sentry to HDFS protocol
> ----------------------------------
>
>                 Key: SENTRY-1883
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1883
>             Project: Sentry
>          Issue Type: Improvement
>            Reporter: Na Li
>            Priority: Minor
>
> Currently Sentry uses serialized Thrift structures to send a lot of
> information from the Sentry Server to the HDFS namenode plugin for the HDFS
> sync.
> We should think of ways to optimize this protocol in several ways:
>    - Rather than streaming huge snapshots in a single message we should
>    provide streaming protocol with smaller messages and later reassembly on
>    the HDFS side.
>    - Most of the information passed are long strings with common prefixes.
>    We should be able to apply simple compression techniques (e.g. prefix
>    compression) or even run a full compression on the data before sending.
>    - We should consider using non-thrift data structures for passing the
>    info and just use Thrift as a transport mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
