[Thanks, I appreciate your doing that; the announcement itself was cross-posted as outreach]
Thanks Cos. As I see the work currently, I believe most, if not all, of these
will be work against JIRAs in individual projects, similar to the JIRAs posted
here: https://github.com/intel-hadoop/project-rhino. If we get to a point
where some of the future work needs a home outside of the individual projects,
I'm happy to incubate that work in Apache.

~avik

On Mon, Feb 25, 2013 at 4:18 PM, Konstantin Boudnik <c...@apache.org> wrote:
> [yanking away most of the cross-posts...]
>
> An interesting cross-component project, Avik. Any plans to incubate it in
> Apache?
>
> Cos
>
> On Mon, Feb 25, 2013 at 11:46PM, Dey, Avik wrote:
> > Project Rhino
> >
> > As the Apache Hadoop ecosystem extends into new markets and sees new use
> > cases with security and compliance challenges, the benefits of processing
> > sensitive and legally protected data with Hadoop must be coupled with
> > protection for private information that limits performance impact.
> > Project Rhino <https://github.com/intel-hadoop/project-rhino/> is our
> > open source effort to enhance the existing data protection capabilities
> > of the Hadoop ecosystem to address these challenges, and to contribute
> > the code back to Apache.
> >
> > The core of the Apache Hadoop ecosystem, as it is commonly understood, is:
> >
> > - Core: A set of shared libraries
> > - HDFS: The Hadoop filesystem
> > - MapReduce: Parallel computation framework
> > - ZooKeeper: Configuration management and coordination
> > - HBase: Column-oriented database on HDFS
> > - Hive: Data warehouse on HDFS with SQL-like access
> > - Pig: Higher-level programming language for Hadoop computations
> > - Oozie: Orchestration and workflow management
> > - Mahout: A library of machine learning and data mining algorithms
> > - Flume: Collection and import of log and event data
> > - Sqoop: Imports data from relational databases
> >
> > These components are all separate projects, so cross-cutting concerns
> > like authN, authZ, a consistent security policy framework, a consistent
> > authorization model, and audit coverage are loosely coordinated. Some
> > security features expected by our customers, such as encryption, are
> > simply missing. Our aim is to take a full-stack view and work with the
> > individual projects toward consistent concepts and capabilities, filling
> > gaps as we go.
> >
> > Our initial goals are:
> >
> > 1) Framework support for encryption and key management
> >
> > There is currently no framework support for encryption or key
> > management. We will add this support into Hadoop Core and integrate it
> > across the ecosystem.
> >
> > 2) A common authorization framework for the Hadoop ecosystem
> >
> > Each component currently has its own authorization engine. We will
> > abstract the common functions into a reusable authorization framework
> > with a consistent interface. Where appropriate we will either modify an
> > existing engine to work within this framework, or we will plug in a
> > common default engine. We must therefore also normalize how security
> > policy is expressed and applied by each component. Core, HDFS, ZooKeeper,
> > and HBase currently support simple access control lists (ACLs) composed
> > of users and groups.
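[The simple user-and-group ACL model described in goal 2 can be sketched as below. This is a minimal illustration under assumed semantics (access granted on a direct user match or on membership in any listed group); the class and method names are hypothetical and not from any Hadoop component.]

```python
class SimpleAcl:
    """A users-and-groups access control list, as a hypothetical sketch."""

    def __init__(self, users=None, groups=None):
        self.users = set(users or [])
        self.groups = set(groups or [])

    def permits(self, user, user_groups):
        # Grant access if the user is listed directly, or if the user
        # belongs to any group listed in the ACL.
        return user in self.users or bool(self.groups & set(user_groups))


read_acl = SimpleAcl(users=["alice"], groups=["analysts"])
print(read_acl.permits("alice", []))             # True: direct user match
print(read_acl.permits("bob", ["analysts"]))     # True: group match
print(read_acl.permits("mallory", ["interns"]))  # False: no match
```

[A common framework would evaluate checks like this behind one interface, with each component either adapting its existing engine or delegating to a shared default engine.]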
> > We see this as a good starting point. Where necessary we will modify
> > components so they each offer equivalent functionality, and build support
> > into others.
> >
> > 3) Token-based authentication and single sign-on
> >
> > Core, HDFS, ZooKeeper, and HBase currently support Kerberos
> > authentication at the RPC layer, via SASL. However, this does not provide
> > valuable attributes such as group membership, classification level,
> > organizational identity, or support for user-defined attributes. Hadoop
> > components must interrogate external resources to discover these
> > attributes, and at scale this is problematic. There is also no consistent
> > delegation model. HDFS has a simple delegation capability, and only Oozie
> > can take limited advantage of it. We will implement a common token-based
> > authentication framework to decouple internal user and service
> > authentication from the external mechanisms used to support it (such as
> > Kerberos).
> >
> > 4) Extend HBase support for ACLs to the cell level
> >
> > HBase currently supports setting access controls at the table or column
> > family level. However, many use cases would benefit from the additional
> > capability to do this on a per-cell basis. In fact, for many users
> > dealing with sensitive information the ability to do this is crucial.
> >
> > 5) Improve audit logging
> >
> > Audit messages from the various Hadoop components do not use a unified,
> > or even consistent, format. This makes it difficult to analyze logs to
> > verify compliance or take corrective action. We will build a common audit
> > logging facility as part of the common authorization framework work. We
> > will also build a set of common audit log processing tools for
> > transforming logs to different industry-standard formats, supporting
> > compliance verification, and triggering responses to policy violations.
> >
> > Current JIRAs:
> >
> > As part of this ongoing effort we are contributing our work to date
> > against the JIRAs listed below.
> > As you may appreciate, the goals for Project Rhino cover a number of
> > different Apache projects; the scope of work is significant and likely
> > only to increase as we get additional community input. We also recognize
> > that there may be others in the Apache community who are working on some
> > of this or are interested in contributing to it. If so, we look forward
> > to partnering with you in Apache to accelerate this effort so the Apache
> > community can see the benefits of our collective efforts sooner. You can
> > also find a more detailed version of this announcement at Project Rhino
> > <https://github.com/intel-hadoop/project-rhino/>.
> >
> > Please feel free to reach out to us by commenting on the JIRAs below:
> >
> > HBASE-6222: Add per-KeyValue Security
> > <https://issues.apache.org/jira/browse/hbase-6222>
> >
> > HADOOP-9331: Hadoop crypto codec framework and crypto codec
> > implementations <https://issues.apache.org/jira/browse/hadoop-9331> and
> > related sub-tasks
> >
> > MAPREDUCE-5025: Key Distribution and Management for supporting crypto
> > codec in Map Reduce
> > <https://issues.apache.org/jira/browse/mapreduce-5025> and related JIRAs
> >
> > HBASE-7544: Transparent table/CF encryption
> > <https://issues.apache.org/jira/browse/hbase-7544>
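[The per-cell access control idea behind goal 4 and HBASE-6222 can be sketched as below: ACLs stored at column-family granularity, with an optional per-cell entry that, when present, takes precedence. The precedence rule, the store layout, and all names here are illustrative assumptions, not HBase's actual design or API.]

```python
class CellAclStore:
    """Hypothetical sketch of family-level ACLs with per-cell overrides."""

    def __init__(self):
        self.family_acls = {}  # (table, family) -> set of users
        self.cell_acls = {}    # (table, family, row, qualifier) -> set of users

    def grant_family(self, table, family, user):
        self.family_acls.setdefault((table, family), set()).add(user)

    def grant_cell(self, table, family, row, qualifier, user):
        self.cell_acls.setdefault((table, family, row, qualifier), set()).add(user)

    def can_read(self, user, table, family, row, qualifier):
        # If a per-cell ACL exists, it takes precedence over the
        # column-family ACL; otherwise fall back to the family ACL.
        cell = self.cell_acls.get((table, family, row, qualifier))
        if cell is not None:
            return user in cell
        return user in self.family_acls.get((table, family), set())


store = CellAclStore()
store.grant_family("patients", "records", "dr_adams")
store.grant_cell("patients", "records", "row1", "ssn", "auditor")
print(store.can_read("dr_adams", "patients", "records", "row1", "name"))  # True
print(store.can_read("dr_adams", "patients", "records", "row1", "ssn"))   # False
```

[The point of the sketch is the granularity gap: with only table- or column-family-level ACLs, the sensitive "ssn" cell could not be restricted more tightly than the rest of its family.]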