+1 (binding)

--Chris Nauroth




On 10/23/15, 7:11 AM, "Manoharan, Arun" <armanoha...@ebay.com> wrote:

>Hello Everyone,
>
>Thanks for all the feedback on the Eagle Proposal.
>
>I would like to call for a [VOTE] on Eagle joining the ASF as an
>incubation project.
>
>The vote is open for 72 hours:
>
>[ ] +1 accept Eagle in the Incubator
>[ ] ±0
>[ ] -1 (please give reason)
>
>Eagle is a monitoring solution for Hadoop that instantly identifies
>access to sensitive data, recognizes attacks and malicious activities,
>and takes action in real time. Eagle supports a wide variety of policies
>on HDFS data and Hive. Eagle also provides machine learning models for
>detecting anomalous user behavior in Hadoop.
>
>The proposal is available on the wiki here:
>https://wiki.apache.org/incubator/EagleProposal
>
>The text of the proposal is also available at the end of this email.
>
>Thanks for your time and help.
>
>Thanks,
>Arun
>
><COPY of the proposal in text format>
>
>Eagle
>
>Abstract
>Eagle is an open source monitoring solution for Hadoop that instantly
>identifies access to sensitive data, recognizes attacks and malicious
>activities in Hadoop, and takes action.
>
>Proposal
>Eagle audits access to HDFS files and to Hive and HBase tables in real
>time, enforces policies defined on sensitive data access, and alerts on
>or blocks a user's access to that sensitive data in real time. Eagle
>also creates user profiles based on typical access behaviour for HDFS
>and Hive and sends alerts when anomalous behaviour is detected. Eagle
>can also import sensitive data information classified by external
>classification engines to help define its policies.
>
>Overview of Eagle
>Eagle has 3 main parts.
>1. Data collection and storage - Eagle collects data from various Hadoop
>logs in real time using Kafka and the YARN API, and uses HDFS and HBase
>for storage.
>2. Data processing and policy engine - Eagle allows users to create
>policies based on various metadata properties of HDFS, Hive and HBase
>data.
>3. Eagle services - Eagle services include the policy manager, the query
>service and the visualization component. Eagle provides an intuitive
>user interface to administer Eagle and an alert dashboard to respond to
>real-time alerts.
>
>Data Collection and Storage:
>Eagle provides a programming API for integrating any data source into
>the Eagle policy evaluation framework. For example, Eagle HDFS audit
>monitoring collects data from Kafka, which is populated by a NameNode
>log4j appender or by a Logstash agent. Eagle Hive monitoring collects
>Hive query logs from running jobs through the YARN API, which is
>designed to be scalable and fault-tolerant. Eagle uses HBase to store
>metadata and metrics data, and also supports relational databases
>through a configuration change.
>
>Data Processing and Policy Engine:
>Processing Engine: Eagle provides a stream processing API that is an
>abstraction over Apache Storm and can be extended to other streaming
>engines. This abstraction allows developers to assemble data
>transformation, filtering, external data joins etc. without being bound
>to a specific streaming platform. The Eagle streaming API allows
>developers to easily integrate business logic with the Eagle policy
>engine; internally, the Eagle framework compiles the business logic
>execution DAG into program primitives of the underlying stream
>infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring
>transforms each audit log entry from the NameNode into an object and
>joins it with sensitivity metadata and security zone metadata, which
>are generated by external programs or configured by the user. Eagle
>Hive monitoring filters running jobs to get the Hive query string,
>parses the query string into an object and then joins it with
>sensitivity metadata.
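
The engine-neutral pipeline idea in the paragraph above might be sketched
roughly as follows, in Python for brevity. This is only an illustration
under assumptions: `Pipeline`, `parse_audit`, the key=value log format and
the `SENSITIVITY` table are all hypothetical and are not Eagle's API. A
real runner would compile the steps onto Storm instead of running them
in-process.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Event = Dict[str, str]
# A step maps an event to a new event, or to None to drop it.
Step = Callable[[Event], Optional[Event]]


class Pipeline:
    """Hypothetical engine-neutral pipeline: an ordered list of steps."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def map(self, fn: Callable[[Event], Event]) -> "Pipeline":
        self.steps.append(fn)
        return self

    def filter(self, pred: Callable[[Event], bool]) -> "Pipeline":
        self.steps.append(lambda e: e if pred(e) else None)
        return self

    def join(self, table: Dict[str, str], key: str, out: str) -> "Pipeline":
        # Enrich each event with a value looked up in external metadata.
        self.steps.append(lambda e: {**e, out: table.get(e[key], "NONE")})
        return self

    def run_local(self, events: Iterable[Event]) -> Iterator[Event]:
        # In-process runner; a different runner could target a stream engine.
        for event in events:
            current: Optional[Event] = event
            for step in self.steps:
                current = step(current)
                if current is None:
                    break
            if current is not None:
                yield current


def parse_audit(line: str) -> Event:
    # Assumed whitespace-separated key=value pairs; a sample format only.
    return dict(pair.split("=", 1) for pair in line.split())


SENSITIVITY = {"/data/pii/users.csv": "PII"}  # made-up metadata

pipeline = (
    Pipeline()
    .filter(lambda e: e["cmd"] == "open")
    .join(SENSITIVITY, key="src", out="sensitivity")
)

lines = [
    "user=alice cmd=open src=/data/pii/users.csv",
    "user=bob cmd=delete src=/tmp/scratch",
]
results = list(pipeline.run_local(parse_audit(line) for line in lines))
```

Because the pipeline is just a list of steps, the same definition could in
principle be handed to a different runner without changing the business
logic, which is the point of the abstraction.
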
>Alerting Framework: The Eagle Alert Framework includes a stream metadata
>API, a scalable policy engine framework and an extensible policy engine
>framework. The stream metadata API allows developers to declare an
>event schema, including which attributes constitute an event, the type
>of each attribute, and how to dynamically resolve attribute values at
>runtime when a user configures a policy. The scalable policy engine
>framework allows policies to be executed on different physical nodes in
>parallel, and also lets developers define their own policy partitioner
>class. The policy engine framework, together with the stream
>partitioning capability provided by all streaming platforms, ensures
>that policies and events can be evaluated in a fully distributed way.
>The extensible policy engine framework allows developers to plug in a
>new policy engine with a few lines of code. The WSO2 Siddhi CEP engine
>is the policy engine that Eagle supports as a first-class citizen.
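
One way to picture the policy partitioning described above is a stable
hash from policy id to partition, so that each node evaluates only its
own share of the policies. The sketch below is an assumption-laden
illustration: `Policy`, `partition_of`, `assign` and the sample policies
are hypothetical names, and the hashing scheme is not taken from Eagle.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]


class Policy:
    def __init__(self, policy_id: str, matches: Callable[[Event], bool]) -> None:
        self.policy_id = policy_id
        self.matches = matches


def partition_of(policy_id: str, num_partitions: int) -> int:
    # Stable hash so that every node computes the same assignment.
    digest = hashlib.sha256(policy_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def assign(policies: List[Policy], num_partitions: int) -> Dict[int, List[Policy]]:
    buckets: Dict[int, List[Policy]] = defaultdict(list)
    for policy in policies:
        buckets[partition_of(policy.policy_id, num_partitions)].append(policy)
    return buckets


def evaluate(partition: List[Policy], event: Event) -> List[str]:
    # Each node evaluates only the policies in its own partition.
    return [p.policy_id for p in partition if p.matches(event)]


policies = [
    Policy("pii-read", lambda e: e.get("sensitivity") == "PII"),
    Policy("bulk-delete", lambda e: e.get("cmd") == "delete"),
    Policy("off-hours", lambda e: e.get("hour") in {"01", "02", "03"}),
]
buckets = assign(policies, num_partitions=2)

event = {"user": "alice", "cmd": "open", "sensitivity": "PII", "hour": "02"}
alerts = sorted(
    pid for part in buckets.values() for pid in evaluate(part, event)
)
```

A custom partitioner would replace `partition_of`, for example to keep
related policies on the same node.
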
>Machine Learning module: Eagle provides capabilities to define user
>activity patterns, or user profiles, for Hadoop users based on their
>behaviour in the platform. These user profiles are modeled using
>machine learning algorithms and used to detect anomalous user
>activities. Eagle uses eigenvalue decomposition and density estimation
>algorithms to generate user profile models. The module reads data from
>HDFS audit logs, preprocesses and aggregates the data, and generates
>models using the Spark programming APIs. Once models are generated,
>Eagle uses the stream processing engine for near real-time anomaly
>detection to determine whether any user's activities are suspicious.
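
As a rough illustration of density-based anomaly detection, the sketch
below builds a one-dimensional Gaussian kernel density estimate from a
user's past activity and flags values whose estimated density is low.
It is a simplified stand-in, not Eagle's model: the proposal says the
real models are generated with Spark, and the feature (reads per
minute), bandwidth, threshold and sample data here are all made up.
Eigenvalue decomposition is not shown.

```python
import math
from typing import Callable, Sequence


def gaussian_kde(samples: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Return a 1-D Gaussian kernel density estimate built from samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def is_anomalous(density: Callable[[float], float], x: float, threshold: float) -> bool:
    # Flag activity whose estimated density falls below the threshold.
    return density(x) < threshold


# Made-up history: file reads per minute for one user.
history = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0]
profile = gaussian_kde(history, bandwidth=1.5)

typical = is_anomalous(profile, 11.0, threshold=0.01)  # near the usual rate
burst = is_anomalous(profile, 60.0, threshold=0.01)    # far outside it
```

In this toy setup the profile would be built offline from audit logs and
`is_anomalous` would be the check applied to each new observation in the
stream.
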
>
>Eagle Services:
>Query Service: Eagle provides a SQL-like service API to support
>comprehensive computation over huge data sets on the fly, for example
>filtering, aggregation, histograms, sorting, top, arithmetic
>expressions, pagination etc. HBase is the data storage that Eagle
>supports as a first-class citizen; relational databases are supported
>as well. For HBase storage, the Eagle query framework compiles a
>user-provided SQL-like query into HBase native filter objects and
>executes them through an HBase coprocessor on the fly.
>Policy Manager: The Eagle policy manager provides a UI and a RESTful API
>for users to define policies with just a few clicks. It includes a site
>management UI, a policy editor, sensitivity metadata import, HDFS or
>Hive sensitive resource browsing, alert dashboards etc.
>
>Background
>Data is one of the most important assets for today's businesses, which
>makes data security one of the top priorities of today's enterprises.
>Hadoop is widely used across different verticals as a big data
>repository to store this data in most modern enterprises.
>At eBay we use the Hadoop platform extensively for our data processing
>needs. Our data in Hadoop keeps growing as our user base sees
>exponential growth. Today there is a variety of data sets available in
>our Hadoop clusters for our users to consume. eBay has around 120 PB of
>data stored in HDFS across 6 different clusters and 1800+ active Hadoop
>users consuming data through Hive, HBase and MapReduce jobs every day
>to build applications using this data. With this astronomical growth of
>data come challenges in securing sensitive data and monitoring access
>to it. Today in large organizations HDFS is the de facto standard for
>storing big data. Data sets including, but not limited to, consumer
>sentiment, social media data, customer segmentation, web clicks, sensor
>data, geo-location and transaction data are stored in Hadoop for
>day-to-day business needs.
>We at eBay want to make sure that sensitive data and data platforms are
>completely protected from security breaches, so we partnered very
>closely with our Information Security team to understand the
>requirements for Eagle to monitor sensitive data access on Hadoop:
>1. Ability to identify and stop security threats in real time
>2. Scale for big data (support PB scale and billions of events)
>3. Ability to create data access policies
>4. Support for multiple data sources like HDFS, HBase, Hive
>5. Visualize alerts in real time
>6. Ability to block malicious access in real time
>We did not find any data access monitoring solution available today
>that provides the features and functionality we need to monitor data
>access in the Hadoop ecosystem at our scale. Hence, with an excellent
>team of world-class developers and several users, we have been able to
>bring Eagle into production as well as open source it.
>
>Rationale
>In today's world, data is an important asset for any company.
>Businesses are using data extensively to create amazing experiences for
>users. Data has to be protected, and access to data should be secured
>against security breaches. Today Hadoop is used to store not only logs
>but also financial data, sensitive data sets, geographical data, user
>click stream data sets etc., which makes it all the more important to
>protect it from security breaches. Securing a data platform requires
>several things. One is a strong access control mechanism, which today
>is provided by Apache Ranger and Apache Sentry. These tools provide
>fine-grained access control for data sets on Hadoop. But there is a big
>gap in terms of monitoring all the data access events and activities in
>order to secure the Hadoop data platform. With strong access control,
>perimeter security and data access monitoring in place together, data
>in Hadoop clusters can be secured against breaches. We looked around
>and found the following:
>Existing data activity monitoring products are designed for traditional
>databases and data warehouses. Existing monitoring platforms cannot
>scale out to support fast-growing data at petabyte scale. A few
>products in the industry support HDFS, Hive and HBase data access
>monitoring, but they are still at a very early stage.
>As mentioned in the background, the business requirement and urgency to
>secure the data from users with malicious intent drove eBay to invest in
>building a real-time data access monitoring solution from scratch that
>offers real-time alerts and remediation features for malicious data
>access.
>With the power of open source distributed systems like Hadoop, Kafka and
>many more, we were able to develop a data activity monitoring system
>that can scale, and can identify and stop malicious access in real time.
>Eagle allows admins to create standard access policies and rules for
>monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
>machine learning models that build user profiles from user access
>behaviour and uses those models to alert on anomalies.
>
>Current Status
>
>Meritocracy
>Eagle has been deployed in production at eBay for monitoring billions of
>events per day from HDFS and Hive operations. From the start, the
>product has been built with a focus on high scalability and application
>extensibility, and Eagle has demonstrated great performance in
>responding to suspicious events instantly and great flexibility in
>defining policies.
>
>Community
>Eagle seeks to develop the developer and user communities during
>incubation.
>
>Core Developers
>Eagle is currently being designed and developed by engineers from eBay
>Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
>Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
>these core developers have deep expertise in developing monitoring
>products for the Hadoop ecosystem.
>
>Alignment
>The ASF is a natural host for Eagle given that it is already the home of
>Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
>projects. Eagle leverages many Apache open-source products. Eagle was
>designed to offer real-time insights into sensitive data access by
>actively monitoring data access on various data sets in Hadoop, together
>with an extensible alerting framework and a powerful policy engine.
>Eagle complements the existing Hadoop platform by providing a
>comprehensive monitoring and alerting solution for detecting sensitive
>data access threats based on preset policies and machine learning models
>for user behaviour analysis.
>
>Known Risks
>
>Orphaned Products
>The core developers of the Eagle team work full time on this project.
>There is no risk of Eagle getting orphaned since eBay is using it
>extensively in its production Hadoop clusters and has plans to go beyond
>Hadoop. For example, there are currently 7 Hadoop clusters, and 2 of
>them are being monitored using Hadoop Eagle in production. We have plans
>to extend it to all Hadoop clusters and eventually other data platforms.
>There are tens of policies onboarded and actively monitored, with plans
>to onboard more use cases. We are very confident that every Hadoop
>cluster in the world will be monitored using Eagle, securing the Hadoop
>ecosystem by actively monitoring data access on sensitive data. We plan
>to extend and diversify this community further through Apache. We
>presented Eagle at the Hadoop Summit in China and garnered interest from
>different companies that use Hadoop extensively.
>
>Inexperience with Open Source
>The core developers are all active users and followers of open source.
>They are already committers and contributors to the Eagle GitHub
>project. All have been involved with source code that has been released
>under an open source license, and several of them also have experience
>developing code in an open source environment. Though the core
>developers do not have Apache open source experience, there are plans to
>onboard individuals with Apache open source experience onto the project.
>Apache Kylin PMC members are also in the same eBay organization. We work
>very closely with Apache Ranger committers and are looking forward to
>finding meaningful integrations to improve the security of the Hadoop
>platform.
>
>Homogenous Developers
>The core developers are from eBay. Today, monitoring data activities to
>find and stop threats is a universal problem faced by all businesses.
>The Apache Incubation process encourages an open and diverse
>meritocratic community. Eagle intends to make every possible effort to
>build a diverse, vibrant and involved community and has already received
>substantial interest from various organizations.
>
>Reliance on Salaried Developers
>eBay invested in Eagle as the monitoring solution for Hadoop clusters,
>and some of its key engineers are working full time on the project. In
>addition, since there is a growing need for securing sensitive data
>access and for a data activity monitoring solution for Hadoop, we look
>forward to other Apache developers and researchers contributing to the
>project. Additional contributors, including Apache committers, have
>plans to join this effort shortly. Also key to addressing the risk
>associated with relying on salaried developers from a single entity is
>to increase the diversity of the contributors and to actively lobby for
>domain experts in the security space to contribute. Eagle intends to do
>this.
>
>Relationships with Other Apache Products
>Eagle has a strong relationship with, and dependency on, Apache Hadoop,
>HBase, Spark, Kafka and Storm. Being part of Apache's Incubation
>community could help with closer collaboration among these projects as
>well as others.
>
>An Excessive Fascination with the Apache Brand
>Eagle is proposing to enter incubation at Apache in order to help
>efforts to diversify the committer base, not so much to capitalize on
>the Apache brand. The Eagle project is in production use already inside
>eBay, but is not expected to be an eBay product for external customers.
>As such, the Eagle project is not seeking to use the Apache brand as a
>marketing tool.
>
>Documentation
>Information about Eagle can be found at https://github.com/eBay/Eagle.
>The following link provides more information about Eagle:
>http://goeagle.io/
>
>Initial Source
>Eagle has been under development since 2014 by a team of engineers at
>eBay Inc. It is currently hosted on GitHub under the Apache License 2.0
>at https://github.com/eBay/Eagle. Once in incubation we will be moving
>the code base to an Apache Git repository.
>
>External Dependencies
>Eagle has the following external dependencies.
>Basic
>• JDK 1.7+
>• Scala 2.10.4
>• Apache Maven
>• JUnit
>• Log4j
>• Slf4j
>• Apache Commons
>• Apache Commons Math3
>• Jackson
>• Siddhi CEP engine
>
>Hadoop
>• Apache Hadoop
>• Apache HBase
>• Apache Hive
>• Apache Zookeeper
>• Apache Curator
>
>Apache Spark
>• Spark Core Library
>
>REST Service
>• Jersey
>
>Query
>• Antlr
>
>Stream processing
>• Apache Storm
>• Apache Kafka
>
>Web
>• AngularJS
>• jQuery
>• Bootstrap V3
>• Moment JS
>• Admin LTE
>• html5shiv
>• respond
>• Fastclick
>• Date Range Picker
>• Flot JS
>
>Cryptography
>Eagle will eventually support encryption on the wire. This is not one of
>the initial goals, and we do not expect Eagle to be a controlled export
>item due to the use of encryption. Eagle supports but does not require
>the Kerberos authentication mechanism to access secured Hadoop services.
>
>Required Resources
>
>Mailing List
>• eagle-private for private PMC discussions
>• eagle-dev for developers
>• eagle-commits for all commits
>• eagle-users for all Eagle users
>
>Subversion Directory
>• Git is the preferred source control system.
>
>Issue Tracking
>• JIRA Eagle (Eagle)
>
>Other Resources
>The existing code already has unit tests so we will make use of existing
>Apache continuous testing infrastructure. The resulting load should not
>be very large.
>
>Initial Committers
>• Seshu Adunuthula <sadunuthula at ebay dot com>
>• Arun Manoharan <armanoharan at ebay dot com>
>• Edward Zhang <yonzhang at ebay dot com>
>• Hao Chen <hchen9 at ebay dot com>
>• Chaitali Gupta <cgupta at ebay dot com>
>• Libin Sun <libsun at ebay dot com>
>• Jilin Jiang <jiljiang at ebay dot com>
>• Qingwen Zhao <qingwzhao at ebay dot com>
>• Hemanth Dendukuri <hdendukuri at ebay dot com>
>• Senthil Kumar <senthilkumar at ebay dot com>
>
>
>Affiliations
>The initial committers are employees of eBay Inc.
>
>Sponsors
>
>Champion
>• Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>
>Nominated Mentors
>• Owen O'Malley <omalley at apache dot org> - Apache IPMC member,
>Hortonworks
>• Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>• Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
>Hortonworks
>• Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC
>member
>• Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member,
>Hortonworks
>
>Sponsoring Entity
>We are requesting the Incubator to sponsor this project.
>


