Re: [DISCUSS] Top domains enrichment config/extractor management

Michael Miklavcic Fri, 24 Feb 2017 14:46:48 -0800

I posed this question to the community because I started to recognize some
of the shortcomings of doing this solely through Ambari, as you and Nick
have pointed out. I think an Ambari view over the management UI is a great
idea, and I'd love to see us provide a more robust mechanism for loading
these enrichments via the management UI. As you said, perhaps Ambari could
manage the ZK config around active enrichments/locations (the "USE" part
of it), while the management UI handles actually loading and managing the
enrichments themselves?
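
To make that split concrete, here is a rough sketch of the versioned-pointer
idea from the original proposal quoted below (a "top-domains" property that
points at a timestamped key such as top-domains_20170221042000). It is only
an illustration under stated assumptions: the EnrichmentStore class and its
method names are hypothetical, and plain dicts stand in for HBase and
ZooKeeper, so nothing here is an actual Metron, HBase, or ZooKeeper API.

```python
from datetime import datetime


class EnrichmentStore:
    """Hypothetical in-memory stand-in: dicts play the roles of HBase and ZK."""

    def __init__(self):
        self.tables = {}   # versioned key -> {indicator: value}; stands in for HBase
        self.config = {}   # pointer name -> active versioned key; stands in for ZK
        self.history = {}  # pointer name -> previously active keys, oldest first

    def load(self, name, rows, when):
        """Load a new version under a timestamped key without activating it."""
        key = f"{name}_{when:%Y%m%d%H%M%S}"
        self.tables[key] = dict(rows)
        return key

    def activate(self, name, key):
        """One pointer write: readers see the old version or the new, never a mix."""
        if key not in self.tables:
            raise KeyError(f"unknown enrichment version: {key}")
        previous = self.config.get(name)
        if previous is not None:
            self.history.setdefault(name, []).append(previous)
        self.config[name] = key

    def rollback(self, name):
        """Point back at the most recently replaced version."""
        self.config[name] = self.history[name].pop()

    def lookup(self, name, indicator):
        """Resolve the pointer first, then read from the version it names."""
        return self.tables[self.config[name]].get(indicator)

    def cleanup(self, name):
        """Delete versions of `name` that are neither active nor kept in history."""
        keep = {self.config.get(name), *self.history.get(name, [])}
        stale = [k for k in self.tables
                 if k.startswith(name + "_") and k not in keep]
        for k in stale:
            del self.tables[k]
        return stale
```

In this sketch the management UI would own load() and cleanup(), while
whatever manages the ZK config (Ambari, in the split suggested above) would
own activate() and rollback(). A scheduled reload from cron would then be a
load() followed by a single activate().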



On Fri, Feb 24, 2017 at 8:12 AM, Casey Stella <[email protected]> wrote:

> Late to chime in here, but I feel that we have discussed Ambari's role
> before, and I think we should probably clarify a few things as a community
> with regard to Ambari vs. a management UI built around the REST PR
> currently under review (I promise, I will get to the topic at hand
> eventually ;):
>
>    - Where functionality should live
>    - Who is responsible for what
>
> I will now make a couple of (possibly controversial) statements, some of
> which we have actually discussed on the dev list before:
>
>
>    - I view Ambari as managing the install and the static configuration for
>    Metron.  For us, this would include Zookeeper configs as well as
>    topology configuration.  This would be the persistent store of truth.
>    - I view Zookeeper to be our runtime configuration store for the
>    topologies.
>    - I view a management UI (and the Stellar Shell) as managing
>    functionality for interacting with the system.  Where it changes
>    configuration, it must go through Ambari.
>    - I believe the management UI should be exposed as an Ambari view.
>
> As such, I see the importation and management of enrichments, which is a
> data task, as squarely in the purview of the management UI, whose job is
> the care and feeding of the data.  That being said, any configuration
> changes to USE the enrichment should at least be routed through Ambari, but
> should be managed in the UI.
>
> Now the question becomes: should we have enrichment collateral (I'm
> including both HBase as well as geo or anything else we have) loaded at
> install-time?  I would argue that we should not.  Rather, we should design
> the management UI so that the enrichments can be added easily, with a
> wizard to enable the use of the enrichment via Stellar for a sensor.
>
> On that topic, I think we are doing too much as part of our install.  I
> would argue that we shouldn't pre-load even the geo data or depend on it
> for the default parsers.
>
> Casey
>
>
>
> On Tue, Feb 21, 2017 at 6:31 PM, Michael Miklavcic <
> [email protected]> wrote:
>
> > With the work committed in
> > https://github.com/apache/incubator-metron/pull/445 and
> > https://github.com/apache/incubator-metron/pull/432, we now have a
> > robust and flexible means to import enrichment sources and transform
> > their contents as they are inserted into HBase. One of the main
> > motivators for this new functionality was to add the ability to load top
> > domain rankings from sources such as Alexa. The proposal is to make this
> > type of enrichment a top-level feature in Metron by introducing it to
> > the Ambari management UI as a configurable set of properties in the MPack
> > install. This comes with some options and challenges in how we want to
> > manage the configurations, which I will outline below.
> >
> > *Use cases:*
> >
> >    - Single load of top domains file
> >    - Re-loading top domains file - need to be able to clean up properly
> >    - Cleaning up/deleting old enrichment data (this is a general feature
> >    that we currently lack - I think it is worth a separate Jira/PR for
> >    creating a MapReduce job that enables cleanup to occur).
> >    - Modifying default top domains file source - there are other options
> >    besides Alexa. And users may want to load a file from a local URI
> >    since many data centers do not have direct access to the internet.
> >    - Ability to modify the default extractor config JSON and tune the
> >    Stellar transformations for both the value and indicator transforms.
> >    Allows more flexible handling of data based on other sources.
> >    - Loading multiple top domains source enrichments. (Maybe a separate
> >    PR for this if we even think it would be useful)
> >    - Updating the top domain enrichment - This needs to be an atomic
> >    operation in order to prevent incorrect data.
> >    - Rolling back to an older version of the top domains enrichment. Also
> >    needs to be atomic.
> >    - Ability to run an enrichment load on a schedule - we would like to
> >    defer this to an external scheduling mechanism, e.g. cron or
> >    Control-M. The enrichment loading system should have the necessary
> >    features to enable this type of automation without data integrity
> >    issues.
> >
> > *Considerations:*
> >
> >    - As mentioned above, we want to add this feature to the Ambari MPack.
> >    This requires at least 2 parameters to work. We need the ability to
> >    specify a URI as well as an extractor config.
> >    - How do we want to manage the extractor config? The most obvious
> >    solution is to provide a text field in Ambari with a default JSON
> >    config. When a load is initiated, Ambari would place a fresh copy of
> >    the extractor config in the /tmp/ directory. This is an ephemeral file
> >    that isn't needed other than during a load.
> >    - It seems easy enough to have the load occur during the initial
> >    install, however subsequent loads would require a different workflow.
> >    How do folks feel about adding a set of dropdown options in the Ambari
> >    UI for loading, updating, and deleting the top domains enrichment? I
> >    believe we are doing something similar for the Elasticsearch templates
> >    currently.
> >    - In the case of atomic operations for updates and rollbacks, I
> >    propose we add a property to Zookeeper that is reference-able in the
> >    enrichment itself. The idea would be to create a "top-domains"
> >    property in ZK that points to an enrichment key with a load timestamp
> >    associated with it, e.g. top-domains_20170221042000. This would also
> >    allow a MapReduce job to be written that cleans up old enrichments.
> >    Another option is to create a new table in HBase if/when you update
> >    the enrichment and change the enrichment config manually. Deleting an
> >    old enrichment would simply be a matter of dropping the table in
> >    HBase. A relevant discussion of the tradeoffs of having many small
> >    tables versus 1 large table can be found here -
> >    http://grokbase.com/t/hbase/user/11bjbdw94q/multiple-tables-vs-big-fat-table
> >    - In order to update or roll back an enrichment as mentioned above, we
> >    would also ideally provide a mechanism for changing the rowkey pointed
> >    to by the enrichment.
> >
> > In summary of the use cases and considerations above, this boils down to
> > how we'd like to leverage Ambari here. Do we want Ambari to handle only
> > the initial install/load and have end users be responsible on an ongoing
> > basis for updates (users would be responsible for copying or
> > distributing the extractor_config.json for instance), or do we want to
> > enable Ambari to manage the configuration on an ongoing basis and enable
> > functionality for reloading, updating, and rollback?
> >
> > Best,
> > Mike
> >
>
