Re: [DISCUSS] Top domains enrichment config/extractor management

Casey Stella Fri, 24 Feb 2017 07:13:28 -0800

Late to chime in here, but I feel that we have discussed Ambari's role
before, and I think we should clarify, as a community, a few things
with regard to Ambari vs. a management UI built around the REST PR currently
under review (I promise I will get to the topic at hand eventually ;):


   - Where functionality should live
   - Who is responsible for what

I will now make a few (possibly controversial) statements, some of
which we have already discussed on the dev list:


   - I view Ambari as managing the install and the static configuration for
   Metron.  For us, this would include Zookeeper configs as well as topology
   configuration.  This would be the persistent store of truth.
   - I view Zookeeper as our runtime configuration store for the
   topologies.
   - I view a management UI (and the Stellar Shell) as managing
   functionality for interacting with the system.  Where it changes
   configuration, it must go through Ambari.
   - I believe the management UI should be exposed as an Ambari view.
As such, I see the importation and management of enrichments, which is a
data task, as squarely in the purview of the management UI, whose job is
the care and feeding of the data.  That being said, any configuration
changes needed to USE the enrichment should be managed in the UI but
routed, at the least, through Ambari.

Now the question becomes: should we have enrichment collateral (I'm
including both HBase as well as geo or anything else we have) loaded at
install time?  I would argue that we should not.  Rather, we should design
the management UI so that enrichments can be added easily, with a
wizard to enable the use of the enrichment via Stellar for a sensor.

On that topic, I think we are doing too much as part of our install.  I
would argue that we shouldn't pre-load even the geo data or depend on it
for the default parsers.

Casey



On Tue, Feb 21, 2017 at 6:31 PM, Michael Miklavcic wrote:

> With the work committed in
> https://github.com/apache/incubator-metron/pull/445 and
> https://github.com/apache/incubator-metron/pull/432, we now have a robust
> and flexible means to import enrichment sources and transform their
> contents as they are inserted into HBase. One of the main motivators for
> this new functionality was to add the ability to load top domain rankings
> from sources such as Alexa. The proposal is to make this type of enrichment
> a top-level feature in Metron by introducing it to the Ambari management UI
> as a configurable set of properties in the MPack install. This comes with
> some options and challenges in how we want to manage the configurations,
> which I will outline below.
>
> *Use cases:*
>
>    - Single load of top domains file
>    - Re-loading top domains file - need to be able to clean up properly
>    - Cleaning up/deleting old enrichment data (this is a general feature
>    that we currently lack - I think it is worth a separate Jira/PR for
>    creating a MapReduce job that enables cleanup to occur).
>    - Modifying default top domains file source - there are other options
>    besides Alexa, and users may want to load a file from a local URI since
>    many data centers do not have direct access to the internet.
>    - Ability to modify the default extractor config JSON and tune the
>    Stellar transformations for both the value and indicator transforms.
>    This allows more flexible handling of data based on other sources.
>    - Loading multiple top domains source enrichments. (Maybe a separate PR
>    for this if we even think it would be useful.)
>    - Updating the top domain enrichment - this needs to be an atomic
>    operation in order to prevent incorrect data.
>    - Rolling back to an older version of the top domains enrichment. Also
>    needs to be atomic.
>    - Ability to schedule an enrichment load - we would like to defer this
>    to an external scheduling mechanism, e.g. cron or Control-M. The
>    enrichment loading system should have the necessary features to enable
>    this type of automation without data integrity issues.
>
> *Considerations:*
>
>    - As mentioned above, we want to add this feature to the Ambari MPack.
>    This requires at least 2 parameters to work: a URI and an extractor
>    config.
>    - How do we want to manage the extractor config? The most obvious
>    solution is to provide a text field in Ambari with a default JSON
>    config. When a load is initiated, Ambari would place a fresh copy of
>    the extractor config in the /tmp/ directory. This is an ephemeral file
>    that isn't needed other than during a load.
>    - It seems easy enough to have the load occur during the initial
>    install; however, subsequent loads would require a different workflow.
>    How do folks feel about adding a set of dropdown options in the Ambari
>    UI for loading, updating, and deleting the top domains enrichment? I
>    believe we are doing something similar for the Elasticsearch templates
>    currently.
>    - In the case of atomic operations for updates and rollbacks, I propose
>    we add a property to Zookeeper that is reference-able in the enrichment
>    itself. The idea would be to create a "top-domains" property in ZK that
>    points to an enrichment key with a load timestamp associated with it,
>    e.g. top-domains_20170221042000. This would also allow a MapReduce job
>    to be written that cleans up old enrichments. Another option is to
>    create a new table in HBase if/when you update the enrichment and
>    change the enrichment config manually. Deleting an old enrichment would
>    simply be a matter of dropping the table in HBase. A relevant discussion
>    of the tradeoffs of having many small tables versus 1 large table can
>    be found here -
>    http://grokbase.com/t/hbase/user/11bjbdw94q/multiple-tables-vs-big-fat-table
>    - In order to update or roll back an enrichment as mentioned above, we
>    would also ideally provide a mechanism for changing the rowkey pointed
>    to by the enrichment.
>
> In summary of the use cases and considerations above, this boils down to
> how we'd like to leverage Ambari here. Do we want Ambari to handle only the
> initial install/load and have end users be responsible on an ongoing basis
> for updates (users would be responsible for copying or distributing the
> extractor_config.json for instance), or do we want to enable Ambari to
> manage the configuration ongoing and enable functionality for reloading,
> updating, and rollback?
>
> Best,
> Mike
>
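
The timestamped-key idea in the quoted message (a "top-domains" property
that points at a versioned enrichment key such as
top-domains_20170221042000) can be illustrated with a minimal in-memory
sketch.  Everything here is an assumption made for illustration: plain
Python dicts stand in for the HBase table and the Zookeeper property, and
the class and method names are hypothetical, not Metron APIs.  The point is
only that readers resolve the pointer on every lookup, so an update or a
rollback is a single pointer change rather than a rewrite of the data.

```python
from datetime import datetime


class VersionedEnrichment:
    """In-memory stand-in for the proposal (hypothetical, not a Metron API).

    `rows` plays the role of the HBase enrichment table and `pointer` the
    role of the single "top-domains" property held in Zookeeper.
    """

    def __init__(self, name="top-domains"):
        self.name = name
        self.rows = {}        # (versioned key, indicator) -> value
        self.versions = []    # loaded versions, oldest first
        self.pointer = None   # version that lookups currently resolve to

    def load(self, records, when):
        """Write a full new version, then switch the pointer to it."""
        version = f"{self.name}_{when:%Y%m%d%H%M%S}"
        for indicator, value in records.items():
            self.rows[(version, indicator)] = value
        self.versions.append(version)
        # Readers only see the new data once this one assignment happens.
        self.pointer = version
        return version

    def lookup(self, indicator):
        """Resolve an indicator against whichever version is active."""
        return self.rows.get((self.pointer, indicator))

    def rollback(self, version):
        """Point back at a previously loaded version; no data is rewritten."""
        if version not in self.versions:
            raise KeyError(version)
        self.pointer = version

    def cleanup(self):
        """Delete every version except the active one; return what was dropped."""
        stale = [v for v in self.versions if v != self.pointer]
        self.rows = {k: v for k, v in self.rows.items() if k[0] == self.pointer}
        self.versions = [self.pointer]
        return stale


store = VersionedEnrichment()
first = store.load({"google.com": 1}, datetime(2017, 2, 21, 4, 20, 0))
second = store.load({"google.com": 2, "example.org": 9},
                    datetime(2017, 2, 22, 4, 20, 0))
store.rollback(first)      # lookups now resolve against the first load again
dropped = store.cleanup()  # removes the rows written by the second load
```

In a real deployment the pointer change would have to be a single
Zookeeper write and the cleanup a batch job (the MapReduce job mentioned
above); the sketch does not model concurrency or failures during a load.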
