Re: [DISCUSS] Top domains enrichment config/extractor management

Nick Allen Fri, 24 Feb 2017 06:31:26 -0800

> we now have a robust
> and flexible means to import enrichment sources and transform their
> contents as they are inserted into HBase. One of the main motivators for
> this new functionality was to add the ability to load top domain rankings
> from sources such as Alexa. The proposal is to make this type of enrichment
> a top-level feature in Metron by introducing it to the Ambari management UI



(1) In thinking through how the UI should work here, we should consider
data sources beyond just those that would be loaded in HBase.  I would
think the UI should be a single view of all data sources, no matter whether
they load into HBase or not.

It would also be good to think through how the solution might handle
updating other types of data sources, such as the geo data.  The geo data
needs to be updated on a regular basis.  Could this solution also manage
that?

I know MaxMind has a bit of code to manage updating their data, but I am
not familiar with what it does or how it works.  Researching that might
help inform this conversation.


> How do folks feel about adding a set of dropdown options in the Ambari UI
> for loading, updating, and deleting the top domains enrichment?


(2) I think if this functionality is truly useful, there are likely going
to be lots of different data sources made available, many of which will
NOT be applicable or desirable in every environment.

This would be akin to packages or RPMs that are available to install on
CentOS.  There are many to choose from, but in my specific environment
there are many that I do not care about.

Is an Ambari drop down scalable considering this usage pattern?

> Do we want Ambari to handle only the
> initial install/load and have end users be responsible on an ongoing basis
> for updates (users would be responsible for copying or distributing the
> extractor_config.json for instance), or do we want to enable Ambari to
> manage the configuration ongoing and enable functionality for reloading,
> updating, and rollback?


(3) Whatever solution we land on, it should handle refreshing/reloading the
data on a regular basis.  This is something that has to be done for almost
every useful data source and so should be baked into the solution. I don't
think the functionality is that useful otherwise.

(4) Another thing to consider is extensibility and ease of use.  If we can
make it really easy to provide a means for loading a data source into
Metron, then it is more likely that we will have community members willing
to do that work.

For example, think about the Homebrew project.  They make it stupid simple
to add a new installable package.  You don't have to know how Homebrew
works to contribute a package.  The result is they have tons of packages
available.

Does the Ambari MPack provide the right level of ease of use for that?
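
To make points (2) through (4) a bit more concrete, here is a minimal
sketch of what a package-like approach could look like: each data source is
a small declarative descriptor, operators enable only the ones they care
about, and a scheduler asks which enabled sources are due for a reload.
Everything here (DataSource, SourceRegistry, the field names, the file URIs)
is hypothetical and for illustration only; none of it is existing Metron or
Ambari API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set


@dataclass
class DataSource:
    """Hypothetical descriptor for one loadable data source."""
    name: str
    uri: str                      # where the raw data comes from (assumed field)
    refresh_every: timedelta      # how often the data should be reloaded
    last_loaded: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # A source that has never been loaded is always due.
        if self.last_loaded is None:
            return True
        return now - self.last_loaded >= self.refresh_every


class SourceRegistry:
    """Catalog of available sources plus the subset enabled in this environment."""

    def __init__(self) -> None:
        self._available: Dict[str, DataSource] = {}
        self._enabled: Set[str] = set()

    def register(self, source: DataSource) -> None:
        # Contributing a new source only requires supplying a descriptor.
        self._available[source.name] = source

    def enable(self, name: str) -> None:
        if name not in self._available:
            raise KeyError(f"unknown data source: {name}")
        self._enabled.add(name)

    def mark_loaded(self, name: str, when: datetime) -> None:
        self._available[name].last_loaded = when

    def due(self, now: datetime) -> List[str]:
        # Only enabled sources are considered for refresh.
        return sorted(n for n in self._enabled if self._available[n].is_due(now))


registry = SourceRegistry()
registry.register(DataSource("top-domains", "file:///tmp/top-1m.csv", timedelta(days=1)))
registry.register(DataSource("geo", "file:///tmp/geo.mmdb", timedelta(days=7)))
registry.enable("top-domains")
print(registry.due(datetime(2017, 2, 24, 6, 0)))
```

The point of the sketch is the separation: the catalog can grow large
without every environment having to load, or even see, every source.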





On Tue, Feb 21, 2017 at 6:31 PM, Michael Miklavcic <
[email protected]> wrote:

> With the work committed in
> https://github.com/apache/incubator-metron/pull/445 and
> https://github.com/apache/incubator-metron/pull/432, we now have a robust
> and flexible means to import enrichment sources and transform their
> contents as they are inserted into HBase. One of the main motivators for
> this new functionality was to add the ability to load top domain rankings
> from sources such as Alexa. The proposal is to make this type of enrichment
> a top-level feature in Metron by introducing it to the Ambari management UI
> as a configurable set of properties in the MPack install. This comes with
> some options and challenges in how we want to manage the configurations,
> which I will outline below.
>
> *Use cases:*
>
>    - Single load of top domains file
>    - Re-loading top domains file - need to be able to cleanup properly
>    - Cleaning up/deleting old enrichment data (this is a general feature
>    that we currently lack - I think it is worth a separate Jira/PR for
>    creating a MapReduce job that enables cleanup to occur).
>    - Modifying default top domains file source - there are other options
>    besides Alexa. And users may want to load a file from local URI since
>    many data centers do not have direct access to the internet.
>    - Ability to modify the default extractor config JSON and tune the
>    Stellar transformations for both the value and indicator transforms.
>    Allows more flexible handling of data based on other sources.
>    - Loading multiple top domains source enrichments. (Maybe a separate PR
>    for this if we even think it would be useful)
>    - Updating the top domain enrichment - This needs to be an atomic
>    operation in order to prevent incorrect data.
>    - Rolling back to an older version of the top domains enrichment. Also
>    needs to be atomic.
>    - Ability to schedule an enrichment load - we would like to defer this
>    to an external scheduling mechanism, e.g. cron or Control-M. The
>    enrichment loading system should have the necessary features to enable
>    this type of automation without data integrity issues.
>
> *Considerations:*
>
>    - As mentioned above, we want to add this feature to the Ambari MPack.
>    This requires at least 2 parameters to work. We need the ability to
>    specify a URI as well as an extractor config.
>    - How do we want to manage the extractor config? The most obvious
>    solution is to provide a text field in Ambari with a default JSON
>    config. When a load is initiated, Ambari would place a fresh copy of
>    the extractor config in the /tmp/ directory. This is an ephemeral file
>    that isn't needed other than during a load.
>    - It seems easy enough to have the load occur during the initial
>    install, however subsequent loads would require a different workflow.
>    How do folks feel about adding a set of dropdown options in the Ambari
>    UI for loading, updating, and deleting the top domains enrichment? I
>    believe we are doing something similar for the ElasticSearch templates
>    currently.
>    - In the case of atomic operations for updates and rollbacks, I propose
>    we add a property to Zookeeper that is reference-able in the enrichment
>    itself. The idea would be to create a "top-domains" property in ZK that
>    points to an enrichment key with a load timestamp associated with it,
>    e.g. top-domains_20170221042000. This would also allow a mapreduce job
>    to be written that cleans up old enrichments. Another option is to
>    create a new table in HBase if/when you update the enrichment and
>    change the enrichment config manually. Deleting an old enrichment would
>    simply be a matter of dropping the table in HBase. A relevant
>    discussion of the tradeoffs of having many small tables versus 1 large
>    table can be found here -
>    http://grokbase.com/t/hbase/user/11bjbdw94q/multiple-tables-vs-big-fat-table
>    - In order to update or rollback an enrichment as mentioned above, we
>    would also ideally provide a mechanism for changing the rowkey pointed
>    to by the enrichment.
>
> In summary of the use cases and considerations above, this boils down to
> how we'd like to leverage Ambari here. Do we want Ambari to handle only the
> initial install/load and have end users be responsible on an ongoing basis
> for updates (users would be responsible for copying or distributing the
> extractor_config.json for instance), or do we want to enable Ambari to
> manage the configuration ongoing and enable functionality for reloading,
> updating, and rollback?
>
> Best,
> Mike
>
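
The versioned-pointer proposal in the considerations above (a "top-domains"
property pointing at a timestamped enrichment key such as
top-domains_20170221042000) can be sketched roughly as follows. This is an
in-memory stand-in only: the class VersionedEnrichment and its methods are
hypothetical, a plain attribute plays the role of the ZooKeeper property,
and a dict plays the role of the data loaded into HBase. In a real
deployment the promote and rollback steps would presumably each be a single
write to the ZooKeeper property, with the data for every version already
loaded beforehand.

```python
from typing import Dict, List, Optional


class VersionedEnrichment:
    """In-memory stand-in for a ZK pointer plus versioned enrichment data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._versions: Dict[str, Dict[str, int]] = {}  # versioned key -> rows
        self._history: List[str] = []        # previously current keys, oldest first
        self.current: Optional[str] = None   # plays the role of the ZK property

    def load(self, timestamp: str, rows: Dict[str, int]) -> str:
        # Loading writes under a new versioned key and does not affect readers.
        key = f"{self.name}_{timestamp}"
        self._versions[key] = dict(rows)
        return key

    def promote(self, key: str) -> None:
        # Readers switch versions in one step: only the pointer changes.
        if key not in self._versions:
            raise KeyError(key)
        if self.current is not None:
            self._history.append(self.current)
        self.current = key

    def rollback(self) -> str:
        # Point back at the most recent previously current version.
        if not self._history:
            raise RuntimeError("no earlier version to roll back to")
        self.current = self._history.pop()
        return self.current

    def lookup(self, domain: str) -> Optional[int]:
        if self.current is None:
            return None
        return self._versions[self.current].get(domain)

    def cleanup(self, keep_previous: int = 1) -> List[str]:
        # Delete every version except the current one and the last N previous ones.
        kept_history = self._history[-keep_previous:] if keep_previous > 0 else []
        keep = set(kept_history)
        if self.current is not None:
            keep.add(self.current)
        stale = sorted(k for k in self._versions if k not in keep)
        for k in stale:
            del self._versions[k]
        self._history = kept_history
        return stale
```

A load never touches the version that readers are currently using, so a
failed or partial load can simply be discarded by cleanup without ever
having been visible.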
