[DISCUSS] Generating and Interacting with serialized summary objects

Casey Stella Sun, 24 Dec 2017 17:21:07 -0800

Hi all,

I wanted to get some feedback on a plan for generating and using
serialized summary objects.  It occurred to me the other day, while
considering the use-case of detecting typosquatted domains, that one
approach is to generate the set of typosquatted domains for some set of
reference domains and compare domains against that set as they flow
through.
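
As a rough illustration of "generate the set of typosquatted domains", here
is a minimal Python sketch.  It is not the code from the use-case; the
function name, the alphabet, and the choice of edit types (omission,
repetition, substitution, transposition) are assumptions made for the
example, and it assumes a simple name.tld domain.

```python
import string

# Characters tried for single-character substitutions (an assumption).
ALPHABET = string.ascii_lowercase + string.digits + "-"


def typosquat_candidates(domain):
    """Return single-edit variants of the left-most label of `domain`.

    Hypothetical helper: assumes `domain` looks like "name.tld".
    """
    name, _, suffix = domain.partition(".")
    variants = set()
    for i in range(len(name)):
        # Omission: drop one character.
        variants.add(name[:i] + name[i + 1:])
        # Repetition: double one character.
        variants.add(name[:i] + name[i] + name[i:])
        # Substitution: replace one character.
        for c in ALPHABET:
            variants.add(name[:i] + c + name[i + 1:])
    for i in range(len(name) - 1):
        # Transposition: swap two adjacent characters.
        variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    # The reference domain itself is not a typosquat.
    variants.discard(name)
    variants.discard("")
    return {v + "." + suffix for v in variants}
```

Domains flowing through would then be checked for membership in the union
of these sets over all reference domains.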


One way we could do this would be to generate this data and import the
typosquatted domains into HBase.  I thought, however, that another
approach, which may trade off accuracy to remove the network hop and
potential disk seek, would be to construct a bloom filter that includes
the set of typosquatted domains.
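
To make the accuracy trade-off concrete, here is a minimal, self-contained
bloom filter sketch in Python.  It is an illustration only, not the
implementation used in Metron; the class and method names are assumptions.
The key property is that lookups are in-memory and never produce false
negatives, but may produce false positives at roughly the configured rate.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative bloom filter; names and sizing choices are assumptions."""

    def __init__(self, expected_items, false_positive_rate):
        n = max(1, expected_items)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            8, math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2))
        )
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

A hit from might_contain would flag a domain as a possible typosquat; an
exact check against HBase could still be layered on top where false
positives matter.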

The challenge was that we don't have a way to do this currently.  We do,
however, have a loading infrastructure (e.g. the flatfile_loader) and
configuration (see
https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
which handles:

   - parsing flat files
   - transforming the rows
   - filtering the rows

To enable the new use-case of generating a summary object (e.g. a bloom
filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
propose that we create a new utility that uses the same extractor config
and adds the ability to:

   - initialize a state object
   - update the object for every row
   - merge the state objects (needed only with multiple threads; with a
   single thread there is nothing to merge).
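
The three steps above can be sketched in Python as follows.  This is a
hedged illustration of the initialize/update/merge pattern only: the
function names are hypothetical and are not the extractor config keys or
the API in the PR, and a plain set stands in for the summary object.  With
a bloom filter as the state, the merge of same-sized filters would be a
bitwise OR of their bit arrays rather than a set union.

```python
from concurrent.futures import ThreadPoolExecutor


def init_state():
    # Initialize an empty summary object (a set is a stand-in here).
    return set()


def update_state(state, row):
    # Update the summary object with one (already parsed) row.
    state.add(row.strip().lower())
    return state


def merge_states(states):
    # Combine the per-thread summary objects into one.
    merged = init_state()
    for s in states:
        merged |= s
    return merged


def summarize(rows, num_threads=2):
    # Give each worker its own slice of rows and its own state object.
    chunks = [rows[i::num_threads] for i in range(num_threads)]

    def summarize_chunk(chunk):
        state = init_state()
        for row in chunk:
            state = update_state(state, row)
        return state

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(summarize_chunk, chunks))
    # With a single thread there is only one partial, so the merge is trivial.
    return merge_states(partials)
```

Because each worker owns its state until the merge, no locking is needed
during the update step.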

I think this is a sensible decision because:

   - It's a minimal departure from the flat file loader
      - Uses the same configs
      - Abstracts and reuses the existing infrastructure
   - Having one extractor config means that it should be easier to generate
   a UI around this to simplify the experience

All that being said, our extractor config is...shall we say...daunting :).
I am sensitive to the fact that this adds to an already difficult config.
I propose this as an initial step forward to support the use-case; we can
enable something more composable going forward.  My concern with making
the composable approach the first step was that composable units for data
transformation and manipulation suddenly take us into a place where
Stellar starts to look like Pig or the Spark RDD API.  I wasn't ready for
that without a lot more discussion.

To summarize, what I'd like to get from the community is, after reviewing
the entire use-case at
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection
:

   - Is this so confusing that it does not belong in Metron even as a
   first-step?
   - Is there a way to extend the extractor config in a less confusing way
   to enable this?

I apologize for making the discuss thread *after* the JIRAs, but I felt
this one might bear having some working code to consider.
