Re: [DISCUSS] Generating and Interacting with serialized summary objects

Casey Stella Sun, 24 Dec 2017 17:48:07 -0800

Oh, one more thing: while the example here is about typosquatting, this is
useful beyond that.  Pretty much any large existence-style query can be
enabled via this construction (create a summary bloom filter).  There are
other use-cases involving other data structures too.


On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <[email protected]> wrote:

> Hi all,
>
> I wanted to get some feedback on a sensible plan for something.  It
> occurred to me the other day when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare domains
> as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase.  I thought, however, of another approach,
> which may trade off accuracy to remove the network hop and potential disk
> seek: constructing a bloom filter that includes the set of typosquatted
> domains.
>
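
To make the bloom filter idea above concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: the BloomFilter class, the
typosquats() generator (character omission and adjacent transposition only),
and the sizing parameters are all hypothetical and are not what METRON-1378
implements.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical, not Metron's)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def typosquats(domain):
    """Hypothetical typo generator: drop one character or swap two neighbours."""
    name, dot, tld = domain.partition(".")
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:] + dot + tld)  # omission
    for i in range(len(name) - 1):
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        variants.add(swapped + dot + tld)  # adjacent transposition
    # Drop the original domain and any variant with an empty name.
    return {v for v in variants if v != domain and not v.startswith(".")}


# Build the summary once, offline, from the reference domains.
summary = BloomFilter()
for reference in ["apache.org", "github.com"]:
    for squat in typosquats(reference):
        summary.add(squat)

# At enrichment time the lookup is purely in-memory: no network hop, no disk seek.
print(summary.might_contain("gihtub.com"))
```

The accuracy trade-off is the false-positive rate: a domain that was never
added can occasionally test as present, while a domain that was added always
tests as present.
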
> The challenge was that we don't have a way to do this currently.  We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see
> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> which handles:
>
>    - parsing flat files
>    - transforming the rows
>    - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
>    - initialize a state object
>    - update the object for every row
>    - merge the state objects (needed only with multiple threads; with a
>    single thread there is nothing to merge).
>
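
The three steps above (initialize, update per row, merge across threads) can
be sketched roughly as follows.  This is a hedged Python illustration, not the
extractor config or the utility from the PR: the function names are made up,
and a plain set stands in for the summary object (a bloom filter would be
merged by OR-ing its bits in the same place).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def init_state():
    # Hypothetical summary object: a set standing in for a bloom filter.
    return set()


def update_state(state, row):
    # Fold one (already parsed, transformed, filtered) row into the state.
    state.add(row.strip().lower())
    return state


def merge_states(left, right):
    # Combine two partial summaries into one.
    return left | right


def summarize(rows, num_threads=2):
    # One chunk of rows per thread; each thread builds its own state.
    chunks = [rows[i::num_threads] for i in range(num_threads)]

    def build(chunk):
        state = init_state()
        for row in chunk:
            state = update_state(state, row)
        return state

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(build, chunks))
    # With a single thread there is one partial state and the merge is trivial.
    return reduce(merge_states, partials, init_state())
```

For example, summarize(["A.com", "b.com ", "a.com"], num_threads=2) builds two
partial sets and merges them into one.  Because each thread owns its state and
the merge happens after all threads finish, the per-row update needs no
locking.
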
> I think this is a sensible decision because:
>
>    - It's a minimal movement from the flat file loader
>       - Uses the same configs
>       - Abstracts and reuses the existing infrastructure
>    - Having one extractor config means that it should be easier to
>    generate a UI around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an already difficult config.
> I propose that this is an initial step forward to support the use-case and
> we can enable something more composable going forward.  My concern in
> considering this as the first step was that it felt like the composable
> units for data transformation and manipulation suddenly take us into a
> place where Stellar starts to look like Pig or the Spark RDD API.  I wasn't
> ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community is, after reviewing
> the entire use-case at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>
>    - Is this so confusing that it does not belong in Metron even as a
>    first-step?
>    - Is there a way to extend the extractor config in a less confusing
>    way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>
