Re: [DISCUSS] Generating and Interacting with serialized summary objects

Otto Fowler Sun, 24 Dec 2017 17:54:07 -0800

1st.  You are not only one looking at the list on Dec 24th Casey, so don’t
feel bad.



2nd.  Maybe we can separate this into 2 areas of concern.

1. Stellar can load objects into ‘caches’ from some repository and refer to
them.
2. The repositories
3. Some number of strategies to populate and possibly update the
repository, from spark,
to MR jobs to whatever you would classify the flat file stuff as.

wait, separate this into 3, 3 areas of concern!

1. Stellar can load objects into ‘caches’ from some repository and refer to
them.
2. The repositories
3. Some number of strategies to populate and possibly update the
repository, from spark,
to MR jobs to whatever you would classify the flat file stuff as.
4. Let the Stellar API for everything but LOAD() follow after we get usage

4!  4 areas of concern..

wait, I’ll write another reply….




(Nobody expected that!)



On December 24, 2017 at 20:47:17, Casey Stella (ceste...@gmail.com) wrote:

Oh, one more thing, while the example here is around typosquatting, this is
of use outside of that. Pretty much any large existence-style query can be
enabled via this construction (create a summary bloom filter). There are
other use-cases involving other data structures too.

On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:

> Hi all,
>
> I wanted to get some feedback on a sensible plan for something. It
> occurred to me the other day when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare
domains
> as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase. I thought, however, that another
approach
> which may trade-off accuracy to remove the network hop and potential disk
> seek by constructing a bloom filter that includes the set of typosquatted
> domains.
>
> The challenge was that we don't have a way to do this currently. We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see https://github.com/apache/metron/tree/master/metron-
> platform/metron-data-management#common-extractor-properties) which
> handles:
>
> - parsing flat files
> - transforming the rows
> - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> add the ability to:
>
> - initialize a state object
> - update the object for every row
> - merge the state objects (in the case of multiple threads, in the
> case of one thread it's not needed).
>
> I think this is a sensible decision because:
>
> - It's a minimal movement from the flat file loader
> - Uses the same configs
> - Abstracts and reuses the existing infrastructure
> - Having one extractor config means that it should be easier to
> generate a UI around this to simplify the experience
>
> All that being said, our extractor config is..shall we say...daunting :).
> I am sensitive to the fact that this adds to an existing difficult
config.
> I propose that this is an initial step forward to support the use-case
and
> we can enable something more composable going forward. My concern in
> considering this as the first step was that it felt that the composable
> units for data transformation and manipulation suddenly takes us into a
> place where Stellar starts to look like Pig or Spark RDD API. I wasn't
> ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community is, after reviewing
> the entire use-case at https://github.com/cestella/incubator-metron/tree/
> typosquat_merge/use-cases/typosquat_detection:
>
> - Is this so confusing that it does not belong in Metron even as a
> first-step?
> - Is there a way to extend the extractor config in a less confusing
> way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>

Re: [DISCUSS] Generating and Interacting with serialized summary objects

Reply via email to