I'll start this discussion off with my idea for a second step that is more
adaptable.  I propose the following set of Stellar functions backed by
Spark in the metron-management project:

   - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
   DataFrame for reading the flatfile
   - SQL_TRANSFORM(dataframe, spark sql statement) : Transforms the
   DataFrame
   - SUMMARIZE(state_init, state_update, state_merge) : Summarizes the
   DataFrame using the lambda functions:
      - state_init - executed once per worker to initialize the state
      - state_update - executed once per row to update the state
      - state_merge - merges the worker states into a single state
   - OBJECT_SAVE(obj, output_path) : Saves the object obj to the path
   output_path on HDFS.
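
As a rough illustration of the SUMMARIZE semantics, the three lambdas could
behave like the Python sketch below.  This is not Stellar or Spark code and
not the proposed implementation: the summarize helper, the partitions list,
and the sample domains are hypothetical, and a plain set stands in for a
real summary object such as a bloom filter.

```python
from functools import reduce


def summarize(partitions, state_init, state_update, state_merge):
    """Hypothetical stand-in for SUMMARIZE: fold each partition, then merge."""
    worker_states = []
    for rows in partitions:
        state = state_init()                  # executed once per worker
        for row in rows:
            state = state_update(state, row)  # executed once per row
        worker_states.append(state)
    # Combine the per-worker states into one; starting from a fresh state
    # keeps this well-defined even when there are no partitions at all.
    return reduce(state_merge, worker_states, state_init())


# Illustrative data only: two "workers", each holding rows with a domain column.
partitions = [
    [{"domain": "gooogle.com"}, {"domain": "amazn.com"}],
    [{"domain": "amazn.com"}, {"domain": "facebok.com"}],
]

domains = summarize(
    partitions,
    state_init=lambda: set(),
    state_update=lambda state, row: state | {row["domain"]},
    state_merge=lambda left, right: left | right,
)
```

The merge step only matters when more than one worker produces a state; with
a single worker it just combines that state with an empty one.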

This would enable more flexibility and composability than the
configuration-based approach that we have in the flatfile loader.
My concern with this approach, and the reason I didn't do it initially, is
that I think users will want at least two ways to summarize data (or
load data):

   - A configuration-based approach, which enables a UI
   - A set of Stellar functions via the scriptable REPL

I would argue that both have a place, and I started with the
configuration-based approach as it was a more natural extension of what we
already had.
I'd love to hear thoughts about this idea too.


On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:

> Hi all,
>
> I wanted to get some feedback on a sensible plan for something.  It
> occurred to me the other day when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare domains
> as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase.  I thought, however, that another approach
> might be to trade off some accuracy in order to remove the network hop and
> potential disk seek by constructing a bloom filter that includes the set of
> typosquatted domains.
>
> The challenge was that we don't have a way to do this currently.  We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see
> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> which handle:
>
>    - parsing flat files
>    - transforming the rows
>    - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
>    - initialize a state object
>    - update the object for every row
>    - merge the state objects (needed only with multiple threads; with a
>    single thread there is nothing to merge).
>
> I think this is a sensible decision because:
>
>    - It's a minimal movement from the flat file loader
>       - Uses the same configs
>       - Abstracts and reuses the existing infrastructure
>    - Having one extractor config means that it should be easier to
>    generate a UI around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an already difficult config.
> I propose this as an initial step forward to support the use-case, and
> we can enable something more composable going forward.  My concern in
> considering the composable approach as the first step was that composable
> units for data transformation and manipulation suddenly take us into a
> place where Stellar starts to look like Pig or the Spark RDD API.  I wasn't
> ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community is, after reviewing
> the entire use-case at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>
>    - Is this so confusing that it does not belong in Metron even as a
>    first-step?
>    - Is there a way to extend the extractor config in a less confusing
>    way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>
