First off, I really do like the typosquatting use case and a lot of what
you have described.

> We need a way to generate the summary sketches from flat data for this to
> work.
> ...

I took the quote above directly from your use case.  It is the point that I'd
like to discuss, and the one your proposed solutions center on.  Here is what
I think you are trying to do, at least with PR #879
<https://github.com/apache/metron/pull/879>:

(Q) Can we repurpose Stellar functions so that they can operate on text
stored in a file system?


Whether we use the (1) Configuration or the (2) Function-based approach
that you described, fundamentally we are introducing new ways to perform
text manipulation inside of Stellar.

IMHO, I'd rather not reinvent the wheel for text manipulation.  It would be
painful to implement and maintain a bunch of Stellar functions for text
manipulation.  People already have a large number of tools available to do
this and everyone has their favorites.  People are resistant to learning
something new when they already are familiar with another way to do the
same thing.

So then the question is, how else can we do this?  My suggestion is that
rather than introducing text manipulation tools inside of Stellar, we allow
people to use the text manipulation tools they already know, but with the
Stellar functions that we already have.  And the obvious way to tie those
two things together is the Unix pipeline.

Here is a quick, albeit horribly incomplete, example to flesh this out a bit
more, based on the example you have in PR #879
<https://github.com/apache/metron/pull/879>.  This would allow me to
integrate Stellar with whatever external tools I want.

$ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' \
    | stellar -i 'DOMAIN_REMOVE_TLD(_)' \
    | stellar -i 'BLOOM_ADD(_)'
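
To make the shape of those two stages a little more concrete, here is a rough
Python sketch of what a per-line "map" stage and a summarizing "sink" stage
might do.  This is not Stellar and not how Metron implements anything: the
names remove_tld, map_lines, BloomFilter and summarize are made up for
illustration, the TLD stripping is deliberately naive, and the bloom filter is
a toy.  It only shows that each stage can be a small filter over lines of
text.

```python
import hashlib
import io
from typing import Callable, Iterable, Iterator


def remove_tld(domain: str) -> str:
    """Naive stand-in for DOMAIN_REMOVE_TLD: drop the last dot-separated label.

    Assumption: multi-label suffixes such as "co.uk" are NOT handled here.
    """
    head, sep, _ = domain.rpartition(".")
    return head if sep else domain


def map_lines(lines: Iterable[str], fn: Callable[[str], str]) -> Iterator[str]:
    """Apply fn to every non-empty line, the way a pipeline stage would."""
    for line in lines:
        value = line.strip()
        if value:
            yield fn(value)


class BloomFilter:
    """Toy bloom filter for illustration only."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def summarize(lines: Iterable[str]) -> BloomFilter:
    """Chain the two stages: strip the TLD, then add each name to the filter."""
    bloom = BloomFilter()
    for name in map_lines(lines, remove_tld):
        bloom.add(name)
    return bloom


# Example input standing in for the output of the awk/sed steps.
sample = io.StringIO("google.com\n\nexample.org\n")
bloom = summarize(sample)
```

In a real pipeline the input would come from stdin rather than an in-memory
string, and the final stage would need some way to write the summary object
out, which is the part the one-liner above glosses over.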

On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:

> I'll start this discussion off with my idea around a 2nd step that is more
> adaptable.  I propose the following set of stellar functions backed by
> Spark in the metron-management project:
>
>    - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
>    Dataframe for reading the flatfile
>    - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
>    dataframe
>    - SUMMARIZE(state_init, state_update, state_merge): Summarize the
>    dataframe using the lambda functions:
>       - state_init - executed once per worker to initialize the state
>       - state_update - executed once per row
>       - state_merge - Merge the worker states into one worker state
>    - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
>    output_path on HDFS.
>
> This would enable more flexibility and composability than the
> configuration-based approach that we have in the flatfile loader.
> My concern with this approach, and the reason I didn't do it initially, was
> that I think that users will want at least 2 ways to summarize data (or
> load data):
>
>    - A configuration-based approach, which enables a UI
>    - A set of stellar functions via the scriptable REPL
>
> I would argue that both have a place, and I started with the
> configuration-based approach as it was a more natural extension of what we
> already had.
> I'd love to hear thoughts about this idea too.
>
>
> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
>
> > Hi all,
> >
> > I wanted to get some feedback on a sensible plan for something.  It
> > occurred to me the other day when considering the use-case of detecting
> > typosquatted domains, that one approach was to generate the set of
> > typosquatted domains for some set of reference domains and compare
> > domains as they flow through.
> >
> > One way we could do this would be to generate this data and import the
> > typosquatted domains into HBase.  I thought, however, that another
> > approach, which may trade off accuracy to remove the network hop and
> > potential disk seek, would be to construct a bloom filter that includes
> > the set of typosquatted domains.
> >
> > The challenge was that we don't have a way to do this currently.  We do,
> > however, have a loading infrastructure (e.g. the flatfile_loader) and
> > configuration (see
> > https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> > which handles:
> >
> >    - parsing flat files
> >    - transforming the rows
> >    - filtering the rows
> >
> > To enable the new use-case of generating a summary object (e.g. a bloom
> > filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> > propose that we create a new utility that uses the same extractor config
> > and adds the ability to:
> >
> >    - initialize a state object
> >    - update the object for every row
> >    - merge the state objects (needed only with multiple threads; with a
> >    single thread there is nothing to merge).
> >
> > I think this is a sensible decision because:
> >
> >    - It's a minimal movement from the flat file loader
> >       - Uses the same configs
> >       - Abstracts and reuses the existing infrastructure
> >    - Having one extractor config means that it should be easier to
> >    generate a UI around this to simplify the experience
> >
> > All that being said, our extractor config is... shall we say... daunting :).
> > I am sensitive to the fact that this adds to an already difficult config.
> > I propose that this is an initial step forward to support the use-case, and
> > we can enable something more composable going forward.  My concern in
> > considering this as the first step was that it felt that the composable
> > units for data transformation and manipulation suddenly take us into a
> > place where Stellar starts to look like Pig or the Spark RDD API.  I wasn't
> > ready for that without a lot more discussion.
> >
> > To summarize, what I'd like to get from the community is, after reviewing
> > the entire use-case at
> > https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> >
> >    - Is this so confusing that it does not belong in Metron even as a
> >    first-step?
> >    - Is there a way to extend the extractor config in a less confusing
> >    way to enable this?
> >
> > I apologize for making the discuss thread *after* the JIRAs, but I felt
> > this one might bear having some working code to consider.
> >
>
