Re: [DISCUSS] Generating and Interacting with serialized summary objects

Nick Allen Wed, 03 Jan 2018 09:22:06 -0800

Oh, gotcha.  That makes sense.  Thanks for clarifying.

On Wed, Jan 3, 2018 at 12:15 PM, Casey Stella <ceste...@gmail.com> wrote:


> It's actually many more than 1M.  There are 1M domains, and each domain could
> have anywhere from 300 to 1000 possible typosquatted domains.
>
> You will notice from
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter
> that we are not adding the domain to the bloom filter, we're adding each
> domain generated from DOMAIN_TYPOSQUAT to the bloom filter.  In fact, we
> would very specifically NOT want the base domain as that would not be an
> indication of typosquatting (going to google.com would be legit, going to
> goggle.com would not).
>
>
>
> On Wed, Jan 3, 2018 at 10:48 AM, Nick Allen <n...@nickallen.org> wrote:
>
> > > Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
> > > this will impact performance
> >
> > What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> > seems really high, unless I am not understanding something.
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <ceste...@gmail.com> wrote:
> >
> > > Thanks for the feedback, Nick.
> > >
> > > Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
> > >
> > > I would argue that we are not reinventing the wheel for text manipulation,
> > > as the extractor config exists already and we are doing a similar thing in
> > > the flatfile loader (in fact, the code is reused and merely extended).
> > > Transformation operations are already supported in our codebase in the
> > > extractor config; this PR has just added some hooks for stateful
> > > operations.
> > >
> > > Furthermore, we will need a configuration object to pass to the REST call
> > > if we are ever to create a UI around importing data into HBase or creating
> > > these summary objects.
> > >
> > > Regarding your example:
> > > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > >
> > > I'm very sympathetic to this type of extension, but it has some issues:
> > >
> > >    1. This implies a single-threaded addition to the bloom filter.
> > >       1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I
> > >       think this will impact performance
> > >       2. There's not a way to specify how to merge across threads if we do
> > >       make a multithreaded command line option
> > >    2. This restricts these kinds of operations to roles with heavy Unix CLI
> > >    knowledge, which often isn't the kind of person who would be doing this
> > >    type of operation
> > >    3. What if we need two variables passed to Stellar?
> > >    4. This approach will be harder to move to Hadoop.  Eventually we will
> > >    want to support data on HDFS being processed by Hadoop (similar to the
> > >    flatfile loader), so instead of -m LOCAL being passed for the flatfile
> > >    summarizer you'd pass -m SPARK and the processing would happen on the
> > >    cluster
> > >       1. This is particularly relevant in this case as it's an
> > >       embarrassingly parallel problem in general
> > >
> > > In summary, while the CLI approach is attractive, I prefer the extractor
> > > config solution because it is the solution with the smallest iteration
> > > that:
> > >
> > >    1. Reuses existing Metron extraction infrastructure
> > >    2. Provides the most solid base for the extensions that will be sorely
> > >    needed soon (and will keep it at parity with the flatfile loader)
> > >    3. Provides the most solid base for a future UI extension in the
> > >    management UI to support both summarization and loading
> > >
> > >
> > >
> > >
> > > On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <n...@nickallen.org> wrote:
> > >
> > > > First off, I really do like the typosquatting use case and a lot of what
> > > > you have described.
> > > >
> > > > > We need a way to generate the summary sketches from flat data for this
> > > > > to work.
> > > > > ..
> > > > >
> > > >
> > > > I took this quote directly from your use case.  Above is the point that I'd
> > > > like to discuss and what your proposed solutions center on.  This is what I
> > > > think you are trying to do, at least with PR #879
> > > > <https://github.com/apache/metron/pull/879>...
> > > >
> > > > (Q) Can we repurpose Stellar functions so that they can operate on text
> > > > stored in a file system?
> > > >
> > > >
> > > > Whether we use the (1) Configuration or the (2) Function-based approach
> > > > that you described, fundamentally we are introducing new ways to perform
> > > > text manipulation inside of Stellar.
> > > >
> > > > IMHO, I'd rather not reinvent the wheel for text manipulation.  It would be
> > > > painful to implement and maintain a bunch of Stellar functions for text
> > > > manipulation.  People already have a large number of tools available to do
> > > > this and everyone has their favorites.  People are resistant to learning
> > > > something new when they are already familiar with another way to do the
> > > > same thing.
> > > >
> > > > So then the question is, how else can we do this?  My suggestion is that
> > > > rather than introducing text manipulation tools inside of Stellar, we allow
> > > > people to use the text manipulation tools they already know, but with the
> > > > Stellar functions that we already have.  And the obvious way to tie those
> > > > two things together is the Unix pipeline.
> > > >
> > > > A quick, albeit horribly incomplete, example to flesh this out a bit more
> > > > based on the example you have in PR #879
> > > > <https://github.com/apache/metron/pull/879>.  This would allow me to
> > > > integrate Stellar with whatever external tools I want.
> > > >
> > > > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > > > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > >
> > > > > I'll start this discussion off with my idea around a 2nd step that is more
> > > > > adaptable.  I propose the following set of Stellar functions backed by
> > > > > Spark in the metron-management project:
> > > > >
> > > > >    - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
> > > > >    Dataframe for reading the flatfile
> > > > >    - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> > > > >    dataframe
> > > > >    - SUMMARIZE(state_init, state_update, state_merge): Summarize the
> > > > >    dataframe using the lambda functions:
> > > > >       - state_init - executed once per worker to initialize the state
> > > > >       - state_update - executed once per row
> > > > >       - state_merge - Merge the worker states into one worker state
> > > > >    - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
> > > > >    output_path on HDFS.
> > > > >
> > > > > This would enable more flexibility and composability than the
> > > > > configuration-based approach that we have in the flatfile loader.
> > > > > My concern with this approach, and the reason I didn't do it initially, was
> > > > > that I think that users will want at least 2 ways to summarize data (or
> > > > > load data):
> > > > >
> > > > >    - A configuration-based approach, which enables a UI
> > > > >    - A set of Stellar functions via the scriptable REPL
> > > > >
> > > > > I would argue that both have a place, and I started with the
> > > > > configuration-based approach as it was a more natural extension of what we
> > > > > already had.  I'd love to hear thoughts about this idea too.
> > > > >
> > > > >
> > > > > On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I wanted to get some feedback on a sensible plan for something.  It
> > > > > > occurred to me the other day, when considering the use-case of detecting
> > > > > > typosquatted domains, that one approach was to generate the set of
> > > > > > typosquatted domains for some set of reference domains and compare
> > > > > > domains as they flow through.
> > > > > >
> > > > > > One way we could do this would be to generate this data and import the
> > > > > > typosquatted domains into HBase.  I thought, however, that another
> > > > > > approach, which may trade off accuracy to remove the network hop and
> > > > > > potential disk seek, would be to construct a bloom filter that includes
> > > > > > the set of typosquatted domains.
> > > > > >
> > > > > > The challenge was that we don't have a way to do this currently.  We do,
> > > > > > however, have a loading infrastructure (e.g. the flatfile_loader) and
> > > > > > configuration (see
> > > > > > https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> > > > > > which handles:
> > > > > >
> > > > > >    - parsing flat files
> > > > > >    - transforming the rows
> > > > > >    - filtering the rows
> > > > > >
> > > > > > To enable the new use-case of generating a summary object (e.g. a bloom
> > > > > > filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> > > > > > propose that we create a new utility that uses the same extractor config
> > > > > > and adds the ability to:
> > > > > >
> > > > > >    - initialize a state object
> > > > > >    - update the object for every row
> > > > > >    - merge the state objects (needed only with multiple threads; with a
> > > > > >    single thread there is nothing to merge).
> > > > > >
> > > > > > I think this is a sensible decision because:
> > > > > >
> > > > > >    - It's a minimal movement from the flat file loader
> > > > > >       - Uses the same configs
> > > > > >       - Abstracts and reuses the existing infrastructure
> > > > > >    - Having one extractor config means that it should be easier to
> > > > > >    generate a UI around this to simplify the experience
> > > > > >
> > > > > > All that being said, our extractor config is...shall we say...daunting :).
> > > > > > I am sensitive to the fact that this adds to an already difficult config.
> > > > > > I propose that this is an initial step forward to support the use-case and
> > > > > > we can enable something more composable going forward.  My concern in
> > > > > > considering this as the first step was that it felt that the composable
> > > > > > units for data transformation and manipulation suddenly take us into a
> > > > > > place where Stellar starts to look like Pig or the Spark RDD API.  I wasn't
> > > > > > ready for that without a lot more discussion.
> > > > > >
> > > > > > To summarize, what I'd like to get from the community is, after reviewing
> > > > > > the entire use-case at
> > > > > > https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> > > > > >
> > > > > >    - Is this so confusing that it does not belong in Metron even as a
> > > > > >    first-step?
> > > > > >    - Is there a way to extend the extractor config in a less confusing
> > > > > >    way to enable this?
> > > > > >
> > > > > > I apologize for making the discuss thread *after* the JIRAs, but I felt
> > > > > > this one might bear having some working code to consider.
> > > > > >
> > > > >
> > > >
> > >
> >
>
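
For illustration, the initialize/update/merge steps discussed in this thread
can be sketched in Python.  This is only a rough sketch under stated
assumptions, not Metron's implementation: BloomFilter, typosquat_candidates
(a stand-in for DOMAIN_TYPOSQUAT that only does single-character
substitutions), update and summarize are all hypothetical names, and the
filter size and hash count are arbitrary.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class BloomFilter:
    """Tiny illustrative bloom filter (hypothetical, not Metron's)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Two filters with identical parameters merge with a bitwise OR.
        self.bits |= other.bits
        return self


def typosquat_candidates(domain):
    """Assumed stand-in for DOMAIN_TYPOSQUAT: one-character substitutions."""
    name, _, tld = domain.partition(".")
    candidates = set()
    for i, original in enumerate(name):
        for repl in "abcdefghijklmnopqrstuvwxyz":
            if repl != original:  # never emit the base domain itself
                candidates.add(name[:i] + repl + name[i + 1:] + "." + tld)
    return candidates


def update(bloom, domain):
    # state_update: add every generated candidate, but not the base domain.
    for candidate in typosquat_candidates(domain):
        bloom.add(candidate)
    return bloom


def summarize(rows, state_init, state_update, state_merge, num_threads=5):
    """Run init once per worker, update once per row, then merge the states."""
    chunks = [rows[i::num_threads] for i in range(num_threads)]

    def work(chunk):
        state = state_init()
        for row in chunk:
            state = state_update(state, row)
        return state

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        states = list(pool.map(work, chunks))

    merged = states[0]
    for state in states[1:]:
        merged = state_merge(merged, state)
    return merged


domains = ["google.com", "github.com", "apache.org"]
bloom = summarize(domains, BloomFilter, update, BloomFilter.merge)
print(bloom.might_contain("goggle.com"))  # True: it was added as a candidate
```

Each worker owns its own filter, so no locking is needed, and the merge step
is what a plain Unix pipeline has no obvious place for.  The base domain is
never added, so a lookup of google.com should miss barring a bloom filter
false positive.  Python threads are used here only to show the structure;
they would not speed up CPU-bound hashing.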
