Re: [DISCUSS] Generating and Interacting with serialized summary objects

James Sirota Wed, 03 Jan 2018 22:41:33 -0800

I just went through these pull requests as well and also agree this is good 
work.  I think it's a good first pass.  I would be careful with trying to boil 
the ocean here.  I think for the initial use case I would only support loading 
the bloom filters from HDFS.  If people want to pre-process the CSV file of 
domains using awk or sed, that should be out of scope for this work.  It's easy 
enough to do out of band, and I would not include any of these functions at all.
I also think that the config could be considerably simplified.  I think 
value_filter should be removed (since I believe that preprocessing should be 
done by the user outside of this process).  I also have a question about the 
init, update, and merge configurations.  Would I ever initialize to anything 
but an empty bloom filter?  For the state update would I ever do anything other 
than add to the bloom filter?  For the state merge would I ever do anything 
other than merge the states?  If the answer to these is 'no', then this should 
simply be hard coded and not externalized into config values.
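
To make the question concrete, here is a rough sketch of what hard coding 
those three steps could look like.  This is not Metron code: it is plain 
Python, and the BloomFilter class, the summarize function, and the size and 
hash-count defaults are all hypothetical names and values chosen only for 
illustration.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; the default sizes are arbitrary assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Union of two filters that were built with the same parameters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("cannot merge filters with different parameters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def summarize(partitions):
    """Hard-coded init / update / merge, with one partial filter per partition."""
    result = BloomFilter()              # init: always an empty filter
    for rows in partitions:
        partial = BloomFilter()         # init, once per worker
        for row in rows:
            partial.add(row)            # update: always an add
        result = result.merge(partial)  # merge: always a union
    return result
```

With something like this, summarize([["google", "amazon"], ["github"]]) would 
return one filter covering all three names, and nothing about init, update, 
or merge would need to appear in the config.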


03.01.2018, 14:20, "Michael Miklavcic" <michael.miklav...@gmail.com>:
> I just finished stepping through the typosquatting use case README in your
> merge branch. This is really, really good work Casey. I see most of our
> previous documentation issues addressed up front, e.g. special variables
> are cited, all new fields explained, side effects documented. The use case
> doc brings it all together soup-to-nuts and I think all the pieces make
> sense in a mostly self-contained way. I can't think of anything I had to
> sit and think about for more than a few seconds. I'll be making my way
> through your individual PR's in more detail, but my first impressions are
> that this is excellent.
>
> On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>>  I'm liking this design and growth strategy, Casey. I also think Nick and
>>  Otto have some valid points. I always find there's a natural tension
>>  between too little, just enough, and boiling the ocean and these discuss
>>  threads really help drive what the short and long term visions should look
>>  like.
>>
>>  On the subject of repositories and strategies, I agree that pluggable
>>  repos and strategies for modifying them would be useful. For the first
>>  pass, I'd really like to see HDFS with the proposed set of Stellar
>>  functions. This gives us a lot of bang for our buck - we can capitalize on
>>  a set of powerful features around existence checking earlier without having
>>  to worry about later interface changes impacting users. With the primary
>>  interface coming through the JSON config, we are building a nice facade
>>  that protects users from later implementation abstractions and
>>  improvements, all while providing a stable enough interface on which we can
>>  develop UI features as desired. I'd be interested to hear more about what
>>  features could be provided by a repository as time goes by. Federation,
>>  permissions, governance, metadata management, perhaps?
>>
>>  I also had some concern over duplicating existing Unix features. I think
>>  where I'm at has been largely addressed by Casey's comments on 1) scaling,
>>  2) multiple variables, and 3) portability to Hadoop. Providing two approaches,
>>  one config-based and the other a composable set of functions, gives
>>  us the ability to provide a core set of features that can later be easily
>>  expanded by users as the need arises. Here again I think the prescribed
>>  approach provides a strong first pass that we can then expand on without
>>  concern of future improvements becoming a hassle for end users.
>>
>>  Best,
>>  Mike
>>
>>  On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
>>  si...@simonellistonball.com> wrote:
>>
>>>  There is some really cool stuff happening here, if only I’d been allowed
>>>  to see the lists over Christmas... :)
>>>
>>>  A few thoughts...
>>>
>>>  I like Otto’s generalisation of the problem to include specific local
>>>  stellar objects in a cache loaded from a store (HDFS seems a natural, but
>>>  not only place, maybe even a web service / local microservicey object
>>>  provider!?) That said, I suspect that’s a good platform optimisation
>>>  approach. Should we look at this as a separate piece of work given it
>>>  extends beyond the scope of the summarisation concept and ultimately use it
>>>  as a back-end to feed the summarising engine proposed here for the
>>>  enrichment loader?
>>>
>>>  On the more specific use case, one thing I would comment on is the
>>>  configuration approach. The iteration loop (state_{init|update|merge})
>>>  should be consistent with the way we handle things like the profiler
>>>  config, since it’s the same approach to data handling.
>>>
>>>  The other thing that seems to have crept in here is the interface to
>>>  something like Spark, which again, I am really very very keen on seeing
>>>  happen. That said, not sure how that would happen in this context, unless
>>>  you’re talking about pushing to something like livy for example (eminently
>>>  sensible for things like cross instance caching and faster RPC-ish access
>>>  to an existing spark context, which seems to be what Casey is driving at with
>>>  the spark piece).
>>>
>>>  To address the question of text manipulation in Stellar / metron
>>>  enrichment ingest etc, we already have this outside of the context of the
>>>  issues here. I would argue that yes, we don’t want too many paths for this,
>>>  and that maybe our parser approach might be heavily related to text-based
>>>  ingest. I would say the scope worth dealing with here though is not really
>>>  text manipulation, but summarisation, which is not well served by existing
>>>  CLI tools like awk / sed and friends.
>>>
>>>  Simon
>>>
>>>  > On 3 Jan 2018, at 15:48, Nick Allen <n...@nickallen.org> wrote:
>>>  >
>>>  >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
>>>  >> this will impact performance
>>>  >
>>>  > What exactly takes an hour? Adding 1M entries to a bloom filter? That
>>>  > seems really high, unless I am not understanding something.
>>>  >
>>>  >
>>>  >
>>>  >
>>>  >
>>>  >
>>>  > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <ceste...@gmail.com> wrote:
>>>  >
>>>  >> Thanks for the feedback, Nick.
>>>  >>
>>>  >> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>>>  >>
>>>  >> I would argue that we are not reinventing the wheel for text manipulation,
>>>  >> as the extractor config exists already and we are doing a similar thing in
>>>  >> the flatfile loader (in fact, the code is reused and merely extended).
>>>  >> Transformation operations are already supported in our codebase in the
>>>  >> extractor config; this PR has just added some hooks for stateful
>>>  >> operations.
>>>  >>
>>>  >> Furthermore, we will need a configuration object to pass to the REST call
>>>  >> if we are ever to create a UI around importing data into hbase or creating
>>>  >> these summary objects.
>>>  >>
>>>  >> Regarding your example:
>>>  >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>>>  >> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>>>  >>
>>>  >> I'm very sympathetic to this type of extension, but it has some issues:
>>>  >>
>>>  >> 1. This implies a single-threaded addition to the bloom filter.
>>>  >>    1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
>>>  >>       think this will impact performance
>>>  >>    2. There's not a way to specify how to merge across threads if we do
>>>  >>       make a multithread command line option
>>>  >> 2. This restricts these kinds of operations to roles with heavy unix CLI
>>>  >>    knowledge, which isn't often the types of people who would be doing this
>>>  >>    type of operation
>>>  >> 3. What if we need two variables passed to stellar?
>>>  >> 4. This approach will be harder to move to Hadoop. Eventually we will
>>>  >>    want to support data on HDFS being processed by Hadoop (similar to
>>>  >>    flatfile loader), so instead of -m LOCAL being passed for the flatfile
>>>  >>    summarizer you'd pass -m SPARK and the processing would happen on the
>>>  >>    cluster
>>>  >>    1. This is particularly relevant in this case as it's an
>>>  >>       embarrassingly parallel problem in general
>>>  >> In summary, while this CLI approach is attractive, I prefer the extractor
>>>  >> config solution because it is the solution with the smallest iteration
>>>  >> that:
>>>  >>
>>>  >> 1. Reuses existing metron extraction infrastructure
>>>  >> 2. Provides the most solid base for the extensions that will be sorely
>>>  >>    needed soon (and will keep it in parity with the flatfile loader)
>>>  >> 3. Provides the most solid base for a future UI extension in the
>>>  >>    management UI to support both summarization and loading
>>>  >>
>>>  >>
>>>  >>
>>>  >>
>>>  >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <n...@nickallen.org> wrote:
>>>  >>
>>>  >>> First off, I really do like the typosquatting use case and a lot of what
>>>  >>> you have described.
>>>  >>>
>>>  >>>> We need a way to generate the summary sketches from flat data for this to
>>>  >>>> work.
>>>  >>>> ..
>>>  >>>>
>>>  >>>
>>>  >>> I took this quote directly from your use case. Above is the point that I'd
>>>  >>> like to discuss and what your proposed solutions center on. This is what I
>>>  >>> think you are trying to do, at least with PR #879
>>>  >>> <https://github.com/apache/metron/pull/879>...
>>>  >>>
>>>  >>> (Q) Can we repurpose Stellar functions so that they can operate on text
>>>  >>> stored in a file system?
>>>  >>>
>>>  >>>
>>>  >>> Whether we use the (1) Configuration or the (2) Function-based approach
>>>  >>> that you described, fundamentally we are introducing new ways to perform
>>>  >>> text manipulation inside of Stellar.
>>>  >>>
>>>  >>> IMHO, I'd rather not reinvent the wheel for text manipulation. It would be
>>>  >>> painful to implement and maintain a bunch of Stellar functions for text
>>>  >>> manipulation. People already have a large number of tools available to do
>>>  >>> this and everyone has their favorites. People are resistant to learning
>>>  >>> something new when they already are familiar with another way to do the
>>>  >>> same thing.
>>>  >>>
>>>  >>> So then the question is, how else can we do this? My suggestion is that
>>>  >>> rather than introducing text manipulation tools inside of Stellar, we allow
>>>  >>> people to use the text manipulation tools they already know, but with the
>>>  >>> Stellar functions that we already have. And the obvious way to tie those
>>>  >>> two things together is the Unix pipeline.
>>>  >>>
>>>  >>> A quick, albeit horribly incomplete, example to flesh this out a bit more
>>>  >>> based on the example you have in PR #879
>>>  >>> <https://github.com/apache/metron/pull/879>. This would allow me to
>>>  >>> integrate Stellar with whatever external tools that I want.
>>>  >>>
>>>  >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>>>  >>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>>>  >>>
>>>  >>>
>>>  >>>
>>>  >>>
>>>  >>>
>>>  >>>
>>>  >>>
>>>  >>>
>>>  >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
>>>  >>>
>>>  >>>> I'll start this discussion off with my idea around a 2nd step that is more
>>>  >>>> adaptable. I propose the following set of stellar functions backed by
>>>  >>>> Spark in the metron-management project:
>>>  >>>>
>>>  >>>> - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
>>>  >>>>   Dataframe for reading the flatfile
>>>  >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the dataframe
>>>  >>>> - SUMMARIZE(state_init, state_update, state_merge): Summarize the
>>>  >>>>   dataframe using the lambda functions:
>>>  >>>>   - state_init - executed once per worker to initialize the state
>>>  >>>>   - state_update - executed once per row
>>>  >>>>   - state_merge - Merge the worker states into one worker state
>>>  >>>> - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
>>>  >>>>   output_path on HDFS.
>>>  >>>>
>>>  >>>> This would enable more flexibility and composability than the
>>>  >>>> configuration-based approach that we have in the flatfile loader.
>>>  >>>> My concern with this approach, and the reason I didn't do it initially,
>>>  >>>> was that I think that users will want at least 2 ways to summarize data
>>>  >>>> (or load data):
>>>  >>>>
>>>  >>>> - A configuration based approach, which enables a UI
>>>  >>>> - A set of stellar functions via the scriptable REPL
>>>  >>>>
>>>  >>>> I would argue that both have a place and I started with the configuration
>>>  >>>> based approach as it was a more natural extension of what we already had.
>>>  >>>> I'd love to hear thoughts about this idea too.
>>>  >>>>
>>>  >>>>
>>>  >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
>>>  >>>>
>>>  >>>>> Hi all,
>>>  >>>>>
>>>  >>>>> I wanted to get some feedback on a sensible plan for something. It
>>>  >>>>> occurred to me the other day when considering the use-case of detecting
>>>  >>>>> typosquatted domains, that one approach was to generate the set of
>>>  >>>>> typosquatted domains for some set of reference domains and compare
>>>  >>>>> domains as they flow through.
>>>  >>>>>
>>>  >>>>> One way we could do this would be to generate this data and import the
>>>  >>>>> typosquatted domains into HBase. I thought, however, that another
>>>  >>>>> approach might trade off accuracy to remove the network hop and potential
>>>  >>>>> disk seek by constructing a bloom filter that includes the set of
>>>  >>>>> typosquatted domains.
>>>  >>>>>
>>>  >>>>> The challenge was that we don't have a way to do this currently. We do,
>>>  >>>>> however, have a loading infrastructure (e.g. the flatfile_loader) and
>>>  >>>>> configuration (see
>>>  >>>>> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
>>>  >>>>> which handles:
>>>  >>>>>
>>>  >>>>> - parsing flat files
>>>  >>>>> - transforming the rows
>>>  >>>>> - filtering the rows
>>>  >>>>>
>>>  >>>>> To enable the new use-case of generating a summary object (e.g. a bloom
>>>  >>>>> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
>>>  >>>>> propose that we create a new utility that uses the same extractor config
>>>  >>>>> and adds the ability to:
>>>  >>>>>
>>>  >>>>> - initialize a state object
>>>  >>>>> - update the object for every row
>>>  >>>>> - merge the state objects (in the case of multiple threads; in the
>>>  >>>>>   case of one thread it's not needed).
>>>  >>>>>
>>>  >>>>> I think this is a sensible decision because:
>>>  >>>>>
>>>  >>>>> - It's a minimal movement from the flat file loader
>>>  >>>>> - Uses the same configs
>>>  >>>>> - Abstracts and reuses the existing infrastructure
>>>  >>>>> - Having one extractor config means that it should be easier to
>>>  >>>>> generate a UI around this to simplify the experience
>>>  >>>>>
>>>  >>>>> All that being said, our extractor config is..shall we say...daunting :).
>>>  >>>>> I am sensitive to the fact that this adds to an existing difficult config.
>>>  >>>>> I propose that this is an initial step forward to support the use-case
>>>  >>>>> and we can enable something more composable going forward. My concern in
>>>  >>>>> considering this as the first step was that it felt that the composable
>>>  >>>>> units for data transformation and manipulation suddenly take us into a
>>>  >>>>> place where Stellar starts to look like Pig or Spark RDD API. I wasn't
>>>  >>>>> ready for that without a lot more discussion.
>>>  >>>>>
>>>  >>>>> To summarize, what I'd like to get from the community is, after reviewing
>>>  >>>>> the entire use-case at
>>>  >>>>> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>>>  >>>>>
>>>  >>>>> - Is this so confusing that it does not belong in Metron even as a
>>>  >>>>>   first-step?
>>>  >>>>> - Is there a way to extend the extractor config in a less confusing
>>>  >>>>>   way to enable this?
>>>  >>>>>
>>>  >>>>> I apologize for making the discuss thread *after* the JIRAs, but I felt
>>>  >>>>> this one might bear having some working code to consider.
>>>  >>>>>
>>>  >>>>
>>>  >>>
>>>  >>

------------------- 
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org
