Any volunteers for creating a set of jiras and feature branch for an object store repository? This sounds like a massive feature.
On Jan 5, 2018 2:06 PM, "Otto Fowler" <ottobackwa...@gmail.com> wrote: > I would say that at the stellar author level, you would just get objects > from the store and the ‘override’ case would be a follow on for edge cases. > > > On January 5, 2018 at 14:29:16, Casey Stella (ceste...@gmail.com) wrote: > > Well, you can pull the default configs from global configs, but you might > want to override them (similar to the profiler). For instance, you might > want to interact with another hbase table than the one globally configured. > > On Fri, Jan 5, 2018 at 12:04 PM, Otto Fowler <ottobackwa...@gmail.com> > wrote: > > > I would imagine the ‘stellar-object-repo’ would be part of the global > > configuration or configuration passed to the command. > > why specify in the function itself? > > > > > > > > > > On January 5, 2018 at 11:22:32, Casey Stella (ceste...@gmail.com) wrote: > > > > I like that, specifically the repositories abstraction. Perhaps we can > > construct some longer term JIRAs for extensions. > > For the current state of affairs (wrt to the OBJECT_GET call) I was > > imagining the simple default HDFS solution as a first cut and > > following on adding a repository name (e.g. OBJECT_GET(path, repo_name) > > with repo_name being optional and defaulting to HDFS > > for backwards compatibility. > > > > In effect, this would be the next step that I'm proposing > OBJECT_GET(paths, > > repo_name, repo_config) which would be backwards compatible > > > > - paths - a single path or a list of paths (if a list, then a list of > > objects returned) > > - repo_name - optional name for repo, defaulted to HDFS if we don't > > specify > > - repo_config - optional config map > > > > > > This would open things like: > > > > - OBJECT_GET('key', 'HBASE', { 'hbase.table' : 'table', 'hbase.cf' : > > 'cf'} ) -- pulling from HBase > > > > Eventually we might also be able to fold ENRICHMENT_GET as just a special > > repo instance. 
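The repository-dispatch shape behind the proposed `OBJECT_GET(paths, repo_name, repo_config)` can be sketched as follows. This is an editorial illustration only, not Metron code; the loader functions and their return values are hypothetical stand-ins, and the only behavior taken from the thread is the signature: a single path or a list of paths, an optional repository name defaulting to HDFS for backwards compatibility, and an optional config map.

```python
from typing import Any, Callable, Dict, List, Optional, Union

# Hypothetical repository loaders; real implementations would deserialize
# an object from HDFS or fetch a row from an HBase table/column family.
REPOSITORIES: Dict[str, Callable[[str, dict], Any]] = {
    "HDFS": lambda path, cfg: f"<object deserialized from hdfs://{path}>",
    "HBASE": lambda key, cfg: (
        f"<object from table {cfg.get('hbase.table')}, "
        f"cf {cfg.get('hbase.cf')}, row {key}>"
    ),
}

def object_get(paths: Union[str, List[str]],
               repo_name: str = "HDFS",
               repo_config: Optional[dict] = None) -> Any:
    """Resolve a repository by name (default HDFS) and load one or many
    objects; a list of paths yields a list of objects."""
    cfg = repo_config or {}
    loader = REPOSITORIES[repo_name]
    if isinstance(paths, list):
        return [loader(p, cfg) for p in paths]
    return loader(paths, cfg)
```

Under this shape, `OBJECT_GET('key', 'HBASE', {'hbase.table': 'table', 'hbase.cf': 'cf'})` is just dispatch plus config, which is why ENRICHMENT_GET could eventually fold in as one more named repository.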
> > > > On Fri, Jan 5, 2018 at 10:26 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:
> > >
> > > If we separate the concerns as I have stated previously:
> > >
> > > 1. Stellar can load objects into ‘caches’ from some repository and refer to them.
> > > 2. The repositories
> > > 3. Some number of strategies to populate and possibly update the repository, from Spark to MR jobs to whatever you would classify the flat file stuff as.
> > > 4. Let the Stellar API for everything but LOAD() follow after we get usage
> > >
> > > Then the particulars of ‘3’ are less important.
> > >
> > > On January 5, 2018 at 09:02:41, Justin Leet (justinjl...@gmail.com) wrote:
> > >
> > > I agree with the general sentiment that we can tailor specific use cases via UI, and I'm worried that the use-case-specific solution (particularly in light of the note that it's not even general to the class of bloom filter problems, let alone an actually general problem) becomes more work than this as soon as about 2 more use cases actually get realized. Pushing that to the UI lets people solve a variety of problems if they really want to dig in, while still giving flexibility to provide a more tailored experience for what we discover the 80% cases are in practice.
> > >
> > > Keeping in mind I am mostly unfamiliar with the extractor config itself, I am wondering if it makes sense to split up the config a bit. While a lot of implementation details are shared, maybe the extractor config itself should be refactored into a couple parts analogous to ETL (as a follow-on task; I think if this is true, it predates Casey's proposed change). It doesn't necessarily make it less complex, but it might make it more easily digestible if it's split up by idea (parsing, transformation, etc.).
> > > > > > Re: Mike's point, I don't think we want the actual processing broken up > > as > > > ETL, but the representation to the user in terms of configuration could > > be > > > similar (Since we're already doing parsing and transformation). We > don't > > > have to implement it as an ETL pipeline, but it does potentially offer > > the > > > user a way to quickly grasp what the JSON blob is actually specifying. > > > Making it easy to understand, even if it's not the ideal way to > interact > > is > > > potentially still a win. > > > > > > On Thu, Jan 4, 2018 at 1:28 PM, Michael Miklavcic < > > > michael.miklav...@gmail.com> wrote: > > > > > > > I mentioned this earlier, but I'll reiterate that I think this > approach > > > > gives us the ability to make specific use cases via a UI, or other > > > > interface should we choose to add one, while keeping the core > adaptable > > > and > > > > flexible. This is ideal for middle tier as I think this effectively > > gives > > > > us the ability to pivot to other use cases very easily while not > being > > so > > > > generic as to be useless. The fact that you were able to create this > as > > > > quickly as you did seems to me directly related to the fact we made > the > > > > decision to keep the loader somewhat flexible rather than very > > specific. > > > > The operation ordering and state carry from one phase of processing > to > > > the > > > > next would simply have been inscrutable, if not impossible, with a > CLI > > > > option-only approach. Sure, it's not as simple as "put infile.txt > > > > outfile.txt", but the alternatives are not that clear either. One > might > > > > argue we could split up the processing pieces as in traditional > Hadoop, > > > eg > > > > ETL: Sqoop ingest -> HDFS -> mapreduce, pig, hive, or spark > transform. > > > But > > > > quite frankly that's going in the *opposite* direction I think we > want > > > > here. That's more complex in terms of moving parts. 
The config > approach > > > > with pluggable Stellar insulates users from specific implementations, > > but > > > > also gives you the ability to pass lower level constructs, eg Spark > SQL > > > or > > > > HiveQL, should the need arise. > > > > > > > > In summary, my impressions are that at this point the features and > > level > > > of > > > > abstraction feel appropriate to me. I think it buys us 1) learning > > from a > > > > starting typosquatting use case, 2) flexibility to change and adapt > it > > > > without affecting users, and 3) enough concrete capability to make > more > > > > specific use cases easy to deliver with a UI. > > > > > > > > Cheers, > > > > Mike > > > > > > > > On Jan 4, 2018 9:59 AM, "Casey Stella" <ceste...@gmail.com> wrote: > > > > > > > > > It also occurs to me that even in this situation, it's not a > > sufficient > > > > > generalization for just Bloom, but this is a bloom filter of the > > output > > > > of > > > > > the all the typosquatted domains for the domain in each row. If we > > > > wanted > > > > > to hard code, we'd have to hard code specifically the bloom filter > > > *for* > > > > > typosquatting use-case. Hard coding this would prevent things like > > > bloom > > > > > filters containing malicious IPs from a reference source, for > > instance. > > > > > > > > > > On Thu, Jan 4, 2018 at 10:46 AM, Casey Stella <ceste...@gmail.com> > > > > wrote: > > > > > > > > > > > So, there is value outside of just bloom usage. The most specific > > > > > example > > > > > > of this would be in order to configure a bloom filter, we need to > > > know > > > > at > > > > > > least an upper bound of the number of items that are going to be > > > added > > > > to > > > > > > the bloom filter. In order to do that, we need to count the > number > > of > > > > > > typosquatted domains. 
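As an aside on why the counting step matters: a bloom filter's bit-array size and hash count are derived from the expected number of insertions, so an upper bound on the item count must exist before the filter is built. The standard sizing formulas (general bloom filter math, not Metron's implementation) can be sketched as:

```python
import math

def bloom_size(n_items: int, fp_rate: float) -> tuple:
    """Standard bloom filter sizing: for n expected items and target
    false-positive rate p, the optimal bit count is m = -n*ln(p)/(ln 2)^2
    and the optimal number of hash functions is k = (m/n)*ln 2."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# e.g. sizing for ~10k reference domains at a 1% false-positive rate
m, k = bloom_size(10_000, 0.01)  # -> m = 95851 bits (~12 KB), k = 7 hashes
```

Underestimating the count inflates the actual false-positive rate, which is why the summarizer's counting pass is a prerequisite rather than an optimization.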
> > > > > > Specifically at https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#configure-the-bloom-filter you can see how we use the CONSOLE writer with an extractor config to count the number of typosquatted domains in the Alexa top 10k dataset so we can size the filter appropriately.
> > > > > >
> > > > > > I'd argue that other types of probabilistic data structures could also make sense here as well, like statistical sketches. Consider, for instance, a cheap and dirty DGA indicator where we take the Alexa top 1M and look at the distribution of Shannon entropy in the domains. If the Shannon entropy of a domain going across Metron is more than 5 std devs from the mean, that could be circumstantial evidence of a malicious attack. This would yield a lot of false positives, but used in conjunction with other indicators it could be valuable.
> > > > > >
> > > > > > Computing that would be as follows:
> > > > > >
> > > > > > {
> > > > > >   "config" : {
> > > > > >     "columns" : {
> > > > > >       "rank" : 0,
> > > > > >       "domain" : 1
> > > > > >     },
> > > > > >     "value_transform" : {
> > > > > >       "domain" : "DOMAIN_REMOVE_TLD(domain)"
> > > > > >     },
> > > > > >     "value_filter" : "LENGTH(domain) > 0",
> > > > > >     "state_init" : "STATS_INIT()",
> > > > > >     "state_update" : {
> > > > > >       "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
> > > > > >     },
> > > > > >     "state_merge" : "STATS_MERGE(states)",
> > > > > >     "separator" : ","
> > > > > >   },
> > > > > >   "extractor" : "CSV"
> > > > > > }
> > > > > >
> > > > > > Also, for another example, imagine a situation where we have a SPARK_SQL engine rather than just LOCAL for summarizing.
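For illustration, the character-entropy quantity that config accumulates, and the "more than 5 std devs from the mean" check, can be sketched in plain Python. These functions are hypothetical stand-ins written for this sketch, not Stellar's STRING_ENTROPY or STATS implementations:

```python
import math
from collections import Counter
from statistics import mean, stdev

def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits; the quantity the
    STRING_ENTROPY-style check in the thread is built on."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def is_outlier(domain: str, reference: list, n_devs: float = 5.0) -> bool:
    """Flag a domain whose entropy lies more than n_devs standard
    deviations from the mean entropy of a reference corpus
    (e.g. the Alexa top 1M)."""
    ents = [shannon_entropy(d) for d in reference]
    mu, sd = mean(ents), stdev(ents)
    return abs(shannon_entropy(domain) - mu) > n_devs * sd
```

In the real pipeline the mean and standard deviation would come from the summarized STATS object loaded at enrichment time, so only `shannon_entropy(domain)` is computed per message.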
We could create a > > > > general > > > > > > summary of URL lengths in bro data which could be used for > > > determining > > > > if > > > > > > someone is trying to send in very large URLs maliciously (see Jon > > > > > Zeolla's > > > > > > concerns in https://issues.apache.org/jira/browse/METRON-517 > for a > > > > > > discussion of this). In order to do that, we could simply > execute: > > > > > > > > > > > > $METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" > > -o > > > > > /tmp/reference/bro_uri_distribution.ser -e > > ~/uri_length_extractor.json > > > > -p > > > > > 5 -om HDFS -m SPARK_SQL > > > > > > > > > > > > with uri_length_extractor.json containing: > > > > > > > > > > > > { > > > > > > "config" : { > > > > > > "value_filter" : "LENGTH(uri) > 0", > > > > > > "state_init" : "STATS_INIT()", > > > > > > "state_update" : { > > > > > > "state" : "STATS_ADD(state, LENGTH(uri))" > > > > > > }, > > > > > > "state_merge" : "STATS_MERGE(states)", > > > > > > "separator" : "," > > > > > > }, > > > > > > "extractor" : "SQL_ROW" > > > > > > } > > > > > > > > > > > > > > > > > > Regarding value filter, that's already around in the extractor > > config > > > > > > because of the need to transform data in the flatfile loader. > > While I > > > > > > definitely see the desire to use unix tools to prep data, there > are > > > > some > > > > > > things that aren't as easy to do. For instance, here, removing > the > > > TLD > > > > > of > > > > > > a domain is not a trivial task in a shell script and we have > > existing > > > > > > functions for that in Stellar. I would see people using both. > > > > > > > > > > > > To address the issue of a more targeted experience to bloom, I > > think > > > > that > > > > > > sort of specialization should best exist in the UI layer. Having > a > > > > more > > > > > > complete and expressive backend reused across specific UIs seems > to > > > be > > > > > the > > > > > > best of all worlds. 
It allows power users to drop down and do > more > > > > > complex > > > > > > things and still provides a (mostly) code-free and targeted > > > experience > > > > > for > > > > > > users. It seems to me that limiting the expressibility in the > > backend > > > > > > isn't the right way to go since this work just fits in with our > > > > existing > > > > > > engine. > > > > > > > > > > > > > > > > > > On Thu, Jan 4, 2018 at 1:40 AM, James Sirota <jsir...@apache.org > > > > > > wrote: > > > > > > > > > > > >> I just went through these pull requests as well and also agree > > this > > > is > > > > > >> good work. I think it's a good first pass. I would be careful > with > > > > > trying > > > > > >> to boil the ocean here. I think for the initial use case I would > > > only > > > > > >> support loading the bloom filters from HDFS. If people want to > > > > > pre-process > > > > > >> the CSV file of domains using awk or sed this should be out of > > scope > > > > of > > > > > >> this work. It's easy enough to do out of band and I would not > > > include > > > > > any > > > > > >> of these functions at all. I also think that the config could be > > > > > >> considerably simplified. I think value_filter should be removed > > > > (since > > > > > I > > > > > >> believe that preprocessing should be done by the user outside of > > > this > > > > > >> process). I also have a question about the init, update, and > merge > > > > > >> configurations. Would I ever initialize to anything but an empty > > > > bloom > > > > > >> filter? For the state update would I ever do anything other than > > add > > > > to > > > > > >> the bloom filter? For the state merge would I ever do anything > > other > > > > > than > > > > > >> merge the states? If the answer to these is 'no', then this > should > > > > > simply > > > > > >> be hard coded and not externalized into config values. 
> > > > > >> > > > > > >> 03.01.2018, 14:20, "Michael Miklavcic" < > > michael.miklav...@gmail.com > > > >: > > > > > > > > >> > I just finished stepping through the typosquatting use case > > README > > > > in > > > > > >> your > > > > > >> > merge branch. This is really, really good work Casey. I see > most > > > of > > > > > our > > > > > >> > previous documentation issues addressed up front, e.g. special > > > > > variables > > > > > >> > are cited, all new fields explained, side effects documented. > > The > > > > use > > > > > >> case > > > > > >> > doc brings it all together soup-to-nuts and I think all the > > pieces > > > > > make > > > > > >> > sense in a mostly self-contained way. I can't think of > anything > > I > > > > had > > > > > to > > > > > >> > sit and think about for more than a few seconds. I'll be > making > > my > > > > way > > > > > >> > through your individual PR's in more detail, but my first > > > > impressions > > > > > >> are > > > > > >> > that this is excellent. > > > > > >> > > > > > > >> > On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic < > > > > > >> > michael.miklav...@gmail.com> wrote: > > > > > >> > > > > > > >> >> I'm liking this design and growth strategy, Casey. I also > think > > > > Nick > > > > > >> and > > > > > >> >> Otto have some valid points. I always find there's a natural > > > > tension > > > > > >> >> between too little, just enough, and boiling the ocean and > > these > > > > > >> discuss > > > > > >> >> threads really help drive what the short and long term > visions > > > > > should > > > > > >> look > > > > > >> >> like. > > > > > >> >> > > > > > >> >> On the subject of repositories and strategies, I agree that > > > > > pluggable > > > > > >> >> repos and strategies for modifying them would be useful. For > > the > > > > > first > > > > > >> >> pass, I'd really like to see HDFS with the proposed set of > > > Stellar > > > > > >> >> functions. 
This gives us a lot of bang for our buck - we can > > > > > >> capitalize on > > > > > >> >> a set of powerful features around existence checking earlier > > > > without > > > > > >> having > > > > > >> >> to worry about later interface changes impacting users. With > > the > > > > > >> primary > > > > > >> >> interface coming through the JSON config, we are building a > > nice > > > > > >> facade > > > > > >> >> that protects users from later implementation abstractions > and > > > > > >> >> improvements, all while providing a stable enough interface > on > > > > which > > > > > >> we can > > > > > >> >> develop UI features as desired. I'd be interested to hear > more > > > > about > > > > > >> what > > > > > >> >> features could be provided by a repository as time goes by. > > > > > >> Federation, > > > > > >> >> permissions, governance, metadata management, perhaps? > > > > > >> >> > > > > > >> >> I also had some concern over duplicating existing Unix > > features. > > > I > > > > > >> think > > > > > >> >> where I'm at has been largely addressed by Casey's comments > on > > 1) > > > > > >> scaling, > > > > > >> >> 2) multiple variables, and 3) portability to Hadoop. > Providing > > 2 > > > > > >> approaches > > > > > >> >> - 1 which is config-based and the other a composable set of > > > > > functions > > > > > >> gives > > > > > >> >> us the ability to provide a core set of features that can > later > > > be > > > > > >> easily > > > > > >> >> expanded by users as the need arises. Here again I think the > > > > > >> prescribed > > > > > >> >> approach provides a strong first pass that we can then expand > > on > > > > > >> without > > > > > >> >> concern of future improvements becoming a hassle for end > users. 
> > > > > >> >> Best,
> > > > > >> >> Mike
> > > > > >> >>
> > > > > >> >> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > > > >> >>
> > > > > >> >>> There is some really cool stuff happening here, if only I’d been allowed to see the lists over Christmas... :)
> > > > > >> >>>
> > > > > >> >>> A few thoughts...
> > > > > >> >>>
> > > > > >> >>> I like Otto’s generalisation of the problem to include specific local stellar objects in a cache loaded from a store (HDFS seems a natural, but not only, place; maybe even a web service / local microservicey object provider!?). That said, I suspect that’s a good platform optimisation approach. Should we look at this as a separate piece of work, given it extends beyond the scope of the summarisation concept, and ultimately use it as a back-end to feed the summarising engine proposed here for the enrichment loader?
> > > > > >> >>>
> > > > > >> >>> On the more specific use case, one thing I would comment on is the configuration approach. The iteration loop (state_{init|update|merge}) should be consistent with the way we handle things like the profiler config, since it’s the same approach to data handling.
> > > > > >> >>>
> > > > > >> >>> The other thing that seems to have crept in here is the interface to something like Spark, which again, I am really very very keen on seeing happen.
That said, not sure how that would happen in this > > > > context, > > > > > >> unless > > > > > >> >>> you’re talking about pushing to something like livy for > > example > > > > > >> (eminently > > > > > >> >>> sensible for things like cross instance caching and faster > > > > RPC-ish > > > > > >> access > > > > > >> >>> to an existing spark context which seem to be what Casey is > > > > driving > > > > > >> at with > > > > > >> >>> the spark piece. > > > > > >> >>> > > > > > >> >>> To address the question of text manipulation in Stellar / > > metron > > > > > >> >>> enrichment ingest etc, we already have this outside of the > > > > context > > > > > >> of the > > > > > >> >>> issues here. I would argue that yes, we don’t want too many > > > paths > > > > > >> for this, > > > > > >> >>> and that maybe our parser approach might be heavily related > to > > > > > >> text-based > > > > > >> >>> ingest. I would say the scope worth dealing with here though > > is > > > > not > > > > > >> really > > > > > >> >>> text manipulation, but summarisation, which is not well > served > > > by > > > > > >> existing > > > > > >> >>> CLI tools like awk / sed and friends. > > > > > >> >>> > > > > > >> >>> Simon > > > > > >> >>> > > > > > >> >>> > On 3 Jan 2018, at 15:48, Nick Allen <n...@nickallen.org> > > > > wrote: > > > > > >> >>> > > > > > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa > > 1m, > > > > so > > > > > I > > > > > >> >>> think > > > > > >> >>> > this will impact performance > > > > > >> >>> > > > > > > >> >>> > What exactly takes an hour? Adding 1M entries to a bloom > > > > filter? > > > > > >> That > > > > > >> >>> > seems really high, unless I am not understanding > something. 
> > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella < > > > > > ceste...@gmail.com> > > > > > >> >>> wrote: > > > > > >> >>> > > > > > > >> >>> >> Thanks for the feedback, Nick. > > > > > >> >>> >> > > > > > >> >>> >> Regarding "IMHO, I'd rather not reinvent the wheel for > text > > > > > >> >>> manipulation." > > > > > >> >>> >> > > > > > >> >>> >> I would argue that we are not reinventing the wheel for > > text > > > > > >> >>> manipulation > > > > > >> >>> >> as the extractor config exists already and we are doing a > > > > > similar > > > > > >> >>> thing in > > > > > >> >>> >> the flatfile loader (in fact, the code is reused and > merely > > > > > >> extended). > > > > > >> >>> >> Transformation operations are already supported in our > > > > codebase > > > > > >> in the > > > > > >> >>> >> extractor config, this PR has just added some hooks for > > > > stateful > > > > > >> >>> >> operations. > > > > > >> >>> >> > > > > > >> >>> >> Furthermore, we will need a configuration object to pass > to > > > > the > > > > > >> REST > > > > > >> >>> call > > > > > >> >>> >> if we are ever to create a UI around importing data into > > > hbase > > > > > or > > > > > >> >>> creating > > > > > >> >>> >> these summary objects. > > > > > >> >>> >> > > > > > >> >>> >> Regarding your example: > > > > > >> >>> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | > > > > stellar > > > > > -i > > > > > >> >>> >> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)' > > > > > >> >>> >> > > > > > >> >>> >> I'm very sympathetic to this type of extension, but it > has > > > > some > > > > > >> issues: > > > > > >> >>> >> > > > > > >> >>> >> 1. This implies a single-threaded addition to the bloom > > > > filter. > > > > > >> >>> >> 1. 
> > > > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa 1M, so I think this will impact performance
> > > > > >> >>> >>    2. There's not a way to specify how to merge across threads if we do make a multithreaded command line option
> > > > > >> >>> >> 2. This restricts these kinds of operations to roles with heavy Unix CLI knowledge, which isn't often the types of people who would be doing this type of operation
> > > > > >> >>> >> 3. What if we need two variables passed to Stellar?
> > > > > >> >>> >> 4. This approach will be harder to move to Hadoop. Eventually we will want to support data on HDFS being processed by Hadoop (similar to the flatfile loader), so instead of -m LOCAL being passed for the flatfile summarizer you'd pass -m SPARK and the processing would happen on the cluster
> > > > > >> >>> >>    1. This is particularly relevant in this case as it's an embarrassingly parallel problem in general
> > > > > >> >>> >>
> > > > > >> >>> >> In summary, while this CLI approach is attractive, I prefer the extractor config solution because it is the solution with the smallest iteration that:
> > > > > >> >>> >>
> > > > > >> >>> >> 1. Reuses existing Metron extraction infrastructure
> > > > > >> >>> >> 2. Provides the most solid base for the extensions that will be sorely needed soon (and will keep it in parity with the flatfile loader)
> > > > > >> >>> >> 3.
Provides the most solid base for a future UI extension > > in > > > > the > > > > > >> >>> >> management UI to support both summarization and loading > > > > > >> >>> >> > > > > > >> >>> >> > > > > > >> >>> >> > > > > > >> >>> >> > > > > > >> >>> >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen < > > > > > n...@nickallen.org> > > > > > >> >>> wrote: > > > > > >> >>> >> > > > > > >> >>> >>> First off, I really do like the typosquatting use case > > and a > > > > > lot > > > > > >> of > > > > > >> >>> what > > > > > >> >>> >>> you have described. > > > > > >> >>> >>> > > > > > >> >>> >>>> We need a way to generate the summary sketches from > flat > > > > data > > > > > >> for > > > > > >> >>> this > > > > > >> >>> >> to > > > > > >> >>> >>>> work. > > > > > >> >>> >>>> .. > > > > > >> >>> >>>> > > > > > >> >>> >>> > > > > > >> >>> >>> I took this quote directly from your use case. Above is > > the > > > > > point > > > > > >> >>> that > > > > > >> >>> >> I'd > > > > > >> >>> >>> like to discuss and what your proposed solutions center > > on. > > > > > This > > > > > >> is > > > > > >> >>> >> what I > > > > > >> >>> >>> think you are trying to do, at least with PR #879 > > > > > >> >>> >>> <https://github.com/apache/metron/pull/879>... > > > > > >> >>> >>> > > > > > >> >>> >>> (Q) Can we repurpose Stellar functions so that they can > > > > operate > > > > > >> on > > > > > >> >>> text > > > > > >> >>> >>> stored in a file system? > > > > > >> >>> >>> > > > > > >> >>> >>> > > > > > >> >>> >>> Whether we use the (1) Configuration or the (2) > > > > Function-based > > > > > >> >>> approach > > > > > >> >>> >>> that you described, fundamentally we are introducing new > > > ways > > > > > to > > > > > >> >>> perform > > > > > >> >>> >>> text manipulation inside of Stellar. > > > > > >> >>> >>> > > > > > >> >>> >>> IMHO, I'd rather not reinvent the wheel for text > > > > manipulation. 
> > > > > It > > > > > >> >>> would > > > > > >> >>> >> be > > > > > >> >>> >>> painful to implement and maintain a bunch of Stellar > > > > functions > > > > > >> for > > > > > >> >>> text > > > > > >> >>> >>> manipulation. People already have a large number of > tools > > > > > >> available > > > > > >> >>> to > > > > > >> >>> >> do > > > > > >> >>> >>> this and everyone has their favorites. People are > > resistant > > > > to > > > > > >> >>> learning > > > > > >> >>> >>> something new when they already are familiar with > another > > > way > > > > > to > > > > > >> do > > > > > >> >>> the > > > > > >> >>> >>> same thing. > > > > > >> >>> >>> > > > > > >> >>> >>> So then the question is, how else can we do this? My > > > > suggestion > > > > > >> is > > > > > >> >>> that > > > > > >> >>> >>> rather than introducing text manipulation tools inside > of > > > > > >> Stellar, we > > > > > >> >>> >> allow > > > > > >> >>> >>> people to use the text manipulation tools they already > > know, > > > > > but > > > > > >> with > > > > > >> >>> the > > > > > >> >>> >>> Stellar functions that we already have. And the obvious > > way > > > > to > > > > > >> tie > > > > > >> >>> those > > > > > >> >>> >>> two things together is the Unix pipeline. > > > > > >> >>> >>> > > > > > >> >>> >>> A quick, albeit horribly incomplete, example to flesh > this > > > > out > > > > > a > > > > > >> bit > > > > > >> >>> more > > > > > >> >>> >>> based on the example you have in PR #879 > > > > > >> >>> >>> <https://github.com/apache/metron/pull/879>. This would > > > > allow > > > > > >> me to > > > > > >> >>> >>> integrate Stellar with whatever external tools that I > > want. 
> > > > > >> >>> >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > > > >> >>> >>>
> > > > > >> >>> >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > > > >> >>> >>>
> > > > > >> >>> >>>> I'll start this discussion off with my idea around a 2nd step that is more adaptable. I propose the following set of Stellar functions backed by Spark in the metron-management project:
> > > > > >> >>> >>>>
> > > > > >> >>> >>>> - CSV_PARSE(location, separator?, columns?): Constructs a Spark Dataframe for reading the flat file
> > > > > >> >>> >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the dataframe
> > > > > >> >>> >>>> - SUMMARIZE(state_init, state_update, state_merge): Summarize the dataframe using the lambda functions:
> > > > > >> >>> >>>>   - state_init - executed once per worker to initialize the state
> > > > > >> >>> >>>>   - state_update - executed once per row
> > > > > >> >>> >>>>   - state_merge - Merge the worker states into one worker state
> > > > > >> >>> >>>> - OBJECT_SAVE(obj, output_path): Save the object obj to the path output_path on HDFS.
> > > > > >> >>> >>>>
> > > > > >> >>> >>>> This would enable more flexibility and composability than the configuration-based approach that we have in the flatfile loader.
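The SUMMARIZE contract described in the thread, with state_init run once per worker, state_update once per row, and state_merge folding the per-worker states together, can be sketched as a partition-wise fold followed by a merge. This is an editorial sketch in plain Python rather than Spark, and the function shape is one reading of the proposal, not shipped code:

```python
from functools import reduce
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")  # state/summary type
R = TypeVar("R")  # row type

def summarize(partitions: Iterable[List[R]],
              state_init: Callable[[], S],
              state_update: Callable[[S, R], S],
              state_merge: Callable[[S, S], S]) -> S:
    """state_init runs once per worker/partition, state_update once per
    row, and state_merge folds the per-worker states into one summary."""
    worker_states = []
    for part in partitions:
        state = state_init()
        for row in part:
            state = state_update(state, row)
        worker_states.append(state)
    return reduce(state_merge, worker_states)

# e.g. summing row lengths across two "workers", mirroring the
# STATS_INIT / STATS_ADD / STATS_MERGE shape used in the thread
total = summarize(
    partitions=[["ab", "cde"], ["f"]],
    state_init=lambda: 0,
    state_update=lambda s, row: s + len(row),
    state_merge=lambda a, b: a + b,
)
```

As long as state_merge is associative, the same three lambdas work unchanged whether there is one thread, many threads, or many Spark executors, which is what makes the problem embarrassingly parallel.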
> > > > > >> >>> >>>> My concern with this approach, and the reason I didn't > do > > > it > > > > > >> >>> initially, > > > > > >> >>> >>> was > > > > > >> >>> >>>> that I think that users will want at least 2 ways to > > > > summarize > > > > > >> data > > > > > >> >>> (or > > > > > >> >>> >>>> load data): > > > > > >> >>> >>>> > > > > > >> >>> >>>> - A configuration based approach, which enables a UI > > > > > >> >>> >>>> - A set of stellar functions via the scriptable REPL > > > > > >> >>> >>>> > > > > > >> >>> >>>> I would argue that both have a place and I started with > > the > > > > > >> >>> >> configuration > > > > > >> >>> >>>> based approach as it was a more natural extension of > what > > > we > > > > > >> already > > > > > >> >>> >> had. > > > > > >> >>> >>>> I'd love to hear thoughts about this idea too. > > > > > >> >>> >>>> > > > > > >> >>> >>>> > > > > > >> >>> >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella < > > > > > >> ceste...@gmail.com> > > > > > >> >>> >>> wrote: > > > > > >> >>> >>>> > > > > > >> >>> >>>>> Hi all, > > > > > >> >>> >>>>> > > > > > >> >>> >>>>> I wanted to get some feedback on a sensible plan for > > > > > >> something. It > > > > > >> >>> >>>>> occurred to me the other day when considering the > > use-case > > > > of > > > > > >> >>> >> detecting > > > > > >> >>> >>>>> typosquatted domains, that one approach was to > generate > > > the > > > > > >> set of > > > > > >> >>> >>>>> typosquatted domains for some set of reference domains > > and > > > > > >> compare > > > > > >> >>> >>>> domains > > > > > >> >>> >>>>> as they flow through. > > > > > >> >>> >>>>> > > > > > >> >>> >>>>> One way we could do this would be to generate this > data > > > and > > > > > >> import > > > > > >> >>> >> the > > > > > >> >>> >>>>> typosquatted domains into HBase. 
> I thought, however, that another approach might trade off some accuracy to
> remove the network hop and potential disk seek: constructing a bloom
> filter that includes the set of typosquatted domains.
>
> The challenge is that we don't have a way to do this currently. We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see
> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> which handles:
>
> - parsing flat files
> - transforming the rows
> - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
> - initialize a state object
> - update the object for every row
> - merge the state objects (needed for multiple threads; with one thread
>   it's not necessary).
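A bloom filter fits this init/update/merge lifecycle naturally because merging is just a bitwise OR. The sketch below is illustrative only (class and method names are invented, not Metron's actual implementation): the constructor plays the role of the init step, `add` the per-row update, and `merge` the cross-thread merge.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter illustrating the summary state object."""

    def __init__(self, size=1024, hashes=3):   # init: empty state
        self.size, self.hashes = size, hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):                       # update: once per row
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

    def merge(self, other):                    # merge: OR the bitsets
        merged = BloomFilter(self.size, self.hashes)
        merged.bits = self.bits | other.bits
        return merged

# Two thread-local filters merged into one summary object.
a = BloomFilter(); a.add("gogle")
b = BloomFilter(); b.add("goggle")
m = a.merge(b)
# m.might_contain("gogle") and m.might_contain("goggle") are both True
```

Because OR is associative and commutative, partial filters built by independent threads merge into exactly the filter a single thread would have produced.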
> I think this is a sensible decision because:
>
> - It's a minimal movement from the flat file loader
> - It uses the same configs
> - It abstracts and reuses the existing infrastructure
> - Having one extractor config means it should be easier to generate a UI
>   around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an already difficult config.
> I propose this as an initial step forward to support the use-case; we can
> enable something more composable going forward. My concern in considering
> the composable approach as the first step was that composable units for
> data transformation and manipulation suddenly take us into a place where
> Stellar starts to look like Pig or the Spark RDD API. I wasn't ready for
> that without a lot more discussion.
> To summarize, what I'd like to get from the community, after reviewing the
> entire use-case at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection,
> is:
>
> - Is this so confusing that it does not belong in Metron even as a
>   first step?
> - Is there a way to extend the extractor config in a less confusing way to
>   enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>
> -------------------
> Thank you,
>
> James Sirota
> PMC - Apache Metron
> jsirota AT apache DOT org