Re: [DISCUSS] Generating and Interacting with serialized summary objects

Casey Stella Fri, 05 Jan 2018 08:23:37 -0800

I like that, specifically the repositories abstraction.  Perhaps we can
construct some longer term JIRAs for extensions.
For the current state of affairs (wrt to the OBJECT_GET call) I was
imagining the simple default HDFS solution as a first cut and
following on adding a repository name (e.g. OBJECT_GET(path, repo_name)
with repo_name being optional and defaulting to HDFS
for backwards compatibility.


In effect, this would be the next step that I'm proposing OBJECT_GET(paths,
repo_name, repo_config) which would be backwards compatible

   - paths - a single path or a list of paths (if a list, then a list of
   objects returned)
   - repo_name - optional name for repo, defaulted to HDFS if we don't
   specify
   - repo_config - optional config map


This would open things like:

   - OBJECT_GET('key', 'HBASE', { 'hbase.table' : 'table', 'hbase.cf' :
   'cf'} ) -- pulling from HBase

Eventually we might also be able to fold ENRICHMENT_GET as just a special
repo instance.

On Fri, Jan 5, 2018 at 10:26 AM, Otto Fowler <ottobackwa...@gmail.com>
wrote:

> If we separate the concerns as I have state previously :
>
> 1. Stellar can load objects into ‘caches’ from some repository and refer to
> them.
> 2. The repositories
> 3. Some number of strategies to populate and possibly update the
> repository, from spark,
> to MR jobs to whatever you would classify the flat file stuff as.
> 4. Let the Stellar API for everything but LOAD() follow after we get usage
>
> Then the particulars of ‘3’ are less important.
>
>
>
> On January 5, 2018 at 09:02:41, Justin Leet (justinjl...@gmail.com) wrote:
>
> I agree with the general sentiment that we can tailor specific use cases
> via UI, and I'm worried that the use case specific solution (particularly
> in light of the note that it's not even general to the class of bloom
> filter problems, let alone an actually general problem) becomes more work
> than this as soon as about 2 more uses cases actually get realized.
> Pushing that to the UI lets people solve a variety of problems if they
> really want to dig in, while still giving flexibility to provide a more
> tailored experience for what we discover the 80% cases are in practice.
>
> Keeping in mind I am mostly unfamiliar with the extractor config itself, I
> am wondering if it makes sense to split up the config a bit. While a lot
> of implementation details are shared, maybe the extractor config itself
> should be refactored into a couple parts analogous to ETL (as a follow on
> task, I think if this is true, it predates Casey's proposed change). It
> doesn't necessarily make it less complex, but it might make it more easily
> digestible if it's split up by idea (parsing, transformation, etc.).
>
> Re: Mike's point, I don't think we want the actual processing broken up as
> ETL, but the representation to the user in terms of configuration could be
> similar (Since we're already doing parsing and transformation). We don't
> have to implement it as an ETL pipeline, but it does potentially offer the
> user a way to quickly grasp what the JSON blob is actually specifying.
> Making it easy to understand, even if it's not the ideal way to interact is
> potentially still a win.
>
> On Thu, Jan 4, 2018 at 1:28 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
> > I mentioned this earlier, but I'll reiterate that I think this approach
> > gives us the ability to make specific use cases via a UI, or other
> > interface should we choose to add one, while keeping the core adaptable
> and
> > flexible. This is ideal for middle tier as I think this effectively gives
> > us the ability to pivot to other use cases very easily while not being so
> > generic as to be useless. The fact that you were able to create this as
> > quickly as you did seems to me directly related to the fact we made the
> > decision to keep the loader somewhat flexible rather than very specific.
> > The operation ordering and state carry from one phase of processing to
> the
> > next would simply have been inscrutable, if not impossible, with a CLI
> > option-only approach. Sure, it's not as simple as "put infile.txt
> > outfile.txt", but the alternatives are not that clear either. One might
> > argue we could split up the processing pieces as in traditional Hadoop,
> eg
> > ETL: Sqoop ingest -> HDFS -> mapreduce, pig, hive, or spark transform.
> But
> > quite frankly that's going in the *opposite* direction I think we want
> > here. That's more complex in terms of moving parts. The config approach
> > with pluggable Stellar insulates users from specific implementations, but
> > also gives you the ability to pass lower level constructs, eg Spark SQL
> or
> > HiveQL, should the need arise.
> >
> > In summary, my impressions are that at this point the features and level
> of
> > abstraction feel appropriate to me. I think it buys us 1) learning from a
> > starting typosquatting use case, 2) flexibility to change and adapt it
> > without affecting users, and 3) enough concrete capability to make more
> > specific use cases easy to deliver with a UI.
> >
> > Cheers,
> > Mike
> >
> > On Jan 4, 2018 9:59 AM, "Casey Stella" <ceste...@gmail.com> wrote:
> >
> > > It also occurs to me that even in this situation, it's not a sufficient
> > > generalization for just Bloom, but this is a bloom filter of the output
> > of
> > > the all the typosquatted domains for the domain in each row. If we
> > wanted
> > > to hard code, we'd have to hard code specifically the bloom filter
> *for*
> > > typosquatting use-case. Hard coding this would prevent things like
> bloom
> > > filters containing malicious IPs from a reference source, for instance.
> > >
> > > On Thu, Jan 4, 2018 at 10:46 AM, Casey Stella <ceste...@gmail.com>
> > wrote:
> > >
> > > > So, there is value outside of just bloom usage. The most specific
> > > example
> > > > of this would be in order to configure a bloom filter, we need to
> know
> > at
> > > > least an upper bound of the number of items that are going to be
> added
> > to
> > > > the bloom filter. In order to do that, we need to count the number of
> > > > typosquatted domains. Specifically at https://github.com/
> > > > cestella/incubator-metron/tree/typosquat_merge/use-
> > > > cases/typosquat_detection#configure-the-bloom-filter you can see how
> > we
> > > > use the CONSOLE writer with an extractor config to count the number
> of
> > > > typosquatted domains in the alexa top 10k dataset so we can size the
> > > filter
> > > > appropriately.
> > > >
> > > > I'd argue that other types of probabalistic data structures could
> also
> > > > make sense here as well, like statistical sketches. Consider, for
> > > instance,
> > > > a cheap and dirty DGA indicator where we take the Alexa top 1M and
> look
> > > at
> > > > the distribution of shannon entropy in the domains. If the shannon
> > > entropy
> > > > of a domain going across metron is more than 5 std devs from the
> mean,
> > > that
> > > > could be circumstantial evidence of a malicious attack. This would
> > > yield a
> > > > lot of false positives, but used in conjunction with other indicators
> > it
> > > > could be valuable.
> > > >
> > > > Computing that would be as follows:
> > > >
> > > > {
> > > > "config" : {
> > > > "columns" : {
> > > > "rank" : 0,
> > > > "domain" : 1
> > > > },
> > > > "value_transform" : {
> > > > "domain" : "DOMAIN_REMOVE_TLD(domain)"
> > > > },
> > > > "value_filter" : "LENGTH(domain) > 0",
> > > > "state_init" : "STATS_INIT()",
> > > > "state_update" : {
> > > > "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
> > > > },
> > > > "state_merge" : "STATS_MERGE(states)",
> > > > "separator" : ","
> > > > },
> > > > "extractor" : "CSV"
> > > > }
> > > >
> > > > Also, for another example, imagine a situation where we have a
> > SPARK_SQL
> > > > engine rather than just LOCAL for summarizing. We could create a
> > general
> > > > summary of URL lengths in bro data which could be used for
> determining
> > if
> > > > someone is trying to send in very large URLs maliciously (see Jon
> > > Zeolla's
> > > > concerns in https://issues.apache.org/jira/browse/METRON-517 for a
> > > > discussion of this). In order to do that, we could simply execute:
> > > >
> > > > $METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" -o
> > > /tmp/reference/bro_uri_distribution.ser -e ~/uri_length_extractor.json
> > -p
> > > 5 -om HDFS -m SPARK_SQL
> > > >
> > > > with uri_length_extractor.json containing:
> > > >
> > > > {
> > > > "config" : {
> > > > "value_filter" : "LENGTH(uri) > 0",
> > > > "state_init" : "STATS_INIT()",
> > > > "state_update" : {
> > > > "state" : "STATS_ADD(state, LENGTH(uri))"
> > > > },
> > > > "state_merge" : "STATS_MERGE(states)",
> > > > "separator" : ","
> > > > },
> > > > "extractor" : "SQL_ROW"
> > > > }
> > > >
> > > >
> > > > Regarding value filter, that's already around in the extractor config
> > > > because of the need to transform data in the flatfile loader. While I
> > > > definitely see the desire to use unix tools to prep data, there are
> > some
> > > > things that aren't as easy to do. For instance, here, removing the
> TLD
> > > of
> > > > a domain is not a trivial task in a shell script and we have existing
> > > > functions for that in Stellar. I would see people using both.
> > > >
> > > > To address the issue of a more targeted experience to bloom, I think
> > that
> > > > sort of specialization should best exist in the UI layer. Having a
> > more
> > > > complete and expressive backend reused across specific UIs seems to
> be
> > > the
> > > > best of all worlds. It allows power users to drop down and do more
> > > complex
> > > > things and still provides a (mostly) code-free and targeted
> experience
> > > for
> > > > users. It seems to me that limiting the expressibility in the backend
> > > > isn't the right way to go since this work just fits in with our
> > existing
> > > > engine.
> > > >
> > > >
> > > > On Thu, Jan 4, 2018 at 1:40 AM, James Sirota <jsir...@apache.org>
> > wrote:
> > > >
> > > >> I just went through these pull requests as well and also agree this
> is
> > > >> good work. I think it's a good first pass. I would be careful with
> > > trying
> > > >> to boil the ocean here. I think for the initial use case I would
> only
> > > >> support loading the bloom filters from HDFS. If people want to
> > > pre-process
> > > >> the CSV file of domains using awk or sed this should be out of scope
> > of
> > > >> this work. It's easy enough to do out of band and I would not
> include
> > > any
> > > >> of these functions at all. I also think that the config could be
> > > >> considerably simplified. I think value_filter should be removed
> > (since
> > > I
> > > >> believe that preprocessing should be done by the user outside of
> this
> > > >> process). I also have a question about the init, update, and merge
> > > >> configurations. Would I ever initialize to anything but an empty
> > bloom
> > > >> filter? For the state update would I ever do anything other than add
> > to
> > > >> the bloom filter? For the state merge would I ever do anything other
> > > than
> > > >> merge the states? If the answer to these is 'no', then this should
> > > simply
> > > >> be hard coded and not externalized into config values.
> > > >>
> > > >> 03.01.2018, 14:20, "Michael Miklavcic" <michael.miklav...@gmail.com
> >:
>
> > > >> > I just finished stepping through the typosquatting use case README
> > in
> > > >> your
> > > >> > merge branch. This is really, really good work Casey. I see most
> of
> > > our
> > > >> > previous documentation issues addressed up front, e.g. special
> > > variables
> > > >> > are cited, all new fields explained, side effects documented. The
> > use
> > > >> case
> > > >> > doc brings it all together soup-to-nuts and I think all the pieces
> > > make
> > > >> > sense in a mostly self-contained way. I can't think of anything I
> > had
> > > to
> > > >> > sit and think about for more than a few seconds. I'll be making my
> > way
> > > >> > through your individual PR's in more detail, but my first
> > impressions
> > > >> are
> > > >> > that this is excellent.
> > > >> >
> > > >> > On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <
> > > >> > michael.miklav...@gmail.com> wrote:
> > > >> >
> > > >> >> I'm liking this design and growth strategy, Casey. I also think
> > Nick
> > > >> and
> > > >> >> Otto have some valid points. I always find there's a natural
> > tension
> > > >> >> between too little, just enough, and boiling the ocean and these
> > > >> discuss
> > > >> >> threads really help drive what the short and long term visions
> > > should
> > > >> look
> > > >> >> like.
> > > >> >>
> > > >> >> On the subject of repositories and strategies, I agree that
> > > pluggable
> > > >> >> repos and strategies for modifying them would be useful. For the
> > > first
> > > >> >> pass, I'd really like to see HDFS with the proposed set of
> Stellar
> > > >> >> functions. This gives us a lot of bang for our buck - we can
> > > >> capitalize on
> > > >> >> a set of powerful features around existence checking earlier
> > without
> > > >> having
> > > >> >> to worry about later interface changes impacting users. With the
> > > >> primary
> > > >> >> interface coming through the JSON config, we are building a nice
> > > >> facade
> > > >> >> that protects users from later implementation abstractions and
> > > >> >> improvements, all while providing a stable enough interface on
> > which
> > > >> we can
> > > >> >> develop UI features as desired. I'd be interested to hear more
> > about
> > > >> what
> > > >> >> features could be provided by a repository as time goes by.
> > > >> Federation,
> > > >> >> permissions, governance, metadata management, perhaps?
> > > >> >>
> > > >> >> I also had some concern over duplicating existing Unix features.
> I
> > > >> think
> > > >> >> where I'm at has been largely addressed by Casey's comments on 1)
> > > >> scaling,
> > > >> >> 2) multiple variables, and 3) portability to Hadoop. Providing 2
> > > >> approaches
> > > >> >> - 1 which is config-based and the other a composable set of
> > > functions
> > > >> gives
> > > >> >> us the ability to provide a core set of features that can later
> be
> > > >> easily
> > > >> >> expanded by users as the need arises. Here again I think the
> > > >> prescribed
> > > >> >> approach provides a strong first pass that we can then expand on
> > > >> without
> > > >> >> concern of future improvements becoming a hassle for end users.
> > > >> >>
> > > >> >> Best,
> > > >> >> Mike
> > > >> >>
> > > >> >> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
> > > >> >> si...@simonellistonball.com> wrote:
> > > >> >>
> > > >> >>> There is some really cool stuff happening here, if only I’d been
> > > >> allowed
> > > >> >>> to see the lists over Christmas... :)
> > > >> >>>
> > > >> >>> A few thoughts...
> > > >> >>>
> > > >> >>> I like Otto’s generalisation of the problem to include specific
> > > local
> > > >> >>> stellar objects in a cache loaded from a store (HDFS seems a
> > > >> natural, but
> > > >> >>> not only place, maybe even a web service / local microservicey
> > > object
> > > >> >>> provider!?) That said, I suspect that’s a good platform
> > > optimisation
> > > >> >>> approach. Should we look at this as a separate piece of work
> > given
> > > it
> > > >> >>> extends beyond the scope of the summarisation concept and
> > > ultimately
> > > >> use it
> > > >> >>> as a back-end to feed the summarising engine proposed here for
> > the
> > > >> >>> enrichment loader?
> > > >> >>>
> > > >> >>> On the more specific use case, one think I would comment on is
> > the
> > > >> >>> configuration approach. The iteration loop
> > > (state_{init|update|merge}
> > > >> >>> should be consistent with the way we handle things like the
> > > profiler
> > > >> >>> config, since it’s the same approach to data handling.
> > > >> >>>
> > > >> >>> The other thing that seems to have crept in here is the
> interface
> > > to
> > > >> >>> something like Spark, which again, I am really very very keen on
> > > >> seeing
> > > >> >>> happen. That said, not sure how that would happen in this
> > context,
> > > >> unless
> > > >> >>> you’re talking about pushing to something like livy for example
> > > >> (eminently
> > > >> >>> sensible for things like cross instance caching and faster
> > RPC-ish
> > > >> access
> > > >> >>> to an existing spark context which seem to be what Casey is
> > driving
> > > >> at with
> > > >> >>> the spark piece.
> > > >> >>>
> > > >> >>> To address the question of text manipulation in Stellar / metron
> > > >> >>> enrichment ingest etc, we already have this outside of the
> > context
> > > >> of the
> > > >> >>> issues here. I would argue that yes, we don’t want too many
> paths
> > > >> for this,
> > > >> >>> and that maybe our parser approach might be heavily related to
> > > >> text-based
> > > >> >>> ingest. I would say the scope worth dealing with here though is
> > not
> > > >> really
> > > >> >>> text manipulation, but summarisation, which is not well served
> by
> > > >> existing
> > > >> >>> CLI tools like awk / sed and friends.
> > > >> >>>
> > > >> >>> Simon
> > > >> >>>
> > > >> >>> > On 3 Jan 2018, at 15:48, Nick Allen <n...@nickallen.org>
> > wrote:
> > > >> >>> >
> > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa 1m,
> > so
> > > I
> > > >> >>> think
> > > >> >>> > this will impact performance
> > > >> >>> >
> > > >> >>> > What exactly takes an hour? Adding 1M entries to a bloom
> > filter?
> > > >> That
> > > >> >>> > seems really high, unless I am not understanding something.
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <
> > > ceste...@gmail.com>
> > > >> >>> wrote:
> > > >> >>> >
> > > >> >>> >> Thanks for the feedback, Nick.
> > > >> >>> >>
> > > >> >>> >> Regarding "IMHO, I'd rather not reinvent the wheel for text
> > > >> >>> manipulation."
> > > >> >>> >>
> > > >> >>> >> I would argue that we are not reinventing the wheel for text
> > > >> >>> manipulation
> > > >> >>> >> as the extractor config exists already and we are doing a
> > > similar
> > > >> >>> thing in
> > > >> >>> >> the flatfile loader (in fact, the code is reused and merely
> > > >> extended).
> > > >> >>> >> Transformation operations are already supported in our
> > codebase
> > > >> in the
> > > >> >>> >> extractor config, this PR has just added some hooks for
> > stateful
> > > >> >>> >> operations.
> > > >> >>> >>
> > > >> >>> >> Furthermore, we will need a configuration object to pass to
> > the
> > > >> REST
> > > >> >>> call
> > > >> >>> >> if we are ever to create a UI around importing data into
> hbase
> > > or
> > > >> >>> creating
> > > >> >>> >> these summary objects.
> > > >> >>> >>
> > > >> >>> >> Regarding your example:
> > > >> >>> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' |
> > stellar
> > > -i
> > > >> >>> >> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>
> > > >> >>> >> I'm very sympathetic to this type of extension, but it has
> > some
> > > >> issues:
> > > >> >>> >>
> > > >> >>> >> 1. This implies a single-threaded addition to the bloom
> > filter.
> > > >> >>> >> 1. Even with 5 threads, it takes an hour for the full alexa
> > 1m,
> > > >> >>> so I
> > > >> >>> >> think this will impact performance
> > > >> >>> >> 2. There's not a way to specify how to merge across threads
> if
> > > we
> > > >> >>> do
> > > >> >>> >> make a multithread command line option
> > > >> >>> >> 2. This restricts these kinds of operations to roles with
> > heavy
> > > >> unix
> > > >> >>> CLI
> > > >> >>> >> knowledge, which isn't often the types of people who would be
> > > >> doing
> > > >> >>> this
> > > >> >>> >> type of operation
> > > >> >>> >> 3. What if we need two variables passed to stellar?
> > > >> >>> >> 4. This approach will be harder to move to Hadoop. Eventually
> > we
> > > >> >>> will
> > > >> >>> >> want to support data on HDFS being processed by Hadoop
> > (similar
> > > to
> > > >> >>> >> flatfile
> > > >> >>> >> loader), so instead of -m LOCAL being passed for the flatfile
> > > >> >>> summarizer
> > > >> >>> >> you'd pass -m SPARK and the processing would happen on the
> > > cluster
> > > >> >>> >> 1. This is particularly relevant in this case as it's a
> > > >> >>> >> embarrassingly parallel problem in general
> > > >> >>> >>
> > > >> >>> >> In summary, while this a CLI approach is attractive, I prefer
> > > the
> > > >> >>> extractor
> > > >> >>> >> config solution because it is the solution with the smallest
> > > >> iteration
> > > >> >>> >> that:
> > > >> >>> >>
> > > >> >>> >> 1. Reuses existing metron extraction infrastructure
> > > >> >>> >> 2. Provides the most solid base for the extensions that will
> > be
> > > >> >>> sorely
> > > >> >>> >> needed soon (and will keep it in parity with the flatfile
> > > loader)
> > > >> >>> >> 3. Provides the most solid base for a future UI extension in
> > the
> > > >> >>> >> management UI to support both summarization and loading
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <
> > > n...@nickallen.org>
> > > >> >>> wrote:
> > > >> >>> >>
> > > >> >>> >>> First off, I really do like the typosquatting use case and a
> > > lot
> > > >> of
> > > >> >>> what
> > > >> >>> >>> you have described.
> > > >> >>> >>>
> > > >> >>> >>>> We need a way to generate the summary sketches from flat
> > data
> > > >> for
> > > >> >>> this
> > > >> >>> >> to
> > > >> >>> >>>> work.
> > > >> >>> >>>> ..
> > > >> >>> >>>>
> > > >> >>> >>>
> > > >> >>> >>> I took this quote directly from your use case. Above is the
> > > point
> > > >> >>> that
> > > >> >>> >> I'd
> > > >> >>> >>> like to discuss and what your proposed solutions center on.
> > > This
> > > >> is
> > > >> >>> >> what I
> > > >> >>> >>> think you are trying to do, at least with PR #879
> > > >> >>> >>> <https://github.com/apache/metron/pull/879>...
> > > >> >>> >>>
> > > >> >>> >>> (Q) Can we repurpose Stellar functions so that they can
> > operate
> > > >> on
> > > >> >>> text
> > > >> >>> >>> stored in a file system?
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>> Whether we use the (1) Configuration or the (2)
> > Function-based
> > > >> >>> approach
> > > >> >>> >>> that you described, fundamentally we are introducing new
> ways
> > > to
> > > >> >>> perform
> > > >> >>> >>> text manipulation inside of Stellar.
> > > >> >>> >>>
> > > >> >>> >>> IMHO, I'd rather not reinvent the wheel for text
> > manipulation.
> > > It
> > > >> >>> would
> > > >> >>> >> be
> > > >> >>> >>> painful to implement and maintain a bunch of Stellar
> > functions
> > > >> for
> > > >> >>> text
> > > >> >>> >>> manipulation. People already have a large number of tools
> > > >> available
> > > >> >>> to
> > > >> >>> >> do
> > > >> >>> >>> this and everyone has their favorites. People are resistant
> > to
> > > >> >>> learning
> > > >> >>> >>> something new when they already are familiar with another
> way
> > > to
> > > >> do
> > > >> >>> the
> > > >> >>> >>> same thing.
> > > >> >>> >>>
> > > >> >>> >>> So then the question is, how else can we do this? My
> > suggestion
> > > >> is
> > > >> >>> that
> > > >> >>> >>> rather than introducing text manipulation tools inside of
> > > >> Stellar, we
> > > >> >>> >> allow
> > > >> >>> >>> people to use the text manipulation tools they already know,
> > > but
> > > >> with
> > > >> >>> the
> > > >> >>> >>> Stellar functions that we already have. And the obvious way
> > to
> > > >> tie
> > > >> >>> those
> > > >> >>> >>> two things together is the Unix pipeline.
> > > >> >>> >>>
> > > >> >>> >>> A quick, albeit horribly incomplete, example to flesh this
> > out
> > > a
> > > >> bit
> > > >> >>> more
> > > >> >>> >>> based on the example you have in PR #879
> > > >> >>> >>> <https://github.com/apache/metron/pull/879>. This would
> > allow
> > > >> me to
> > > >> >>> >>> integrate Stellar with whatever external tools that I want.
> > > >> >>> >>>
> > > >> >>> >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' |
> > stellar
> > > >> -i
> > > >> >>> >>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <
> > > >> ceste...@gmail.com>
> > > >> >>> >> wrote:
> > > >> >>> >>>
> > > >> >>> >>>> I'll start this discussion off with my idea around a 2nd
> > step
> > > >> that is
> > > >> >>> >>> more
> > > >> >>> >>>> adaptable. I propose the following set of stellar functions
> > > >> backed
> > > >> >>> by
> > > >> >>> >>>> Spark in the metron-management project:
> > > >> >>> >>>>
> > > >> >>> >>>> - CSV_PARSE(location, separator?, columns?) : Constructs a
> > > Spark
> > > >> >>> >>>> Dataframe for reading the flatfile
> > > >> >>> >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms
> > > the
> > > >> >>> >>>> dataframe
> > > >> >>> >>>> - SUMMARIZE(state_init, state_update, state_merge):
> > Summarize
> > > >> the
> > > >> >>> >>>> dataframe using the lambda functions:
> > > >> >>> >>>> - state_init - executed once per worker to initialize the
> > > state
> > > >> >>> >>>> - state_update - executed once per row
> > > >> >>> >>>> - state_merge - Merge the worker states into one worker
> > state
> > > >> >>> >>>> - OBJECT_SAVE(obj, output_path) : Save the object obj to
> the
> > > >> path
> > > >> >>> >>>> output_path on HDFS.
> > > >> >>> >>>>
> > > >> >>> >>>> This would enable more flexibility and composibility than
> > the
> > > >> >>> >>>> configuration-based approach that we have in the flatfile
> > > >> loader.
> > > >> >>> >>>> My concern with this approach, and the reason I didn't do
> it
> > > >> >>> initially,
> > > >> >>> >>> was
> > > >> >>> >>>> that I think that users will want at least 2 ways to
> > summarize
> > > >> data
> > > >> >>> (or
> > > >> >>> >>>> load data):
> > > >> >>> >>>>
> > > >> >>> >>>> - A configuration based approach, which enables a UI
> > > >> >>> >>>> - A set of stellar functions via the scriptable REPL
> > > >> >>> >>>>
> > > >> >>> >>>> I would argue that both have a place and I started with the
> > > >> >>> >> configuration
> > > >> >>> >>>> based approach as it was a more natural extension of what
> we
> > > >> already
> > > >> >>> >> had.
> > > >> >>> >>>> I'd love to hear thoughts about this idea too.
> > > >> >>> >>>>
> > > >> >>> >>>>
> > > >> >>> >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <
> > > >> ceste...@gmail.com>
> > > >> >>> >>> wrote:
> > > >> >>> >>>>
> > > >> >>> >>>>> Hi all,
> > > >> >>> >>>>>
> > > >> >>> >>>>> I wanted to get some feedback on a sensible plan for
> > > >> something. It
> > > >> >>> >>>>> occurred to me the other day when considering the use-case
> > of
> > > >> >>> >> detecting
> > > >> >>> >>>>> typosquatted domains, that one approach was to generate
> the
> > > >> set of
> > > >> >>> >>>>> typosquatted domains for some set of reference domains and
> > > >> compare
> > > >> >>> >>>> domains
> > > >> >>> >>>>> as they flow through.
> > > >> >>> >>>>>
> > > >> >>> >>>>> One way we could do this would be to generate this data
> and
> > > >> import
> > > >> >>> >> the
> > > >> >>> >>>>> typosquatted domains into HBase. I thought, however, that
> > > >> another
> > > >> >>> >>>> approach
> > > >> >>> >>>>> which may trade-off accuracy to remove the network hop and
> > > >> potential
> > > >> >>> >>> disk
> > > >> >>> >>>>> seek by constructing a bloom filter that includes the set
> > of
> > > >> >>> >>> typosquatted
> > > >> >>> >>>>> domains.
> > > >> >>> >>>>>
> > > >> >>> >>>>> The challenge was that we don't have a way to do this
> > > >> currently. We
> > > >> >>> >>> do,
> > > >> >>> >>>>> however, have a loading infrastructure (e.g. the
> > > >> flatfile_loader)
> > > >> >>> and
> > > >> >>> >>>>> configuration (see https://github.com/apache/
> > > >> >>> >>> metron/tree/master/metron-
> > > >> >>> >>>>> platform/metron-data-management#common-extractor-
> > properties)
> > > >> which
> > > >> >>> >>>>> handles:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - parsing flat files
> > > >> >>> >>>>> - transforming the rows
> > > >> >>> >>>>> - filtering the rows
> > > >> >>> >>>>>
> > > >> >>> >>>>> To enable the new use-case of generating a summary object
> > > >> (e.g. a
> > > >> >>> >> bloom
> > > >> >>> >>>>> filter), in METRON-1378 (https://github.com/apache/met
> > > >> ron/pull/879)
> > > >> >>> >> I
> > > >> >>> >>>>> propose that we create a new utility that uses the same
> > > >> extractor
> > > >> >>> >>> config
> > > >> >>> >>>>> add the ability to:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - initialize a state object
> > > >> >>> >>>>> - update the object for every row
> > > >> >>> >>>>> - merge the state objects (in the case of multiple
> threads,
> > > in
> > > >> the
> > > >> >>> >>>>> case of one thread it's not needed).
> > > >> >>> >>>>>
> > > >> >>> >>>>> I think this is a sensible decision because:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - It's a minimal movement from the flat file loader
> > > >> >>> >>>>> - Uses the same configs
> > > >> >>> >>>>> - Abstracts and reuses the existing infrastructure
> > > >> >>> >>>>> - Having one extractor config means that it should be
> > easier
> > > to
> > > >> >>> >>>>> generate a UI around this to simplify the experience
> > > >> >>> >>>>>
> > > >> >>> >>>>> All that being said, our extractor config is..shall we
> > > >> >>> say...daunting
> > > >> >>> >>> :).
> > > >> >>> >>>>> I am sensitive to the fact that this adds to an existing
> > > >> difficult
> > > >> >>> >>>> config.
> > > >> >>> >>>>> I propose that this is an initial step forward to support
> > the
> > > >> >>> >> use-case
> > > >> >>> >>>> and
> > > >> >>> >>>>> we can enable something more composable going forward. My
> > > >> concern
> > > >> >>> in
> > > >> >>> >>>>> considering this as the first step was that it felt that
> > the
> > > >> >>> >> composable
> > > >> >>> >>>>> units for data transformation and manipulation suddenly
> > takes
> > > >> us
> > > >> >>> >> into a
> > > >> >>> >>>>> place where Stellar starts to look like Pig or Spark RDD
> > > API. I
> > > >> >>> >> wasn't
> > > >> >>> >>>>> ready for that without a lot more discussion.
> > > >> >>> >>>>>
> > > >> >>> >>>>> To summarize, what I'd like to get from the community is,
> > > after
> > > >> >>> >>> reviewing
> > > >> >>> >>>>> the entire use-case at https://github.com/cestella/
> > > >> >>> >>>> incubator-metron/tree/
> > > >> >>> >>>>> typosquat_merge/use-cases/typosquat_detection:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - Is this so confusing that it does not belong in Metron
> > even
> > > >> as a
> > > >> >>> >>>>> first-step?
> > > >> >>> >>>>> - Is there a way to extend the extractor config in a less
> > > >> >>> >> confusing
> > > >> >>> >>>>> way to enable this?
> > > >> >>> >>>>>
> > > >> >>> >>>>> I apologize for making the discuss thread *after* the
> > JIRAs,
> > > >> but I
> > > >> >>> >> felt
> > > >> >>> >>>>> this one might bear having some working code to consider.
> > > >> >>> >>>>>
> > > >> >>> >>>>
> > > >> >>> >>>
> > > >> >>> >>
> > > >>
> > > >> -------------------
> > > >> Thank you,
> > > >>
> > > >> James Sirota
> > > >> PMC- Apache Metron
> > > >> jsirota AT apache DOT org
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Generating and Interacting with serialized summary objects

Reply via email to