Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Michael Miklavcic
Any volunteers for creating a set of JIRAs and a feature branch for an object
store repository? This sounds like a massive feature.


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
Yes, abstracted.

We have an API of Stellar functions that just load things from the store;
they don’t need to bleed through what the store is.
We have a ‘store’, which may be HDFS or HBase or whatever.
We have an API for adding to the store (add, etc.) that doesn’t presume the
store either.
Then we can have whatever long- or short-term, hard-to-configure thing we can
imagine pushing to the store.
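
To make that concrete, a minimal sketch of the shape such an API might take.
OBJECT_GET is the function under discussion elsewhere in this thread;
OBJECT_PUT and the object name are purely hypothetical:

OBJECT_GET('alexa_10k_filter')        -- load an object; which store backs it doesn't leak through
OBJECT_PUT('alexa_10k_filter', obj)   -- add to the store, again without presuming HDFS vs. HBase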





Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
I would say that at the Stellar author level, you would just get objects
from the store, and the ‘override’ case would be a follow-on for edge cases.



Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Casey Stella
Well, you can pull the default configs from global configs, but you might
want to override them (similar to the profiler).  For instance, you might
want to interact with a different HBase table from the one globally configured.


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Michael Miklavcic
I'm not sure I follow what you're saying as it pertains to summary objects.
Repository is a loaded term, and I'm very apprehensive of pushing for
something potentially very complex where a simpler solution would suffice
in the short term. To wit, the items I'm seeing in this use case doc -
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection
- don't preclude the four capabilities you've enumerated. Am I missing
something, or can you provide more context? My best guess is that rather
than referring to a specific HDFS path for a serialized object, you're
suggesting we provide a more abstract method for serializing/deserializing
objects to/from a variety of sources. Am I in the ballpark? I'd be in favor
of expanding functionality for such a thing provided a sensible default
(i.e., HDFS) is provided in the short term.


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
I would imagine the ‘stellar-object-repo’ would be part of the global
configuration, or configuration passed to the command. Why specify it in the
function itself?
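
For illustration, a sketch of what that might look like in the global
config. The property names here are hypothetical; the HBase settings are
borrowed from Casey's example elsewhere in this thread:

{
  "stellar.object.repo" : "HBASE",
  "stellar.object.repo.config" : { "hbase.table" : "table", "hbase.cf" : "cf" }
}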





Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Casey Stella
I like that, specifically the repositories abstraction.  Perhaps we can
construct some longer-term JIRAs for extensions.
For the current state of affairs (with regard to the OBJECT_GET call) I was
imagining the simple default HDFS solution as a first cut, following on by
adding a repository name (e.g. OBJECT_GET(path, repo_name)) with repo_name
being optional and defaulting to HDFS for backwards compatibility.

In effect, the next step I'm proposing is OBJECT_GET(paths, repo_name,
repo_config), which would be backwards compatible:

   - paths - a single path or a list of paths (if a list, then a list of
   objects is returned)
   - repo_name - optional name for the repo, defaulting to HDFS if we don't
   specify it
   - repo_config - optional config map


This would open up things like:

   - OBJECT_GET('key', 'HBASE', { 'hbase.table' : 'table', 'hbase.cf' :
   'cf'} ) -- pulling from HBase

Eventually we might also be able to fold ENRICHMENT_GET in as just a special
repo instance.
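
For concreteness, a sketch of the three backwards-compatible call forms this
would give Stellar authors (the paths here are illustrative):

OBJECT_GET('/apps/metron/objects/summary.ser')
    -- today's behavior: a bare path, read from the default HDFS repo
OBJECT_GET('/apps/metron/objects/summary.ser', 'HDFS')
    -- the repo name made explicit, same result
OBJECT_GET('key', 'HBASE', { 'hbase.table' : 'table', 'hbase.cf' : 'cf' })
    -- a repo-specific config map, pulling from HBase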


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
If we separate the concerns as I have stated previously:

1. Stellar can load objects into ‘caches’ from some repository and refer to
them.
2. The repositories.
3. Some number of strategies to populate and possibly update the repository,
from Spark to MR jobs to whatever you would classify the flat-file stuff as.
4. Let the Stellar API for everything but LOAD() follow after we get usage.

Then the particulars of ‘3’ are less important.



Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Justin Leet
I agree with the general sentiment that we can tailor specific use cases
via UI, and I'm worried that the use-case-specific solution (particularly
in light of the note that it's not even general to the class of bloom
filter problems, let alone an actually general problem) becomes more work
than this as soon as about two more use cases actually get realized.
Pushing that to the UI lets people solve a variety of problems if they
really want to dig in, while still giving flexibility to provide a more
tailored experience for what we discover the 80% cases are in practice.

Keeping in mind I am mostly unfamiliar with the extractor config itself, I
am wondering if it makes sense to split up the config a bit.  While a lot
of implementation details are shared, maybe the extractor config itself
should be refactored into a couple of parts analogous to ETL (as a follow-on
task; I think if this is true, it predates Casey's proposed change).  It
doesn't necessarily make it less complex, but it might make it more easily
digestible if it's split up by idea (parsing, transformation, etc.), as
sketched below.
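
Purely as an illustration of that split, here is Casey's entropy-summarizer
config from this thread refactored into ETL-ish sections.  The section names
are invented for the sake of the example, not a proposed schema:

{
  "parse" : {
    "extractor" : "CSV",
    "separator" : ",",
    "columns" : { "rank" : 0, "domain" : 1 }
  },
  "transform" : {
    "value_transform" : { "domain" : "DOMAIN_REMOVE_TLD(domain)" },
    "value_filter" : "LENGTH(domain) > 0"
  },
  "summarize" : {
    "state_init" : "STATS_INIT()",
    "state_update" : { "state" : "STATS_ADD(state, STRING_ENTROPY(domain))" },
    "state_merge" : "STATS_MERGE(states)"
  }
}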

Re: Mike's point, I don't think we want the actual processing broken up as
ETL, but the representation to the user in terms of configuration could be
similar (since we're already doing parsing and transformation). We don't
have to implement it as an ETL pipeline, but it does potentially offer the
user a way to quickly grasp what the JSON blob is actually specifying.
Making it easy to understand, even if it's not the ideal way to interact,
is potentially still a win.


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-04 Thread Michael Miklavcic
I mentioned this earlier, but I'll reiterate that I think this approach
gives us the ability to target specific use cases via a UI, or other
interface should we choose to add one, while keeping the core adaptable and
flexible. This is ideal for the middle tier, as I think this effectively
gives us the ability to pivot to other use cases very easily while not being
so generic as to be useless. The fact that you were able to create this as
quickly as you did seems to me directly related to the fact that we made the
decision to keep the loader somewhat flexible rather than very specific.
The operation ordering and state carry-over from one phase of processing to
the next would simply have been inscrutable, if not impossible, with a
CLI-option-only approach. Sure, it's not as simple as "put infile.txt
outfile.txt", but the alternatives are not that clear either. One might
argue we could split up the processing pieces as in traditional Hadoop, e.g.
ETL: Sqoop ingest -> HDFS -> MapReduce, Pig, Hive, or Spark transform. But
quite frankly that's going in the *opposite* direction I think we want
here. That's more complex in terms of moving parts. The config approach
with pluggable Stellar insulates users from specific implementations, but
also gives you the ability to pass lower-level constructs, e.g. Spark SQL
or HiveQL, should the need arise.

In summary, my impression is that at this point the features and level of
abstraction feel appropriate. I think it buys us 1) learning from a
starting typosquatting use case, 2) flexibility to change and adapt it
without affecting users, and 3) enough concrete capability to make more
specific use cases easy to deliver with a UI.

Cheers,
Mike


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-04 Thread Casey Stella
It also occurs to me that even in this situation, it's not a sufficient
generalization for just Bloom: this is a bloom filter of the output of all
the typosquatted domains for the domain in each row.  If we wanted to
hard-code, we'd have to hard-code specifically the bloom filter *for* the
typosquatting use case.  Hard-coding this would prevent things like bloom
filters containing malicious IPs from a reference source, for instance.


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-04 Thread Casey Stella
So, there is value outside of just bloom usage.  The most specific example
of this is that in order to configure a bloom filter, we need to know at
least an upper bound on the number of items that are going to be added to
it.  In order to do that, we need to count the number of typosquatted
domains.  Specifically, at
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#configure-the-bloom-filter
you can see how we use the CONSOLE writer with an extractor config to count
the number of typosquatted domains in the Alexa top 10k dataset so we can
size the filter appropriately.
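
A hedged sketch of what that counting pass looks like in extractor-config
terms.  DOMAIN_TYPOSQUAT and the exact expressions here are illustrative;
the use-case doc above has the real config:

{
  "config" : {
    "columns" : { "rank" : 0, "domain" : 1 },
    "value_transform" : { "domain" : "DOMAIN_REMOVE_TLD(domain)" },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "0L",
    "state_update" : { "state" : "state + LENGTH(DOMAIN_TYPOSQUAT(domain))" },
    "state_merge" : "REDUCE(states, (s, x) -> s + x, 0L)",
    "separator" : ","
  },
  "extractor" : "CSV"
}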

I'd argue that other types of probabilistic data structures could also make
sense here, like statistical sketches.  Consider, for instance, a cheap and
dirty DGA indicator where we take the Alexa top 1M and look at the
distribution of Shannon entropy in the domains.  If the Shannon entropy of a
domain going across Metron is more than 5 standard deviations from the mean,
that could be circumstantial evidence of a malicious attack.  This would
yield a lot of false positives, but used in conjunction with other
indicators it could be valuable.

Computing that would be as follows:

{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "STATS_INIT()",
    "state_update" : {
      "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
    },
    "state_merge" : "STATS_MERGE(states)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
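
Once that summary object is serialized, the consuming side might look
something like this in the REPL (a sketch only: OBJECT_GET is the function
proposed in this thread, the path is illustrative, and it assumes the
existing STATS_MEAN, STATS_SD and ABS Stellar functions):

stats := OBJECT_GET('/apps/metron/objects/alexa_entropy.ser')
is_dga_suspect := ABS(STRING_ENTROPY(domain) - STATS_MEAN(stats)) > 5 * STATS_SD(stats)

The 5 here is just the "5 standard deviations" threshold mentioned above.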

Also, for another example, imagine a situation where we have a SPARK_SQL
engine rather than just LOCAL for summarizing.  We could create a general
summary of URL lengths in bro data, which could be used to determine whether
someone is trying to send in very large URLs maliciously (see Jon Zeolla's
concerns in https://issues.apache.org/jira/browse/METRON-517 for a
discussion of this).  In order to do that, we could simply execute:

$METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" -o
/tmp/reference/bro_uri_distribution.ser -e ~/uri_length_extractor.json
-p 5 -om HDFS -m SPARK_SQL

with uri_length_extractor.json containing:

{
  "config" : {
    "value_filter" : "LENGTH(uri) > 0",
    "state_init" : "STATS_INIT()",
    "state_update" : {
      "state" : "STATS_ADD(state, LENGTH(uri))"
    },
    "state_merge" : "STATS_MERGE(states)",
    "separator" : ","
  },
  "extractor" : "SQL_ROW"
}
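
As with the entropy example, the serialized distribution could then back a
simple outlier check at enrichment time (again a sketch: OBJECT_GET is the
proposed function and the threshold is arbitrary):

uri_stats := OBJECT_GET('/tmp/reference/bro_uri_distribution.ser')
is_large_uri := LENGTH(uri) > STATS_MEAN(uri_stats) + 5 * STATS_SD(uri_stats)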


Regarding the value filter, that already exists in the extractor config
because of the need to transform data in the flatfile loader.  While I
definitely see the desire to use unix tools to prep data, there are some
things that aren't as easy to do.  For instance, removing the TLD of a
domain is not a trivial task in a shell script, and we have existing
functions for that in Stellar.  I would see people using both.

To address the issue of a more targeted experience for bloom filters, I
think that sort of specialization is best left to the UI layer.  Having a
more complete and expressive backend reused across specific UIs seems to be
the best of all worlds.  It allows power users to drop down and do more
complex things while still providing a (mostly) code-free and targeted
experience for other users.  It seems to me that limiting the expressiveness
of the backend isn't the right way to go, since this work just fits in with
our existing engine.


On Thu, Jan 4, 2018 at 1:40 AM, James Sirota  wrote:

> I just went through these pull requests as well and also agree this is
> good work.  I think it's a good first pass.  I would be careful with trying
> to boil the ocean here.  I think for the initial use case I would only
> support loading the bloom filters from HDFS.  If people want to pre-process
> the CSV file of domains using awk or sed this should be out of scope of
> this work.  It's easy enough to do out of band and I would not include any
> of these functions at all.   I also think that the config could be
> considerably simplified.  I think value_filter should be removed (since I
> believe that preprocessing should be done by the user outside of this
> process).  I also have a question about the init, update, and merge
> configurations.  Would I ever initialize to anything but an empty bloom
> filter?  For the state update would I ever do anything other than add to
> the bloom filter?  For the state merge would I ever do anything other than
> merge the states?  If the answer to these is 'no', then this should simply
> be hard coded and not externalized into config values.
>
> 03.01.2018, 14:20, "Michael Miklavcic" :
> > I just finished stepping through the typosquatting use case README in
> your
> > merge branch. This is really, really good work Casey. I see most of our
> > previous documentation issues addressed up front, e.g. special variables
> > are cited, all new fiel

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread James Sirota
I just went through these pull requests as well and also agree this is good 
work.  I think it's a good first pass.  I would be careful with trying to boil 
the ocean here.  I think for the initial use case I would only support loading 
the bloom filters from HDFS.  If people want to pre-process the CSV file of 
domains using awk or sed this should be out of scope of this work.  It's easy 
enough to do out of band and I would not include any of these functions at all. 
  I also think that the config could be considerably simplified.  I think 
value_filter should be removed (since I believe that preprocessing should be 
done by the user outside of this process).  I also have a question about the 
init, update, and merge configurations.  Would I ever initialize to anything 
but an empty bloom filter?  For the state update would I ever do anything other 
than add to the bloom filter?  For the state merge would I ever do anything 
other than merge the states?  If the answer to these is 'no', then this should 
simply be hard coded and not externalized into config values. 

03.01.2018, 14:20, "Michael Miklavcic" :
> I just finished stepping through the typosquatting use case README in your
> merge branch. This is really, really good work Casey. I see most of our
> previous documentation issues addressed up front, e.g. special variables
> are cited, all new fields explained, side effects documented. The use case
> doc brings it all together soup-to-nuts and I think all the pieces make
> sense in a mostly self-contained way. I can't think of anything I had to
> sit and think about for more than a few seconds. I'll be making my way
> through your individual PR's in more detail, but my first impressions are
> that this is excellent.
>
> On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>>  I'm liking this design and growth strategy, Casey. I also think Nick and
>>  Otto have some valid points. I always find there's a natural tension
>>  between too little, just enough, and boiling the ocean and these discuss
>>  threads really help drive what the short and long term visions should look
>>  like.
>>
>>  On the subject of repositories and strategies, I agree that pluggable
>>  repos and strategies for modifying them would be useful. For the first
>>  pass, I'd really like to see HDFS with the proposed set of Stellar
>>  functions. This gives us a lot of bang for our buck - we can capitalize on
>>  a set of powerful features around existence checking earlier without having
>>  to worry about later interface changes impacting users. With the primary
>>  interface coming through the JSON config, we are building a nice facade
>>  that protects users from later implementation abstractions and
>>  improvements, all while providing a stable enough interface on which we can
>>  develop UI features as desired. I'd be interested to hear more about what
>>  features could be provided by a repository as time goes by. Federation,
>>  permissions, governance, metadata management, perhaps?
>>
>>  I also had some concern over duplicating existing Unix features. I think
>>  where I'm at has been largely addressed by Casey's comments on 1) scaling,
>>  2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches
>>  - 1 which is config-based and the other a composable set of functions gives
>>  us the ability to provide a core set of features that can later be easily
>>  expanded by users as the need arises. Here again I think the prescribed
>>  approach provides a strong first pass that we can then expand on without
>>  concern of future improvements becoming a hassle for end users.
>>
>>  Best,
>>  Mike
>>
>>  On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
>>  si...@simonellistonball.com> wrote:
>>
>>>  There is some really cool stuff happening here, if only I’d been allowed
>>>  to see the lists over Christmas... :)
>>>
>>>  A few thoughts...
>>>
>>>  I like Otto’s generalisation of the problem to include specific local
>>>  stellar objects in a cache loaded from a store (HDFS seems a natural, but
>>>  not only place, maybe even a web service / local microservicey object
>>>  provider!?) That said, I suspect that’s a good platform optimisation
>>>  approach. Should we look at this as a separate piece of work given it
>>>  extends beyond the scope of the summarisation concept and ultimately use it
>>>  as a back-end to feed the summarising engine proposed here for the
>>>  enrichment loader?
>>>
>>>  On the more specific use case, one thing I would comment on is the
>>>  configuration approach. The iteration loop (state_{init|update|merge})
>>>  should be consistent with the way we handle things like the profiler
>>>  config, since it’s the same approach to data handling.
>>>
>>>  The other thing that seems to have crept in here is the interface to
>>>  something like Spark, which again, I am really very very keen on seeing
>>>  happen. That said, not sure how that woul

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Michael Miklavcic
I just finished stepping through the typosquatting use case README in your
merge branch. This is really, really good work Casey. I see most of our
previous documentation issues addressed up front, e.g. special variables
are cited, all new fields explained, side effects documented. The use case
doc brings it all together soup-to-nuts and I think all the pieces make
sense in a mostly self-contained way. I can't think of anything I had to
sit and think about for more than a few seconds. I'll be making my way
through your individual PR's in more detail, but my first impressions are
that this is excellent.

On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> I'm liking this design and growth strategy, Casey. I also think Nick and
> Otto have some valid points. I always find there's a natural tension
> between too little, just enough, and boiling the ocean and these discuss
> threads really help drive what the short and long term visions should look
> like.
>
> On the subject of repositories and strategies, I agree that pluggable
> repos and strategies for modifying them would be useful. For the first
> pass, I'd really like to see HDFS with the proposed set of Stellar
> functions. This gives us a lot of bang for our buck - we can capitalize on
> a set of powerful features around existence checking earlier without having
> to worry about later interface changes impacting users. With the primary
> interface coming through the JSON config, we are building a nice facade
> that protects users from later implementation abstractions and
> improvements, all while providing a stable enough interface on which we can
> develop UI features as desired. I'd be interested to hear more about what
> features could be provided by a repository as time goes by. Federation,
> permissions, governance, metadata management, perhaps?
>
> I also had some concern over duplicating existing Unix features. I think
> where I'm at has been largely addressed by Casey's comments on 1) scaling,
> 2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches
> - 1 which is config-based and the other a composable set of functions gives
> us the ability to provide a core set of features that can later be easily
> expanded by users as the need arises. Here again I think the prescribed
> approach provides a strong first pass that we can then expand on without
> concern of future improvements becoming a hassle for end users.
>
> Best,
> Mike
>
> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
>> There is some really cool stuff happening here, if only I’d been allowed
>> to see the lists over Christmas... :)
>>
>> A few thoughts...
>>
>> I like Otto’s generalisation of the problem to include specific local
>> stellar objects in a cache loaded from a store (HDFS seems a natural, but
>> not only place, maybe even a web service / local microservicey object
>> provider!?) That said, I suspect that’s a good platform optimisation
>> approach. Should we look at this as a separate piece of work given it
>> extends beyond the scope of the summarisation concept and ultimately use it
>> as a back-end to feed the summarising engine proposed here for the
>> enrichment loader?
>>
>> On the more specific use case, one thing I would comment on is the
>> configuration approach. The iteration loop (state_{init|update|merge})
>> should be consistent with the way we handle things like the profiler
>> config, since it’s the same approach to data handling.
>>
>> The other thing that seems to have crept in here is the interface to
>> something like Spark, which again, I am really very very keen on seeing
>> happen. That said, not sure how that would happen in this context, unless
>> you’re talking about pushing to something like livy for example (eminently
>> sensible for things like cross instance caching and faster RPC-ish access
>> to an existing spark context), which seems to be what Casey is driving at
>> with the spark piece.
>>
>> To address the question of text manipulation in Stellar / metron
>> enrichment ingest etc, we already have this outside of the context of the
>> issues here. I would argue that yes, we don’t want too many paths for this,
>> and that maybe our parser approach might be heavily related to text-based
>> ingest. I would say the scope worth dealing with here though is not really
>> text manipulation, but summarisation, which is not well served by existing
>> CLI tools like awk / sed and friends.
>>
>> Simon
>>
>> > On 3 Jan 2018, at 15:48, Nick Allen  wrote:
>> >
>> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I
>> think
>> > this will impact performance
>> >
>> > What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
>> > seems really high, unless I am not understanding something.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella 
>> wrote:
>> >
>> >> Thanks for the feedback, Nick.

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Michael Miklavcic
I'm liking this design and growth strategy, Casey. I also think Nick and
Otto have some valid points. I always find there's a natural tension
between too little, just enough, and boiling the ocean and these discuss
threads really help drive what the short and long term visions should look
like.

On the subject of repositories and strategies, I agree that pluggable repos
and strategies for modifying them would be useful. For the first pass, I'd
really like to see HDFS with the proposed set of Stellar functions. This
gives us a lot of bang for our buck - we can capitalize on a set of
powerful features around existence checking earlier without having to worry
about later interface changes impacting users. With the primary interface
coming through the JSON config, we are building a nice facade that protects
users from later implementation abstractions and improvements, all while
providing a stable enough interface on which we can develop UI features as
desired. I'd be interested to hear more about what features could be
provided by a repository as time goes by. Federation, permissions,
governance, metadata management, perhaps?

I also had some concern over duplicating existing Unix features. I think
where I'm at has been largely addressed by Casey's comments on 1) scaling,
2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches
- 1 which is config-based and the other a composable set of functions gives
us the ability to provide a core set of features that can later be easily
expanded by users as the need arises. Here again I think the prescribed
approach provides a strong first pass that we can then expand on without
concern of future improvements becoming a hassle for end users.

Best,
Mike

On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> There is some really cool stuff happening here, if only I’d been allowed
> to see the lists over Christmas... :)
>
> A few thoughts...
>
> I like Otto’s generalisation of the problem to include specific local
> stellar objects in a cache loaded from a store (HDFS seems a natural, but
> not only place, maybe even a web service / local microservicey object
> provider!?) That said, I suspect that’s a good platform optimisation
> approach. Should we look at this as a separate piece of work given it
> extends beyond the scope of the summarisation concept and ultimately use it
> as a back-end to feed the summarising engine proposed here for the
> enrichment loader?
>
> On the more specific use case, one thing I would comment on is the
> configuration approach. The iteration loop (state_{init|update|merge})
> should be consistent with the way we handle things like the profiler
> config, since it’s the same approach to data handling.
>
> The other thing that seems to have crept in here is the interface to
> something like Spark, which again, I am really very very keen on seeing
> happen. That said, not sure how that would happen in this context, unless
> you’re talking about pushing to something like livy for example (eminently
> sensible for things like cross instance caching and faster RPC-ish access
> to an existing spark context), which seems to be what Casey is driving at
> with the spark piece.
>
> To address the question of text manipulation in Stellar / metron
> enrichment ingest etc, we already have this outside of the context of the
> issues here. I would argue that yes, we don’t want too many paths for this,
> and that maybe our parser approach might be heavily related to text-based
> ingest. I would say the scope worth dealing with here though is not really
> text manipulation, but summarisation, which is not well served by existing
> CLI tools like awk / sed and friends.
>
> Simon
>
> > On 3 Jan 2018, at 15:48, Nick Allen  wrote:
> >
> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
> > this will impact performance
> >
> > What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> > seems really high, unless I am not understanding something.
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella 
> wrote:
> >
> >> Thanks for the feedback, Nick.
> >>
> >> Regarding "IMHO, I'd rather not reinvent the wheel for text
> manipulation."
> >>
> >> I would argue that we are not reinventing the wheel for text
> manipulation
> >> as the extractor config exists already and we are doing a similar thing
> in
> >> the flatfile loader (in fact, the code is reused and merely extended).
> >> Transformation operations are already supported in our codebase in the
> >> extractor config, this PR has just added some hooks for stateful
> >> operations.
> >>
> >> Furthermore, we will need a configuration object to pass to the REST
> call
> >> if we are ever to create a UI around importing data into hbase or
> creating
> >> these summary objects.
> >>
> >> Regarding your example:
> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> >> 'DOMAIN_REMOVE_TLD

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Simon Elliston Ball
There is some really cool stuff happening here, if only I’d been allowed to see 
the lists over Christmas... :)

A few thoughts...

I like Otto’s generalisation of the problem to include specific local stellar 
objects in a cache loaded from a store (HDFS seems a natural, but not only 
place, maybe even a web service / local microservicey object provider!?) That 
said, I suspect that’s a good platform optimisation approach. Should we look at 
this as a separate piece of work given it extends beyond the scope of the 
summarisation concept and ultimately use it as a back-end to feed the 
summarising engine proposed here for the enrichment loader?

On the more specific use case, one thing I would comment on is the
configuration approach. The iteration loop (state_{init|update|merge}) should
be consistent with the way we handle things like the profiler config, since
it’s the same approach to data handling.

The other thing that seems to have crept in here is the interface to something 
like Spark, which again, I am really very very keen on seeing happen. That 
said, not sure how that would happen in this context, unless you’re talking 
about pushing to something like livy for example (eminently sensible for things 
like cross instance caching and faster RPC-ish access to an existing spark 
context), which seems to be what Casey is driving at with the spark piece.

To address the question of text manipulation in Stellar / metron enrichment 
ingest etc, we already have this outside of the context of the issues here. I 
would argue that yes, we don’t want too many paths for this, and that maybe our 
parser approach might be heavily related to text-based ingest. I would say the 
scope worth dealing with here though is not really text manipulation, but 
summarisation, which is not well served by existing CLI tools like awk / sed 
and friends.

Simon

> On 3 Jan 2018, at 15:48, Nick Allen  wrote:
> 
>> Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
> this will impact performance
> 
> What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> seems really high, unless I am not understanding something.
> 
> 
> 
> 
> 
> 
> On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella  wrote:
> 
>> Thanks for the feedback, Nick.
>> 
>> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>> 
>> I would argue that we are not reinventing the wheel for text manipulation
>> as the extractor config exists already and we are doing a similar thing in
>> the flatfile loader (in fact, the code is reused and merely extended).
>> Transformation operations are already supported in our codebase in the
>> extractor config, this PR has just added some hooks for stateful
>> operations.
>> 
>> Furthermore, we will need a configuration object to pass to the REST call
>> if we are ever to create a UI around importing data into hbase or creating
>> these summary objects.
>> 
>> Regarding your example:
>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>> 
>> I'm very sympathetic to this type of extension, but it has some issues:
>> 
>>   1. This implies a single-threaded addition to the bloom filter.
>>  1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I
>>  think this will impact performance
>>  2. There's not a way to specify how to merge across threads if we do
>>  make a multithread command line option
>>   2. This restricts these kinds of operations to roles with heavy unix CLI
>>   knowledge, which isn't often the types of people who would be doing this
>>   type of operation
>>   3. What if we need two variables passed to stellar?
>>   4. This approach will be harder to move to Hadoop.  Eventually we will
>>   want to support data on HDFS being processed by Hadoop (similar to
>>   flatfile loader), so instead of -m LOCAL being passed for the flatfile
>>   summarizer you'd pass -m SPARK and the processing would happen on the
>>   cluster
>>  1. This is particularly relevant in this case as it's an
>>  embarrassingly parallel problem in general
>> 
>> In summary, while this CLI approach is attractive, I prefer the extractor
>> config solution because it is the solution with the smallest iteration
>> that:
>> 
>>   1. Reuses existing metron extraction infrastructure
>>   2. Provides the most solid base for the extensions that will be sorely
>>   needed soon (and will keep it in parity with the flatfile loader)
>>   3. Provides the most solid base for a future UI extension in the
>>   management UI to support both summarization and loading
>> 
>> 
>> 
>> 
>> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:
>> 
>>> First off, I really do like the typosquatting use case and a lot of what
>>> you have described.
>>> 
 We need a way to generate the summary sketches from flat data for this
>> to
 work.
 ​..​
 
>>> 
>>> I took this quote directly from your use cas

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Nick Allen
Oh, gotcha.  That makes sense.  Thanks for clarifying.

On Wed, Jan 3, 2018 at 12:15 PM, Casey Stella  wrote:

> It's actually many more than 1M.  There are 1M domains, each domain could
> have upwards of 300 - 1000 possible typosquatted domains.
>
> You will notice from
> https://github.com/cestella/incubator-metron/tree/
> typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter
> that we are not adding the domain to the bloom filter, we're adding each
> domain generated from DOMAIN_TYPOSQUAT to the bloom filter.  In fact, we
> would very specifically NOT want the base domain as that would not be an
> indication of typosquatting (going to google.com would be legit, going to
> goggle.com would not).
>
>
>
> On Wed, Jan 3, 2018 at 10:48 AM, Nick Allen  wrote:
>
> > > Even with 5 threads, it takes an hour for the full Alexa 1m, so I
> think
> > this will impact performance
> >
> > What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> > seems really high, unless I am not understanding something.
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella 
> wrote:
> >
> > > Thanks for the feedback, Nick.
> > >
> > > Regarding "IMHO, I'd rather not reinvent the wheel for text
> > manipulation."
> > >
> > > I would argue that we are not reinventing the wheel for text
> manipulation
> > > as the extractor config exists already and we are doing a similar thing
> > in
> > > the flatfile loader (in fact, the code is reused and merely extended).
> > > Transformation operations are already supported in our codebase in the
> > > extractor config, this PR has just added some hooks for stateful
> > > operations.
> > >
> > > Furthermore, we will need a configuration object to pass to the REST
> call
> > > if we are ever to create a UI around importing data into hbase or
> > creating
> > > these summary objects.
> > >
> > > Regarding your example:
> > > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > >
> > > I'm very sympathetic to this type of extension, but it has some issues:
> > >
> > >1. This implies a single-threaded addition to the bloom filter.
> > >   1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I
> > >   think this will impact performance
> > >   2. There's not a way to specify how to merge across threads if we do
> > >   make a multithread command line option
> > >2. This restricts these kinds of operations to roles with heavy unix
> > >CLI knowledge, which isn't often the types of people who would be doing
> > >this type of operation
> > >3. What if we need two variables passed to stellar?
> > >4. This approach will be harder to move to Hadoop.  Eventually we will
> > >want to support data on HDFS being processed by Hadoop (similar to
> > >flatfile loader), so instead of -m LOCAL being passed for the flatfile
> > >summarizer you'd pass -m SPARK and the processing would happen on the
> > >cluster
> > >   1. This is particularly relevant in this case as it's an
> > >   embarrassingly parallel problem in general
> > >
> > > In summary, while this CLI approach is attractive, I prefer the
> > > extractor config solution because it is the solution with the smallest
> > > iteration that:
> > >
> > >1. Reuses existing metron extraction infrastructure
> > >2. Provides the most solid base for the extensions that will be sorely
> > >needed soon (and will keep it in parity with the flatfile loader)
> > >3. Provides the most solid base for a future UI extension in the
> > >management UI to support both summarization and loading
> > >
> > >
> > >
> > >
> > > On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen 
> wrote:
> > >
> > > > First off, I really do like the typosquatting use case and a lot of
> > what
> > > > you have described.
> > > >
> > > > > We need a way to generate the summary sketches from flat data for
> > this
> > > to
> > > > > work.
> > > > > ​..​
> > > > >
> > > >
> > > > I took this quote directly from your use case.  Above is the point
> that
> > > I'd
> > > > like to discuss and what your proposed solutions center on.  This is
> > > what I
> > > > think you are trying to do, at least with PR #879
> > > > ...
> > > >
> > > > (Q) Can we repurpose Stellar functions so that they can operate on
> text
> > > > stored in a file system?
> > > >
> > > >
> > > > Whether we use the (1) Configuration or the (2) Function-based
> approach
> > > > that you described, fundamentally we are introducing new ways to
> > perform
> > > > text manipulation inside of Stellar.
> > > >
> > > > IMHO, I'd rather not reinvent the wheel for text manipulation.  It
> > would
> > > be
> > > > painful to implement and maintain a bunch of Stellar functions for
> text
> > > > manipulation.  People already have a large number of tools availabl

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Casey Stella
It's actually many more than 1M.  There are 1M domains, each domain could
have upwards of 300 - 1000 possible typosquatted domains.

You will notice from
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter
that we are not adding the domain to the bloom filter, we're adding each
domain generated from DOMAIN_TYPOSQUAT to the bloom filter.  In fact, we
would very specifically NOT want the base domain as that would not be an
indication of typosquatting (going to google.com would be legit, going to
goggle.com would not).
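
For reference, the state update driving that is essentially the Stellar
expression below, evaluated once per input domain (paraphrased from the
linked README; see it for the exact extractor config):

BLOOM_ADD(state, DOMAIN_TYPOSQUAT(domain))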



On Wed, Jan 3, 2018 at 10:48 AM, Nick Allen  wrote:

> > Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
> this will impact performance
>
> What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> seems really high, unless I am not understanding something.
>
>
>
>
>
>
> On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella  wrote:
>
> > Thanks for the feedback, Nick.
> >
> > Regarding "IMHO, I'd rather not reinvent the wheel for text
> manipulation."
> >
> > I would argue that we are not reinventing the wheel for text manipulation
> > as the extractor config exists already and we are doing a similar thing
> in
> > the flatfile loader (in fact, the code is reused and merely extended).
> > Transformation operations are already supported in our codebase in the
> > extractor config, this PR has just added some hooks for stateful
> > operations.
> >
> > Furthermore, we will need a configuration object to pass to the REST call
> > if we are ever to create a UI around importing data into hbase or
> creating
> > these summary objects.
> >
> > Regarding your example:
> > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> >
> > I'm very sympathetic to this type of extension, but it has some issues:
> >
> >1. This implies a single-threaded addition to the bloom filter.
> >   1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I
> >   think this will impact performance
> >   2. There's not a way to specify how to merge across threads if we do
> >   make a multithread command line option
> >2. This restricts these kinds of operations to roles with heavy unix
> >CLI knowledge, which isn't often the types of people who would be doing
> >this type of operation
> >3. What if we need two variables passed to stellar?
> >4. This approach will be harder to move to Hadoop.  Eventually we will
> >want to support data on HDFS being processed by Hadoop (similar to
> >flatfile loader), so instead of -m LOCAL being passed for the flatfile
> >summarizer you'd pass -m SPARK and the processing would happen on the
> >cluster
> >   1. This is particularly relevant in this case as it's an
> >   embarrassingly parallel problem in general
> >
> > In summary, while this CLI approach is attractive, I prefer the
> > extractor config solution because it is the solution with the smallest
> > iteration that:
> >
> >1. Reuses existing metron extraction infrastructure
> >2. Provides the most solid base for the extensions that will be sorely
> >needed soon (and will keep it in parity with the flatfile loader)
> >3. Provides the most solid base for a future UI extension in the
> >management UI to support both summarization and loading
> >
> >
> >
> >
> > On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:
> >
> > > First off, I really do like the typosquatting use case and a lot of
> what
> > > you have described.
> > >
> > > > We need a way to generate the summary sketches from flat data for
> this
> > to
> > > > work.
> > > > ​..​
> > > >
> > >
> > > I took this quote directly from your use case.  Above is the point that
> > I'd
> > > like to discuss and what your proposed solutions center on.  This is
> > what I
> > > think you are trying to do, at least with PR #879
> > > ...
> > >
> > > (Q) Can we repurpose Stellar functions so that they can operate on text
> > > stored in a file system?
> > >
> > >
> > > Whether we use the (1) Configuration or the (2) Function-based approach
> > > that you described, fundamentally we are introducing new ways to
> perform
> > > text manipulation inside of Stellar.
> > >
> > > IMHO, I'd rather not reinvent the wheel for text manipulation.  It
> would
> > be
> > > painful to implement and maintain a bunch of Stellar functions for text
> > > manipulation.  People already have a large number of tools available to
> > do
> > > this and everyone has their favorites.  People are resistant to
> learning
> > > something new when they already are familiar with another way to do the
> > > same thing.
> > >
> > > So then the question is, how else can we do this?  My suggestion is
> that
> > > rather than introducing text manipulation tools inside of Stellar, we
> > allow
> > > people t

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Nick Allen
> Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
this will impact performance

What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
seems really high, unless I am not understanding something.






On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella  wrote:

> Thanks for the feedback, Nick.
>
> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>
> I would argue that we are not reinventing the wheel for text manipulation
> as the extractor config exists already and we are doing a similar thing in
> the flatfile loader (in fact, the code is reused and merely extended).
> Transformation operations are already supported in our codebase in the
> extractor config, this PR has just added some hooks for stateful
> operations.
>
> Furthermore, we will need a configuration object to pass to the REST call
> if we are ever to create a UI around importing data into hbase or creating
> these summary objects.
>
> Regarding your example:
> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>
> I'm very sympathetic to this type of extension, but it has some issues:
>
>1. This implies a single-threaded addition to the bloom filter.
>   1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I
>   think this will impact performance
>   2. There's not a way to specify how to merge across threads if we do
>   make a multithread command line option
>2. This restricts these kinds of operations to roles with heavy unix CLI
>knowledge, which isn't often the types of people who would be doing this
>type of operation
>3. What if we need two variables passed to stellar?
>4. This approach will be harder to move to Hadoop.  Eventually we will
>want to support data on HDFS being processed by Hadoop (similar to flatfile
>loader), so instead of -m LOCAL being passed for the flatfile summarizer
>you'd pass -m SPARK and the processing would happen on the cluster
>   1. This is particularly relevant in this case as it's an
>   embarrassingly parallel problem in general
>
> In summary, while this CLI approach is attractive, I prefer the extractor
> config solution because it is the solution with the smallest iteration
> that:
>
>1. Reuses existing metron extraction infrastructure
>2. Provides the most solid base for the extensions that will be sorely
>needed soon (and will keep it in parity with the flatfile loader)
>3. Provides the most solid base for a future UI extension in the
>management UI to support both summarization and loading
>
>
>
>
> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:
>
> > First off, I really do like the typosquatting use case and a lot of what
> > you have described.
> >
> > > We need a way to generate the summary sketches from flat data for this
> to
> > > work.
> > > ​..​
> > >
> >
> > I took this quote directly from your use case.  Above is the point that
> I'd
> > like to discuss and what your proposed solutions center on.  This is
> what I
> > think you are trying to do, at least with PR #879
> > ...
> >
> > (Q) Can we repurpose Stellar functions so that they can operate on text
> > stored in a file system?
> >
> >
> > Whether we use the (1) Configuration or the (2) Function-based approach
> > that you described, fundamentally we are introducing new ways to perform
> > text manipulation inside of Stellar.
> >
> > IMHO, I'd rather not reinvent the wheel for text manipulation.  It would
> be
> > painful to implement and maintain a bunch of Stellar functions for text
> > manipulation.  People already have a large number of tools available to
> do
> > this and everyone has their favorites.  People are resistant to learning
> > something new when they already are familiar with another way to do the
> > same thing.
> >
> > So then the question is, how else can we do this?  My suggestion is that
> > rather than introducing text manipulation tools inside of Stellar, we
> allow
> > people to use the text manipulation tools they already know, but with the
> > Stellar functions that we already have.  And the obvious way to tie those
> > two things together is the Unix pipeline.
> >
> > A quick, albeit horribly incomplete, example to flesh this out a bit more
> > based on the example you have in PR #879
> > .  This would allow me to
> > integrate Stellar with whatever external tools that I want.
> >
> > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella 
> wrote:
> >
> > > I'll start this discussion off with my idea around a 2nd step that is
> > more
> > > adaptable.  I propose the following set of stellar functions backed by
> > > Spark in the metron-manageme

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Casey Stella
Thanks for the feedback, Nick.

Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."

I would argue that we are not reinventing the wheel for text manipulation
as the extractor config exists already and we are doing a similar thing in
the flatfile loader (in fact, the code is reused and merely extended).
Transformation operations are already supported in our codebase in the
extractor config, this PR has just added some hooks for stateful operations.

Furthermore, we will need a configuration object to pass to the REST call
if we are ever to create a UI around importing data into hbase or creating
these summary objects.

Regarding your example:
$ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'

I'm very sympathetic to this type of extension, but it has some issues:

   1. This implies a single-threaded addition to the bloom filter.
  1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I
  think this will impact performance
  2. There's not a way to specify how to merge across threads if we do
  make a multithread command line option
   2. This restricts these kinds of operations to roles with heavy unix CLI
   knowledge, which isn't often the types of people who would be doing this
   type of operation
   3. What if we need two variables passed to stellar?
   4. This approach will be harder to move to Hadoop.  Eventually we will
   want to support data on HDFS being processed by Hadoop (similar to flatfile
   loader), so instead of -m LOCAL being passed for the flatfile summarizer
   you'd pass -m SPARK and the processing would happen on the cluster
   1. This is particularly relevant in this case as it's an
  embarrassingly parallel problem in general

In summary, while this CLI approach is attractive, I prefer the extractor
config solution because it is the solution with the smallest iteration that:

   1. Reuses existing metron extraction infrastructure
   2. Provides the most solid base for the extensions that will be sorely
   needed soon (and will keep it in parity with the flatfile loader)
   3. Provides the most solid base for a future UI extension in the
   management UI to support both summarization and loading




On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:

> First off, I really do like the typosquatting use case and a lot of what
> you have described.
>
> > We need a way to generate the summary sketches from flat data for this to
> > work.
> > ​..​
> >
>
> I took this quote directly from your use case.  Above is the point that I'd
> like to discuss and what your proposed solutions center on.  This is what I
> think you are trying to do, at least with PR #879
> ...
>
> (Q) Can we repurpose Stellar functions so that they can operate on text
> stored in a file system?
>
>
> Whether we use the (1) Configuration or the (2) Function-based approach
> that you described, fundamentally we are introducing new ways to perform
> text manipulation inside of Stellar.
>
> IMHO, I'd rather not reinvent the wheel for text manipulation.  It would be
> painful to implement and maintain a bunch of Stellar functions for text
> manipulation.  People already have a large number of tools available to do
> this and everyone has their favorites.  People are resistant to learning
> something new when they already are familiar with another way to do the
> same thing.
>
> So then the question is, how else can we do this?  My suggestion is that
> rather than introducing text manipulation tools inside of Stellar, we allow
> people to use the text manipulation tools they already know, but with the
> Stellar functions that we already have.  And the obvious way to tie those
> two things together is the Unix pipeline.
>
> A quick, albeit horribly incomplete, example to flesh this out a bit more
> based on the example you have in PR #879
> .  This would allow me to
> integrate Stellar with whatever external tools that I want.
>
> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>
>
>
>
>
>
>
>
> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella  wrote:
>
> > I'll start this discussion off with my idea around a 2nd step that is
> more
> > adaptable.  I propose the following set of stellar functions backed by
> > Spark in the metron-management project:
> >
> >- CSV_PARSE(location, separator?, columns?) : Constructs a Spark
> >Dataframe for reading the flatfile
> >- SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> > dataframe
> >- SUMMARIZE(state_init, state_update, state_merge): Summarize the
> >dataframe using the lambda functions:
> >   - state_init - executed once per worker to initialize the state
> >   - state_update - executed once per row
> >   - state_merge - Merge the worker states into one 

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-26 Thread Nick Allen
First off, I really do like the typosquatting use case and a lot of what
you have described.

> We need a way to generate the summary sketches from flat data for this to
> work.
> ​..​
>

I took this quote directly from your use case.  Above is the point that I'd
like to discuss and what your proposed solutions center on.  This is what I
think you are trying to do, at least with PR #879
...

(Q) Can we repurpose Stellar functions so that they can operate on text
stored in a file system?


Whether we use the (1) Configuration or the (2) Function-based approach
that you described, fundamentally we are introducing new ways to perform
text manipulation inside of Stellar.

IMHO, I'd rather not reinvent the wheel for text manipulation.  It would be
painful to implement and maintain a bunch of Stellar functions for text
manipulation.  People already have a large number of tools available to do
this and everyone has their favorites.  People are resistant to learning
something new when they already are familiar with another way to do the
same thing.

So then the question is, how else can we do this?  My suggestion is that
rather than introducing text manipulation tools inside of Stellar, we allow
people to use the text manipulation tools they already know, but with the
Stellar functions that we already have.  And the obvious way to tie those
two things together is the Unix pipeline.

A quick, albeit horribly incomplete, example to flesh this out a bit more
based on the example you have in PR #879
.  This would allow me to
integrate Stellar with whatever external tools that I want.

$ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'








On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella  wrote:

> I'll start this discussion off with my idea around a 2nd step that is more
> adaptable.  I propose the following set of stellar functions backed by
> Spark in the metron-management project:
>
>- CSV_PARSE(location, separator?, columns?) : Constructs a Spark
>Dataframe for reading the flatfile
>- SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> dataframe
>- SUMMARIZE(state_init, state_update, state_merge): Summarize the
>dataframe using the lambda functions:
>   - state_init - executed once per worker to initialize the state
>   - state_update - executed once per row
>   - state_merge - Merge the worker states into one worker state
>- OBJECT_SAVE(obj, output_path) : Save the object obj to the path
>output_path on HDFS.
>
> This would enable more flexibility and composability than the
> configuration-based approach that we have in the flatfile loader.
> My concern with this approach, and the reason I didn't do it initially, was
> that I think that users will want at least 2 ways to summarize data (or
> load data):
>
>- A configuration based approach, which enables a UI
>- A set of stellar functions via the scriptable REPL
>
> I would argue that both have a place and I started with the configuration
> based approach as it was a more natural extension of what we already had.
> I'd love to hear thoughts about this idea too.
>
>
> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella  wrote:
>
> > Hi all,
> >
> > I wanted to get some feedback on a sensible plan for something.  It
> > occurred to me the other day when considering the use-case of detecting
> > typosquatted domains, that one approach was to generate the set of
> > typosquatted domains for some set of reference domains and compare
> domains
> > as they flow through.
> >
> > One way we could do this would be to generate this data and import the
> > typosquatted domains into HBase.  I thought, however, that another
> approach
> > which may trade-off accuracy to remove the network hop and potential disk
> > seek by constructing a bloom filter that includes the set of typosquatted
> > domains.
> >
> > The challenge was that we don't have a way to do this currently.  We do,
> > however, have a loading infrastructure (e.g. the flatfile_loader) and
> > configuration (see https://github.com/apache/metron/tree/master/metron-
> > platform/metron-data-management#common-extractor-properties)  which
> > handles:
> >
> >- parsing flat files
> >- transforming the rows
> >- filtering the rows
> >
> > To enable the new use-case of generating a summary object (e.g. a bloom
> > filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> > propose that we create a new utility that uses the same extractor config
> > and adds the ability to:
> >
> >- initialize a state object
> >- update the object for every row
> >- merge the state objects (in the case of multiple threads, in the
> >case of one thread it's not needed).
> >
> > I think this is a sensible decision because:
> >
> >- It's a minimal movement from the flat file loader
> >   - Uses

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Otto Fowler
1st.  You are not the only one looking at the list on Dec 24th, Casey, so don’t
feel bad.


2nd.  Maybe we can separate this into 2 areas of concern.

1. Stellar can load objects into ‘caches’ from some repository and refer to
them.
2. The repositories
3. Some number of strategies to populate and possibly update the
repository, from spark,
to MR jobs to whatever you would classify the flat file stuff as.

wait, separate this into 3, 3 areas of concern!

1. Stellar can load objects into ‘caches’ from some repository and refer to
them.
2. The repositories
3. Some number of strategies to populate and possibly update the
repository, from spark,
to MR jobs to whatever you would classify the flat file stuff as.
4. Let the Stellar API for everything but LOAD() follow after we get usage

4!  4 areas of concern..

wait, I’ll write another reply….




(Nobody expected that!)



On December 24, 2017 at 20:47:17, Casey Stella (ceste...@gmail.com) wrote:

Oh, one more thing, while the example here is around typosquatting, this is
of use outside of that. Pretty much any large existence-style query can be
enabled via this construction (create a summary bloom filter). There are
other use-cases involving other data structures too.

On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella  wrote:

> Hi all,
>
> I wanted to get some feedback on a sensible plan for something. It
> occurred to me the other day when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare
domains
> as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase. I thought, however, that another
approach
> which may trade-off accuracy to remove the network hop and potential disk
> seek by constructing a bloom filter that includes the set of typosquatted
> domains.
>
> The challenge was that we don't have a way to do this currently. We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see https://github.com/apache/metron/tree/master/metron-
> platform/metron-data-management#common-extractor-properties) which
> handles:
>
> - parsing flat files
> - transforming the rows
> - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
> - initialize a state object
> - update the object for every row
> - merge the state objects (in the case of multiple threads, in the
> case of one thread it's not needed).
>
> I think this is a sensible decision because:
>
> - It's a minimal movement from the flat file loader
> - Uses the same configs
> - Abstracts and reuses the existing infrastructure
> - Having one extractor config means that it should be easier to
> generate a UI around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an existing difficult
config.
> I propose that this is an initial step forward to support the use-case
and
> we can enable something more composable going forward. My concern in
> considering this as the first step was that it felt that the composable
> units for data transformation and manipulation suddenly takes us into a
> place where Stellar starts to look like Pig or Spark RDD API. I wasn't
> ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community is, after reviewing
> the entire use-case at https://github.com/cestella/incubator-metron/tree/
> typosquat_merge/use-cases/typosquat_detection:
>
> - Is this so confusing that it does not belong in Metron even as a
> first-step?
> - Is there a way to extend the extractor config in a less confusing
> way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Casey Stella
Oh, one more thing, while the example here is around typosquatting, this is
of use outside of that.  Pretty much any large existence-style query can be
enabled via this construction (create a summary bloom filter).  There are
other use-cases involving other data structures too.
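
For concreteness, the existence check at enrichment time might look roughly
like this (a sketch: OBJECT_GET is the function proposed in this thread,
BLOOM_EXISTS is the bloom filter membership function, and the path is
illustrative):

bloom := OBJECT_GET('/apps/metron/objects/typosquat_bloom.ser')
is_potential_typosquat := BLOOM_EXISTS(bloom, DOMAIN_REMOVE_TLD(domain))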

On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella  wrote:

> Hi all,
>
> I wanted to get some feedback on a sensible plan for something.  It
> occurred to me the other day when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare domains
> as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase.  I thought, however, that another approach
> which may trade-off accuracy to remove the network hop and potential disk
> seek by constructing a bloom filter that includes the set of typosquatted
> domains.
>
> The challenge was that we don't have a way to do this currently.  We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see https://github.com/apache/metron/tree/master/metron-
> platform/metron-data-management#common-extractor-properties)  which
> handles:
>
>- parsing flat files
>- transforming the rows
>- filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> add the ability to:
>
>- initialize a state object
>- update the object for every row
>- merge the state objects (in the case of multiple threads, in the
>case of one thread it's not needed).
>
> I think this is a sensible decision because:
>
>- It's a minimal movement from the flat file loader
>   - Uses the same configs
>   - Abstracts and reuses the existing infrastructure
>- Having one extractor config means that it should be easier to
>generate a UI around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an existing difficult config.
> I propose that this is an initial step forward to support the use-case and
> we can enable something more composable going forward.  My concern in
> considering this as the first step was that it felt that the composable
> units for data transformation and manipulation suddenly take us into a
> place where Stellar starts to look like Pig or Spark RDD API.  I wasn't
> ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community is, after reviewing
> the entire use-case at https://github.com/cestella/incubator-metron/tree/
> typosquat_merge/use-cases/typosquat_detection:
>
>- Is this so confusing that it does not belong in Metron even as a
>first-step?
>- Is there a way to extend the extractor config in a less confusing
>way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Casey Stella
I'll start this discussion off with my idea around a second step that is more
adaptable.  I propose the following set of Stellar functions backed by
Spark in the metron-management project:

   - CSV_PARSE(location, separator?, columns?): Constructs a Spark
   DataFrame for reading the flat file
   - SQL_TRANSFORM(dataframe, spark_sql_statement): Transforms the dataframe
   - SUMMARIZE(state_init, state_update, state_merge): Summarizes the
   dataframe using the lambda functions:
      - state_init - executed once per worker to initialize the state
      - state_update - executed once per row
      - state_merge - merges the worker states into one final state
   - OBJECT_SAVE(obj, output_path): Saves the object obj to the path
   output_path on HDFS.
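
To make the composition concrete, a hypothetical REPL session for the
typosquatting case might look like the following (the lambda syntax and the
BLOOM_* helpers here are assumptions for illustration; only the four
functions above are what's actually proposed):

  df := CSV_PARSE('/data/reference_domains.csv', ',', ['domain'])
  df := SQL_TRANSFORM(df, 'SELECT domain FROM df WHERE domain IS NOT NULL')
  summary := SUMMARIZE(
      () -> BLOOM_INIT(),
      (state, row) -> BLOOM_ADD(state, MAP_GET('domain', row)),
      (left, right) -> BLOOM_MERGE([left, right]))
  OBJECT_SAVE(summary, '/apps/metron/objects/typosquat.ser')

Each step returns a first-class object, so the same pieces would compose
differently for other summary types.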

This would enable more flexibility and composability than the
configuration-based approach that we have in the flatfile loader.
My concern with this approach, and the reason I didn't do it initially, was
that I think users will want at least two ways to summarize (or load) data:

   - A configuration-based approach, which enables a UI
   - A set of Stellar functions via the scriptable REPL

I would argue that both have a place, and I started with the
configuration-based approach as it was a more natural extension of what we
already had.  I'd love to hear thoughts about this idea too.




[DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Casey Stella
Hi all,

I wanted to get some feedback on a sensible plan for something.  It
occurred to me the other day, when considering the use-case of detecting
typosquatted domains, that one approach would be to generate the set of
typosquatted domains for some set of reference domains and compare domains
against that set as they flow through.
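
For a sense of what "generate the set of typosquatted domains" means, here
is a deliberately minimal Java sketch covering just two typo classes; real
generators cover many more (homoglyphs, bitsquats, TLD swaps, etc.):

  import java.util.HashSet;
  import java.util.Set;

  public class TyposquatVariants {
    // Generate character-omission and adjacent-transposition variants of a
    // reference domain, e.g. "google.com" -> "gogle.com", "googel.com".
    public static Set<String> variants(String domain) {
      Set<String> out = new HashSet<>();
      for (int i = 0; i < domain.length(); i++) {
        out.add(domain.substring(0, i) + domain.substring(i + 1)); // omission
        if (i + 1 < domain.length()) {                             // transposition
          char[] c = domain.toCharArray();
          char tmp = c[i]; c[i] = c[i + 1]; c[i + 1] = tmp;
          out.add(new String(c));
        }
      }
      out.remove(domain);
      return out;
    }
  }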

One way we could do this would be to generate this data and import the
typosquatted domains into HBase.  I thought, however, that another approach
might trade off some accuracy to remove the network hop and potential disk
seek: constructing a bloom filter that includes the set of typosquatted
domains.

The challenge was that we don't have a way to do this currently.  We do,
however, have a loading infrastructure (e.g. the flatfile_loader) and
configuration (see
https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
which handles:

   - parsing flat files
   - transforming the rows
   - filtering the rows
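
For readers who haven't seen one, an extractor config of the sort the loader
consumes looks roughly like this (keys follow the README linked above; the
values are made up for illustration):

  {
    "config" : {
      "columns" : { "domain" : 0 },
      "separator" : ",",
      "value_transform" : { "domain" : "DOMAIN_REMOVE_TLD(domain)" },
      "value_filter" : "LENGTH(domain) > 0"
    },
    "extractor" : "CSV"
  }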

To enable the new use-case of generating a summary object (e.g. a bloom
filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
propose that we create a new utility that uses the same extractor config
and adds the ability to:

   - initialize a state object
   - update the object for every row
   - merge the state objects (needed only when running multiple threads;
   with a single thread there is nothing to merge).
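
In code terms, that lifecycle amounts to something like the following
hypothetical interface (the names are illustrative, not the API in the PR):

  import java.util.Map;

  // One instance drives a load; S is the summary type, e.g. a bloom filter.
  public interface Summarizer<S> {
    S initialize();                              // once per thread, before any rows
    S update(S state, Map<String, Object> row);  // once per parsed/transformed row
    S merge(S left, S right);                    // fold the per-thread states into one
  }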

I think this is a sensible decision because:

   - It's a minimal movement from the flat file loader
      - Uses the same configs
      - Abstracts and reuses the existing infrastructure
   - Having one extractor config means that it should be easier to generate
   a UI around this to simplify the experience

All that being said, our extractor config is... shall we say... daunting :).
I am sensitive to the fact that this adds to an already difficult config.
I propose this as an initial step to support the use-case, with something
more composable to follow.  My concern with making the composable route the
first step was that composable units for data transformation and
manipulation quickly take us to a place where Stellar starts to look like
Pig or the Spark RDD API.  I wasn't ready for that without a lot more
discussion.

To summarize, after reviewing the entire use-case at
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection
what I'd like to get from the community is:

   - Is this so confusing that it does not belong in Metron even as a
   first step?
   - Is there a way to extend the extractor config in a less confusing way
   to enable this?

I apologize for making the discuss thread *after* the JIRAs, but I felt
this one might bear having some working code to consider.