[GitHub] incubator-metron issue #416: METRON-656: Make Stellar 'in' closer to functio...

2017-01-13 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/416
  
@jjmeyer0 Thanks so much for the comments!

1. Yeah, definitely, but this is a language feature rather than a function 
that we're testing, so I put it in `StellarTest`.
2. I think I got that one covered, but let me know if I botched it.  I did 
add the cases late, after the PR was submitted.
3. Good catch, I'll do that now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Kibana Tile Map not available

2017-01-13 Thread Nick Allen
Hi Dima - I forgot to email earlier, but the tile maps are working fine for
me using Quick Dev. Still having this problem?

On Wed, Jan 11, 2017 at 6:02 PM, Dima Kovalyov 
wrote:

> Hello,
>
> I have attached a screenshot of what my Tile Map looks like right now
> (kibana_tilemap.png).
> Basically there is no geo map, only dots that represent locations on the
> map.
>
> It appears that the geo map images are hosted on an external CDN which is
> now unavailable for me. I feel like it is not available for you either.
> Is that so?
> Please advise.
>
> Example URLs:
> https://otile3-s.mqcdn.com/tiles/1.0.0/map/2/1/1.jpeg
> https://otile3-s.mqcdn.com/tiles/1.0.0/map/2/0/2.jpeg
>
> - Dima
>



-- 
Nick Allen 


[GitHub] incubator-metron issue #416: METRON-656: Make Stellar 'in' closer to functio...

2017-01-13 Thread jjmeyer0
Github user jjmeyer0 commented on the issue:

https://github.com/apache/incubator-metron/pull/416
  
+1 (non-binding). I'll create a Jira based on the last comments between 
@cestella and me.




Re: Kibana Tile Map not available

2017-01-13 Thread Dima Kovalyov
Hello Nick,

Thanks for the reply! Yeah, I still have it. Which CDN is your Kibana
using?
Are you able to access the URLs I listed in my original email?

- Dima



[GitHub] incubator-metron pull request #416: METRON-656: Make Stellar 'in' closer to ...

2017-01-13 Thread jjmeyer0
Github user jjmeyer0 commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/416#discussion_r96071886
  
--- Diff: metron-platform/metron-common/README.md ---
@@ -39,11 +40,17 @@ The following keywords need to be single quote escaped 
in order to be used in St
 | <= | \> | \>= |
 | ? | \+ | \- |
 | , | \* | / |
+|  | \* | / |
 
 Using parens such as: "foo" : "\" requires escaping; "foo": 
"\'\\'"
 
+## Stellar Language Inclusion Checks (`in` and `not in`)
+1. `in` supports string contains. e.g. `'foo' in 'foobar' == true`
+2. `in` supports collection contains. e.g. `'foo' in [ 'foo', 'bar' ] == 
true`
+3. `in` supports map key contains. e.g. `'foo' in { 'foo' : 5} == true`
+4. `not in` is the negation of the in expression. e.g. `'grok' not in 
'foobar' == true`
--- End diff --

Sorry, one last comment. I was a bit curious how the expression 
`'grok' not in 'foobar' == true` would be evaluated by Stellar. I wasn't sure 
if it would be `('grok' not in 'foobar') == true` or `'grok' not in ('foobar' 
== true)`.  Unfortunately, when I tried to run a test, it said it is not a 
valid expression. I think this may be an issue in the Stellar grammar. It is 
probably outside the scope of this ticket, but I thought I should mention it here. 




[GitHub] incubator-metron issue #416: METRON-656: Make Stellar 'in' closer to functio...

2017-01-13 Thread jjmeyer0
Github user jjmeyer0 commented on the issue:

https://github.com/apache/incubator-metron/pull/416
  
I think this is a great feature. I do have a few suggestions/questions:

1. Does it make sense to start moving away from `StellarTest` and break out 
into specific test classes (e.g. `StellarArithmeticTest`, 
`StellarPredicateProcessorTest`, etc.)? Or should this be a task in and of 
itself, to design a good structure?
2. I think the tests that are there for `in` should also be run against 
`not in`. 
3. Update the README.md to include a description of how `in` and `not in` 
should work. For example, I'm not really sure what I should expect from the 
expression: `1 in `.




[GitHub] incubator-metron issue #316: METRON-503: Metron REST API

2017-01-13 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/incubator-metron/pull/316
  
That use case makes sense.  I will leave it in.




[GitHub] incubator-metron issue #394: METRON-623: Management UI

2017-01-13 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/incubator-metron/pull/394
  
I'm going to close this until METRON-503 gets merged.  Sorry for the 
distraction.




[GitHub] incubator-metron issue #416: METRON-656: Make Stellar 'in' closer to functio...

2017-01-13 Thread jjmeyer0
Github user jjmeyer0 commented on the issue:

https://github.com/apache/incubator-metron/pull/416
  
I found one additional corner case that may need to be addressed. It looks 
like the expression `null in [ null, 'something' ]` returns false, but it 
should return true.
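
A quick JDK-only sketch of the expected semantics (plain Java for
illustration, not Metron's actual implementation):

```
import java.util.Arrays;
import java.util.List;

public class NullInListSketch {
  public static void main(String[] args) {
    // The list literal [ null, 'something' ] holds an explicit null element.
    List<Object> values = Arrays.asList(null, "something");

    // Collection.contains(null) returns true when a null element is present,
    // so `null in [ null, 'something' ]` should likewise evaluate to true.
    System.out.println(values.contains(null));  // prints: true
  }
}
```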




[GitHub] incubator-metron issue #416: METRON-656: Make Stellar 'in' closer to functio...

2017-01-13 Thread jjmeyer0
Github user jjmeyer0 commented on the issue:

https://github.com/apache/incubator-metron/pull/416
  
Disregard number 2. It was an oversight on my part. Sorry about that.




Re: Kibana Tile Map not available

2017-01-13 Thread Nick Allen
No, I cannot reach those URLs.
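
(For what it's worth: if the default tile CDN has gone away, Kibana can
usually be pointed at a different tile provider via the `tilemap.url`
setting in `kibana.yml`; worth verifying against the docs for your specific
Kibana version.)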



-- 
Nick Allen 


[GitHub] incubator-metron pull request #394: METRON-623: Management UI

2017-01-13 Thread merrimanr
Github user merrimanr closed the pull request at:

https://github.com/apache/incubator-metron/pull/394




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com
I like the suggestions you made, Nick.  The only thing I would add is that
it's also nice to see an explicit when(false), as people newer to the
platform may not know where to expect configs for the different writers.
Being able to do it either way, which I think is already assumed in your
model, would make sense.  I would just suggest that, if we support but are
disabling a writer, the platform insert a default when(false) to be
explicit.
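
For example, a supported-but-disabled HDFS writer might be rendered
explicitly as something like this (the shape is illustrative only, mirroring
the writer-level configs discussed in this thread):

{
   'hdfs': {
      'when': 'false'
   }
}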

Jon


[GitHub] incubator-metron issue #416: METRON-656: Make Stellar 'in' closer to functio...

2017-01-13 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/416
  
@jjmeyer0 Oh good one!




[GitHub] incubator-metron pull request #416: METRON-656: Make Stellar 'in' closer to ...

2017-01-13 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/416#discussion_r96073648
  
--- Diff: metron-platform/metron-common/README.md ---
@@ -39,11 +40,17 @@ The following keywords need to be single quote escaped 
in order to be used in St
 | <= | \> | \>= |
 | ? | \+ | \- |
 | , | \* | / |
+|  | \* | / |
 
 Using parens such as: "foo" : "\" requires escaping; "foo": 
"\'\\'"
 
+## Stellar Language Inclusion Checks (`in` and `not in`)
+1. `in` supports string contains. e.g. `'foo' in 'foobar' == true`
+2. `in` supports collection contains. e.g. `'foo' in [ 'foo', 'bar' ] == 
true`
+3. `in` supports map key contains. e.g. `'foo' in { 'foo' : 5} == true`
+4. `not in` is the negation of the in expression. e.g. `'grok' not in 
'foobar' == true`
--- End diff --

Yeah, that's a Stellar language bug.  `('grok' not in 'foobar') == true` 
should work.




[GitHub] incubator-metron issue #416: METRON-656: Make Stellar 'in' closer to functio...

2017-01-13 Thread mmiklavc
Github user mmiklavc commented on the issue:

https://github.com/apache/incubator-metron/pull/416
  
I like the semantics here. One small comment on the tests, +1 pending that 
adjustment and Travis.




[GitHub] incubator-metron pull request #416: METRON-656: Make Stellar 'in' closer to ...

2017-01-13 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/416#discussion_r96046421
  
--- Diff: 
metron-platform/metron-common/src/test/java/org/apache/metron/common/stellar/StellarTest.java
 ---
@@ -418,6 +417,33 @@ public void testList() throws Exception {
   }
 
   @Test
+  public void testInMap() throws Exception {
--- End diff --

Sure thing.  Done, let me know what you think about them.




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen
I will add also that instead of global overrides, like index, we should use
configuration key names that are more appropriate to the output.

For example, does 'index' really make sense for HDFS?  Or would 'path' be
more appropriate?

{
   'elasticsearch': {
  'index': 'foo',
  'batchSize': 1
},
   'hdfs': {
  'path': '/foo/bar/...',
  'batchSize': 100
}
}

Ok, I've said my piece.  Thanks for the effort in summarizing all this,
Casey.



Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
Let me noodle on this over the weekend.  Your syntax is looking less
onerous to me, and I like the following statement from Otto: "In the end,
each write destination ‘type’ will need its own configuration.  This is an
extension point."

I may come around to your way of thinking.


[GitHub] incubator-metron pull request #416: METRON-656: Make Stellar 'in' closer to ...

2017-01-13 Thread cestella
GitHub user cestella opened a pull request:

https://github.com/apache/incubator-metron/pull/416

METRON-656: Make Stellar 'in' closer to functioning like python

We have an `in` operator in Stellar, but it could be much better. This 
should bring it to parity with the `in` operator in Python:
* `in` should support string contains, e.g. `'foo' in 'foobar'`
* `in` should support Collection contains, e.g. `'foo' in [ 'foo', 'bar' ]`. 
Previously only lists were supported.
* `in` should support map key contains, e.g. `'foo' in { 'foo' : 5 }`
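
As a rough illustration of those three cases, here is a plain-JDK sketch (an
illustration of the intended semantics, not the actual Stellar
implementation):

```
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;

public class InOperatorSketch {
  // Dispatch mirrors the bullets above: string contains, map key contains,
  // and collection contains.
  static boolean in(Object needle, Object haystack) {
    if (haystack instanceof String && needle instanceof String) {
      return ((String) haystack).contains((String) needle);   // 'foo' in 'foobar'
    }
    if (haystack instanceof Map) {
      return ((Map<?, ?>) haystack).containsKey(needle);      // 'foo' in { 'foo' : 5 }
    }
    if (haystack instanceof Collection) {
      return ((Collection<?>) haystack).contains(needle);     // 'foo' in [ 'foo', 'bar' ]
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(in("foo", "foobar"));                           // true
    System.out.println(in("foo", Arrays.asList("foo", "bar")));        // true
    System.out.println(in("foo", Collections.singletonMap("foo", 5))); // true
  }
}
```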


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cestella/incubator-metron METRON-656

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-metron/pull/416.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #416


commit 98d9b948f052eb5614d58d2900b898fb45fbf04f
Author: cstella 
Date:   2017-01-13T17:05:51Z

METRON-656: Add String Contains to Stellar

commit 931589ef3e27a80ee7f94147212124658d4c75db
Author: cstella 
Date:   2017-01-13T17:56:06Z

METRON-656: Make Stellar 'in' closer to functioning like python






Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Matt Foley
Gosh I hate being two hours behind you guys! :-)

I’ll go back through the thread and collect open questions, but wanted to put a 
word in about Zookeeper integration.
I had been about to ask what the benefits of using ZK are, and I’ve now heard 
two features:
- A logical “single place” that is visible and efficiently accessible by all 
processes on all nodes.
- Supports async notification, and therefore updates without restarts.

If only the first point were there, it could be replaced by Ambari, because 
Ambari manages propagating configs so they look local to all nodes.  And btw, 
there’s no bottlenecking, because clients don’t call the Ambari REST APIs to 
*read* configs, just to *change* them.  For reading established configs, 
clients just go to those local files, which are managed by the ambari-agents.

But the ability to use ZK and Curator to support async config updates, as we 
do, is really important.  In fact, I think if we make our use of it consistent 
we might offer an example to the Ambari team for a general feature they could 
adopt, precisely to support config changes without restart.
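
(For anyone newer to the codebase, the pattern being described looks roughly
like the following Curator sketch. The connect string and znode path here
are illustrative assumptions, not Metron's actual layout.)

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConfigWatcherSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "node1:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // NodeCache keeps a local copy of the znode and fires on every change,
    // which is what allows config updates without a topology restart.
    final NodeCache cache = new NodeCache(client, "/metron/topology/enrichment");
    cache.getListenable().addListener(() -> {
      byte[] raw = cache.getCurrentData().getData();
      // Re-parse and swap in the new config here.
      System.out.println("config updated: " + new String(raw));
    });
    cache.start();
  }
}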

Three additional considerations:

1. HBase has always used ZK for various things.  I don’t know if that includes 
configuration.  If so, that’s already integrated with Ambari.  We should look 
into the details of that.

2. Can folks who’ve been here a while clarify why our use of ZK is so piecemeal, 
and scattered across several places in the znode tree?  There seems to be a vague idea 
that “things that get changed” go in ZK, while other configs go in local files. 
 But all configs, by definition, can be changed.  Is there any real reason not 
to put the whole Metron configuration in ZK, with a clean and consistent 
directory structure?  It’s okay if Metron cannot actually consume all updates 
asynchronously (like some topology configs that require a topo restart if 
changed).  We just document which configs do and don’t support async change.  
HDFS has lots of those.

3. I’m pretty sure we can suppress the “need to restart” warning from Ambari.  
We’ll need to dig in to find out how flexible this is.

I do think we should continue supporting non-Ambari use, and if we put all 
configs in ZK, that gets way easier to do in a simple and consistent way. 
(Propagation problem solved).  More thoughts after I have ‘em :-)

Thanks,
--Matt


On 1/13/17, 8:30 AM, "Casey Stella"  wrote:

I think that looks good.  One last question: do we support the manual
install use case (one where Ambari does not exist, I mean)?

Casey

On Fri, Jan 13, 2017 at 11:28 AM, David Lyle  wrote:

> That's good feedback, Jon. I think that puts us at:
>
>  - Expand ambari to manage the remaining sensor-specific configs
>  - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
>to push to ambari and take an Ambari user/pw and (optionally) reason
>  - We shall retain current functionality wrt live configuration changes.
> Suggestion- ConfigurationUtils will push to both zookeeper and Ambari in 
an
> atomic operation. (I suspect we can make ambari do this as well)
>  - Refactor the middleware that Ryan submitted to have the API calls take
>  an Ambari user/pw and (optionally) reason
>  - Refactor the management UI to pass in an Ambari user/pw and 
(optionally)
> reason
>  - Refactor the Stellar Management functions CONFIG_PUT to accept an 
Ambari
> user/pw and (optionally) reason
>
> -D...
>
>
> On Fri, Jan 13, 2017 at 11:17 AM, Nick Allen  wrote:
>
> > +1  I strongly agree with Jon's view.   Requiring a restart would be a
> big
> > step backwards.
> >
> > I think the power of the platform is that the user can act on live
> > streaming data in a quick, iterative fashion.  Adding enrichments,
> creating
> > triage rules, adjusting profiles are all operational activities that can
> be
> > performed at any time in response to active threats.
> >
> >
> >
> >
> > On Fri, Jan 13, 2017 at 10:59 AM, zeo...@gmail.com 
> > wrote:
> >
> > > Right, good conversation to bring up for sure.
> > >
> > > Just to comment on production generally only being updated during
> > > maintenance windows - I can tell you that my plans are to make my dev,
> > > test, and prod Metron a very dynamic and frequently changing
> environment
> > > which will have coordinated but frequent modifications and I strongly
> > > prefer not having to restart anywhere that I can.  Of course it will
> > > happen, but keeping it to a minimum is key.
> > >
> > > Jon
> > >
> > > On Fri, Jan 13, 2017 at 10:53 AM Nick Allen 
> wrote:
> > >
> > > > Makes sense, Dave.  I am totally clear on the proposal.  I just
> wanted
> > to
> > > > 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com
Hmm, I'm not sure I agree that in most cases users would accept the default
batch size, especially in sizeable environments.

In search tiers like ES it is very important, and it should be tuned to the
specific data that you're sending, because it depends on the number of
bytes, not necessarily the number of messages.

This makes me wonder if a separate enhancement would be to allow either the
number of entries OR the size of the entries to define the batch.  If that
were the case, I could see a sane default based on the number of bytes sent
to search being more stable.  However, I don't know how realistic that is in
Storm.
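
Hypothetically, something like this (note that `batchSizeBytes` is not an
existing config key, just a sketch of the idea):

{
   'elasticsearch': {
      'index': 'foo',
      'batchSizeBytes': 5242880
   }
}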

Jon


Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com
I think Simon has a very valid suggestion.  Additionally, I have two
questions.  For the following config:

{
  "index" : "foo"
 ,"batchSize" : 100
}

Are all logs now going to the same index?  I read this as a writer-specific
override of the sensor-specific defaults to use an index name of foo* (in
HDFS that's foo, in ES that's foo-${timestamp}).  If that's true, would
something like this work?

{
 "batchSize" : 100
 , "writerConfig" :
   {
  "elasticsearch" : {
   "when" : "exists(field1)",
   "index" : "+foo"
 }
   }
}

How I read this is: set a default batchSize of 100, but for each index
(holding to the sensor-specific defaults), specify an override for
elasticsearch to send to the index foo when field1 exists.  The result in
my mind would be that the sensor-specific default and foo both get this log
line, if field1 exists.

Of course the syntax I used for "+foo" is probably not optimal, but it is
just illustrative: it appends an additional index to send to, as opposed to
overwriting the destination index (if you didn't add the +).  In fact, the
more I look at it, this appears to be a bad approach, but I'm struggling to
think of an exact, cleaner solution to suggest offhand.  Something that does
`if exists(field1): index += foo`.

Also, as previously discussed, this could easily be a follow-on enhancement.
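
One hypothetical shape for that (none of these keys exist today; this just
restates the append idea using the writerConfig structure from above):

{
  "batchSize" : 100,
  "writerConfig" : {
    "elasticsearch" : {
      "when" : "exists(field1)",
      "additionalIndices" : [ "foo" ]
    }
  }
}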

Jon

On Fri, Jan 13, 2017 at 11:18 AM David Lyle  wrote:

Thanks Casey!

I think I had the right of it, but wanted to make sure.

I'm +1 on defaults in global with overrides in sensor-specific. At least in
the first iteration. I (like Otto) suspect we'll have a few go-arounds on
this.

-D...


On Fri, Jan 13, 2017 at 11:09 AM, Otto Fowler 
wrote:

> This is an excellent point
>
>
> On January 13, 2017 at 10:54:07, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Something else to consider here is the possibility of multiple indices
> within a given target technology.
>
> For example, if I’m indexing data from a given sensor into, say solr, I
> may want it filtered differently into two different indices. This would
> enable me to create different ‘views’ which could have different security
> settings applied in that backend. This would be useful for multi-tenant
> installs, and for differing data privilege levels within an organisation.
> You could argue that this is more a concern for filtering of the results
> coming out of an index, but currently this is a lot harder than using
> something like the Ranger Solr authorisation plugin to control access at
> an index-by-index granularity.
>
> Essentially, the indexer topology then becomes a filter and router, which
> argues for it being a separate step, before the process which actually
> writes out to each platform. It may also make sense to have a concept of a
> routing key built up by earlier enrichment to allow shuffle control in
> storm, rather than a full stellar statement for routing, to avoid
overhead.
>
> Simon
>
> > On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> >
> > I am suggesting that, yes. The configs are essentially the same as yours,
> > except there is an override specified at the top level. Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each. It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy: both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now). Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> >> Are you saying we support all of these variants? I realize you are trying
> >> to have some backwards compatibility, but this also makes it harder for a
> >> user to grok (for me at least).
> >>
> >> Personally I like my original example as there are fewer sub-structures,
> >> like 'writerConfig', which makes the whole thing simpler and easier to
> >> grok. But maybe others will think your proposal is just as easy to grok.
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >>
> >>> Ok, so here's what I'm thinking based on the discussion:
> >>>
> >>> - Keeping the configs that we have now (batchSize and index) as defaults
> >>> for the unspecified writer-specific case
> >>> - Adding the config Nick suggested
> >>>
> >>> *Base Case*:
> >>> {
> >>> }
> >>>
> >>> - all writers write all messages
> >>> - index named the same as the sensor for all writers
> >>> - batchSize of 1 for all writers
> >>>
> >>> *Writer-non-specific case*:
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 100
> >>> }
> >>>
> >>> - All writers write all messages
> 

[GitHub] incubator-metron pull request #416: METRON-656: Make Stellar 'in' closer to ...

2017-01-13 Thread mmiklavc
Github user mmiklavc commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/416#discussion_r96046647
  
--- Diff: 
metron-platform/metron-common/src/test/java/org/apache/metron/common/stellar/StellarTest.java
 ---
@@ -418,6 +417,33 @@ public void testList() throws Exception {
   }
 
   @Test
+  public void testInMap() throws Exception {
--- End diff --

Looks great! +1




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler
In the end, each write destination ‘type’ will need its own
configuration.  This is an extension point.
{
HDFS:{
outputAdapters:[
{name: avro,
settings:{
avro stuff….
when:{
},
{
 name: sequence file,
 …..

or some such.



[GitHub] incubator-metron pull request #404: METRON-624: Updated Comparison/Equality ...

2017-01-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-metron/pull/404




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen
>
> Nick's concerns about my suggestion were that it was overly complex and
> hard to grok and that we could dispense with backwards compatibility and
> make people do a bit more work on the default case for the benefits of a
> simpler advanced case. (Nick, make sure I don't misstate your position)


I will add that, in my mind, the majority case would be a user specifying
the outputs, but not things like 'batchSize' or 'when'.  I think in the
majority case, the user would accept whatever the default batch size is.

Here are alternative suggestions for all the examples that you provided
previously.

Base Case

   - The user must always specify the 'outputs' for clarity.
   - Uses default index name, batch size and when = true.

{
   'elasticsearch': {},
   'hdfs': {}
}

Writer-non-specific Case

   - There are no global overrides, as in Casey's proposal.
   - Easier to grok IMO.

{
   'elasticsearch': {
  'index': 'foo',
  'batchSize': 100
},
   'hdfs': {
  'index': 'foo',
  'batchSize': 100
}
}

Writer-specific case without filters

{
   'elasticsearch': {
  'index': 'foo',
  'batchSize': 1
},
   'hdfs': {
  'index': 'foo',
  'batchSize': 100
}
}

Writer-specific case with filters

   - Instead of having to say when=false, just don't configure HDFS

{
   'elasticsearch': {
  'index': 'foo',
  'batchSize': 100,
  'when': 'exists(field1)'
}
}





On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella  wrote:

> Dave,
> For the benefit of posterity and people who might not be as deeply
> entangled in the system as we have been, I'll recap things and hopefully
> answer your question in the process.
>
> Historically the index configuration is split between the enrichment
> configs and the global configs.
>
>    - The global configs really control settings that apply to all sensors.
>    Historically this has been stuff like index connection strings, etc.
>    - The sensor-specific configs control things that vary by sensor.
>
> As of Metron-652 (in review currently), we moved the sensor specific
> configs from the enrichment configs.  The proposal here is to increase the
> granularity of the sensor-specific files to make them support index
> writer-specific configs.  Right now in the indexing topology, we have 2
> writers (fixed): ES/Solr and HDFS.
>
> The proposed configuration would allow you to either specify a blanket
> sensor-level config for the index name and batchSize and/or override at the
> writer level, thereby supporting a couple of use-cases:
>
>- Turning off certain index writers (e.g. HDFS)
>- Filtering the messages written to certain index writers
>
> The two competing configs between Nick and me are as follows:
>
>- I want to make sure we keep the old sensor-specific defaults with
>writer-specific overrides available
>- Nick thought we could simplify the permutations by making the indexing
>config only the writer-level configs.
>
> My concerns about Nick's suggestion were that the default and majority
> case, specifying the index and the batchSize for all writers (the one we
> support now) would require more configuration.
>
> Nick's concerns about my suggestion were that it was overly complex and
> hard to grok and that we could dispense with backwards compatibility and
> make people do a bit more work on the default case for the benefits of a
> simpler advanced case. (Nick, make sure I don't misstate your position).
>
> Casey
>
>
> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle  wrote:
>
> > Casey,
> >
> > Can you give me a level set of what your thinking is now? I think it's
> > global control of all index types + overrides on a per-type basis. Fwiw,
> > I'm totally for that, but I want to make sure I'm not imposing my
> > preconceived notions on your consensus-driven ones.
> >
> > -D
> >
> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella 
> wrote:
> >
> > > I am suggesting that, yes.  The configs are essentially the same as yours,
> > > except there is an override specified at the top level.  Without that, in
> > > order to specify both HDFS and ES have batch sizes of 100, you have to
> > > explicitly configure each.  It's less that I'm trying to have backwards
> > > compatibility and more that I'm trying to make the majority case easy: both
> > > writers write everything to a specified index name with a specified batch
> > > size (which is what we have now).  Beyond that, I want to allow for
> > > specifying an override for the config on a writer-by-writer basis for those
> > > who need it.

[GitHub] incubator-metron pull request #416: METRON-656: Make Stellar 'in' closer to ...

2017-01-13 Thread mmiklavc
Github user mmiklavc commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/416#discussion_r96045537
  
--- Diff: 
metron-platform/metron-common/src/test/java/org/apache/metron/common/stellar/StellarTest.java
 ---
@@ -418,6 +417,33 @@ public void testList() throws Exception {
   }
 
   @Test
+  public void testInMap() throws Exception {
--- End diff --

Can we add a couple test cases for String values? 
```
runPredicate("'foo' in { 'foo' : 5 }"...
bar = 'foo'
runPredicate("'foo' in { bar : 5 }"...
```




Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Casey Stella
Ok, I'll try to give the historical context for the parser and enrichment
and indexing topologies since I was the one who made a lot of the mistakes.
;)

In the beginning there was only configuration changes at topology start and
this was done via those flux properties.  As the configs got more
complicated and more multi-dimensional, it became evident that properties
files weren't sufficient to contain them, so we started contemplating
externalizing the configs in JSON files.  During this investigation, we
contemplated using zookeeper because it was one of the usecases for that
tool, apache curator is awesome and it would enable us, later on, to
support runtime changes without topology restart.  So, pieces of the config
that were the most important to change ended up getting implemented into
zookeeper.  We did not go back and consider deeply the remaining configs in
the flux properties for potential retrofit into zookeeper.  We should
probably do eventually for consistency sake.  That being said, I don't know
of any addition to the flux properties for enrichment for a looong time, so
new configs end up in zookeeper.
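
For reference, the asynchronous piece described above looks roughly like the
sketch below, using Apache Curator's NodeCache recipe. The connect string and
znode path are made-up placeholders; the actual wiring in Metron's configured
bolts differs in the details.

```
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConfigWatcher {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Cache a config znode and get called back on every update; this is
    // what lets a running topology pick up changes without a restart.
    NodeCache cache = new NodeCache(client, "/metron/topology/enrichment/bro");
    cache.getListenable().addListener(() -> {
      if (cache.getCurrentData() != null) {
        System.out.println("config updated: "
            + new String(cache.getCurrentData().getData()));
      }
    });
    cache.start();

    Thread.currentThread().join();  // keep the watcher alive for the demo
  }
}
```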

I can't speak about the rationale for some of the profiler's flux
properties; I'll defer to Nick to cover that if there are questions.

On Fri, Jan 13, 2017 at 11:58 AM, Matt Foley  wrote:

> Gosh I hate being two hours behind you guys! :-)
>
> I’ll go back through the thread and collect open questions, but wanted to
> put a word in about Zookeeper integration.
> I had been about to ask what the benefits of using ZK are, and I’ve now
> heard two features:
> - A logical “single place” that is visible and efficiently accessible by
> all processes on all nodes.
> - Supports async notification, and therefore updates without restarts.
>
> If only the first point was there, it could be replaced by Ambari, because
> Ambari manages propagating configs so they look local to all nodes.  And
> btw, there’s no bottlenecking, because clients don’t call the Ambari REST
> APIs to *read* configs, just to *change* them.  For reading established
> configs, clients just go to those local files, which are managed by the
> ambari-agents.
>
> But the ability to use ZK and Curator to support async config updates, as
> we do, is really important.  In fact, I think if we make our use of it
> consistent we might offer an example to the Ambari team for a general
> feature they could adopt, precisely to support config changes without
> restart.
>
> Three additional considerations:
>
> 1. HBase has always used ZK for various things.  I don’t know if that
> includes configuration.  If so, that’s already integrated with Ambari.  We
> should look into the details of that.
>
> 2. Can folks who’ve been here a while clarify why use of ZK is so
> piecemeal, and scattered several places in the znode tree?  There seems to
> be a vague idea that “things that get changed” go in ZK, while other
> configs go in local files.  But all configs, by definition, can be
> changed.  Is there any real reason not to put the whole Metron
> configuration in ZK, with a clean and consistent directory structure?  It’s
> okay if Metron cannot actually consume all updates asynchronously (like
> some topology configs that require a topo restart if changed).  We just
> document which configs do and don’t support async change.  HDFS has lots of
> those.
>
> 3. I’m pretty sure we can suppress the “need to restart” warning from
> Ambari.  We’ll need to dig in to find out how flexible this is.
>
> I do think we should continue supporting non-Ambari use, and if we put all
> configs in ZK, that gets way easier to do in a simple and consistent way.
> (Propagation problem solved).  More thoughts after I have ‘em :-)
>
> Thanks,
> --Matt
>
>
> On 1/13/17, 8:30 AM, "Casey Stella"  wrote:
>
> I think that looks good.  One last question, do we support the manual
> install use-case (one where ambari does not exist, I mean)?
>
> Casey
>
> On Fri, Jan 13, 2017 at 11:28 AM, David Lyle 
> wrote:
>
> > That's good feedback, Jon. I think that puts us at:
> >
> >  - Expand ambari to manage the remaining sensor-specific configs
> >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I
> think)
> >to push to ambari and take an Ambari user/pw and (optionally)
> reason
> >  - We shall retain current functionality wrt live configuration
> changes.
> > Suggestion- ConfigurationUtils will push to both zookeeper and
> Ambari in an
> > atomic operation. (I suspect we can make ambari do this as well)
> >  - Refactor the middleware that Ryan submitted to have the API calls
> take
> >  an Ambari user/pw and (optionally) reason
> >  - Refactor the management UI to pass in an Ambari user/pw and
> (optionally)
> > reason
> >  - Refactor the Stellar Management functions CONFIG_PUT to accept an
> Ambari
> > user/pw and (optionally) reason

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
One thing that I really like about Nick's suggestion is that it allows
writer-specific configs in a clear and simple way.  It is more complex for
the default case (all writers write to indices named the same thing with a
fixed batch size), which I do not like, but maybe it's worth the compromise
to make it less complex for the advanced case.

Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning to
lean your way.

On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com  wrote:

> I like the suggestions you made, Nick.  The only thing I would add is that
> it's also nice to see an explicit when(false), as people newer to the
> platform may not know where to expect configs for the different writers.
> Being able to do it either way, which I think is already assumed in your
> model, would make sense.  I would just suggest that, if we support but are
> disabling a writer, that the platform inserts a default when(false) to be
> explicit.
>
> Jon
>
> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella  wrote:
>
> > Let me noodle on this over the weekend.  Your syntax is looking less
> > onerous to me and I like the following statement from Otto: "In the end,
> > each write destination ‘type’ will need its own configuration.  This is
> an
> > extension point."
> >
> > I may come around to your way of thinking.
> >
> > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler 
> > wrote:
> >
> > > In the end, each write destination ‘type’ will need its own
> > > configuration.  This is an extension point.
> > > {
> > > HDFS:{
> > > outputAdapters:[
> > > {name: avro,
> > > settings:{
> > > avro stuff….
> > > when:{
> > > },
> > > {
> > >  name: sequence file,
> > >  …..
> > >
> > > or some such.
> > >
> > >
> > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org)
> wrote:
> > >
> > > I will also add that instead of global overrides, like index, we should
> > use
> > > configuration key names that are more appropriate to the output.
> > >
> > > For example, does 'index' really make sense for HDFS? Or would 'path'
> be
> > > more appropriate?
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 1
> > > },
> > > 'hdfs': {
> > > 'path': '/foo/bar/...',
> > > 'batchSize': 100
> > > }
> > > }
> > >
> > > Ok, I've said my piece. Thanks for the effort in summarizing all this,
> > > Casey.
> > >
> > >
> > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen 
> wrote:
> > >
> > > > Nick's concerns about my suggestion were that it was overly complex
> and
> > > >> hard to grok and that we could dispense with backwards compatibility
> > and
> > > >> make people do a bit more work on the default case for the benefits
> > of a
> > > >> simpler advanced case. (Nick, make sure I don't misstate your
> > position)
> > > >
> > > >
> > > > What I will add is that in my mind, the majority case would be a user
> > > > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > > think
> > > > in the majority case, the user would accept whatever the default
> batch
> > > size
> > > > is.
> > > >
> > > > Here are alternatives suggestions for all the examples that you
> > provided
> > > > previously.
> > > >
> > > > Base Case
> > > >
> > > > - The user must always specify the 'outputs' for clarity.
> > > > - Uses default index name, batch size and when = true.
> > > >
> > > > {
> > > > 'elasticsearch': {},
> > > > 'hdfs': {}
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case> Writer-non-specific Case
> > > >
> > > > - There are no global overrides, as in Casey's proposal.
> > > > - Easier to grok IMO.
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters> Writer-specific case without filters
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 1
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters> Writer-specific case with filters
> > > >
> > > > - Instead of having to say when=false, just don't configure HDFS
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100,
> > > > 'when': 'exists(field1)'
> > > > }
> > > > }
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella 
> > > wrote:
> > > >
> > > >> Dave,
> > > >> For the benefit of posterity and people who might not be as deeply
> > > >> 

Re: [DISCUSS] Hosting Kraken maven artifacts in incubator-metron git repo

2017-01-13 Thread JJ Meyer
Has anyone used git's sub-modules before? My understanding is you just
point to an external repository. So *technically* I do not think the code
would be hosted in the main repo. Even if that was allowed, I have concerns
about how inactive the repo is. Could we fork this, make our changes, and
submit it to another apache project that it may make sense under? Is that
even allowed under their license?

On Fri, Jan 13, 2017 at 5:35 PM, Matt Foley  wrote:

> Perhaps it would be more appropriate to put it under
> https://dist.apache.org/repos/dist/release/incubator/metron/ , perhaps as
> https://dist.apache.org/repos/dist/release/incubator/metron/mvn-repo ?
>
> We should not host anything with a license that isn’t compatible with
> inclusion in an Apache project.  If we post only non-source artifacts, then
> that would include packages with “Category B List” licenses (that is,
> ‘"WEAK COPYLEFT" LICENSES’) as well as “Category A List” licenses (those
> “SIMILAR IN TERMS TO THE APACHE LICENSE 2.0”) -- per
> https://www.apache.org/legal/resolved .  For versioning, we could simply
> structure as a maven repo, and in fact that’s what I think we should do.
>
> Hosting the source code is not, I think, something we are supposed to do
> for non-Apache projects: https://www.apache.org/legal/resolved again,
> this time the very first question:
>
> CAN ASF PMCS HOST PROJECTS THAT ARE NOT UNDER THE APACHE LICENSE?
> No. See the Apache Software Foundation licenses page for more details,
> and the Apache Software Foundation page for additional background.
>
>
> On 1/13/17, 8:11 AM, "Billie Rinaldi"  wrote:
>
> No, we can't host artifacts in a git repo, or on a website. It would be
> like distributing a release that hasn't been voted upon.
>
> Regarding message threading, in Gmail adding a [tag] to the subject
> does
> not create a new thread. So the change is not visible in my mailbox
> unless
> the rest of the subject is changed as well.
>
> On Mon, Jan 9, 2017 at 1:00 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
> > This is a question primarily for the mentors.
> >
> > *Background*
> > metron-common is currently depending on the openSOC github repo for
> hosting
> > kraken artifacts. The original reason for this was that these jars
> are not
> > hosted in Maven Central, and they were not reliably available in the
> Kraken
> > repo. https://issues.apache.org/jira/browse/METRON-650 is tracking
> work
> > around copying these artifacts to the Metron repo.
> >
> > Kraken source on openSOC - https://github.com/OpenSOC/kraken
> > Kraken maven repo on openSOC -
> > https://github.com/OpenSOC/kraken/tree/mvn-repo
> >
> > *Ask*
> > Create a new branch in incubator-metron to host any necessary maven
> > artifacts. This branch would simply be incubator-metron/mvn-repo.
> This is
> > similar to how we've hosted the asf-site.
> >
> > *Concerns/Questions*
> >
> >1. Can we host these jars/artifacts in this manner?
> >2. Concerns regarding licensing?
> >3. Do we need to also grab and host the source code?
> >
>
>
>
>
>


Re: [DISCUSS] Hosting Kraken maven artifacts in incubator-metron git repo

2017-01-13 Thread Matt Foley
Perhaps it would be more appropriate to put it under 
https://dist.apache.org/repos/dist/release/incubator/metron/ , perhaps as 
https://dist.apache.org/repos/dist/release/incubator/metron/mvn-repo ?

We should not host anything with a license that isn’t compatible with inclusion 
in an Apache project.  If we post only non-source artifacts, then that would 
include packages with “Category B List” licenses (that is, ‘"WEAK COPYLEFT" 
LICENSES’) as well as “Category A List” licenses (those “SIMILAR IN TERMS TO 
THE APACHE LICENSE 2.0”) -- per  https://www.apache.org/legal/resolved .  For 
versioning, we could simply structure as a maven repo, and in fact that’s what 
I think we should do.

Hosting the source code is not, I think, something we are supposed to do for 
non-Apache projects: https://www.apache.org/legal/resolved again, this time the 
very first question:

CAN ASF PMCS HOST PROJECTS THAT ARE NOT UNDER THE APACHE LICENSE?
No. See the Apache Software Foundation licenses page for more details, and 
the Apache Software Foundation page for additional background.


On 1/13/17, 8:11 AM, "Billie Rinaldi"  wrote:

No, we can't host artifacts in a git repo, or on a website. It would be
like distributing a release that hasn't been voted upon.

Regarding message threading, in Gmail adding a [tag] to the subject does
not create a new thread. So the change is not visible in my mailbox unless
the rest of the subject is changed as well.

On Mon, Jan 9, 2017 at 1:00 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> This is a question primarily for the mentors.
>
> *Background*
> metron-common is currently depending on the openSOC github repo for 
hosting
> kraken artifacts. The original reason for this was that these jars are not
> hosted in Maven Central, and they were not reliably available in the 
Kraken
> repo. https://issues.apache.org/jira/browse/METRON-650 is tracking work
> around copying these artifacts to the Metron repo.
>
> Kraken source on openSOC - https://github.com/OpenSOC/kraken
> Kraken maven repo on openSOC -
> https://github.com/OpenSOC/kraken/tree/mvn-repo
>
> *Ask*
> Create a new branch in incubator-metron to host any necessary maven
> artifacts. This branch would simply be incubator-metron/mvn-repo. This is
> similar to how we've hosted the asf-site.
>
> *Concerns/Questions*
>
>1. Can we host these jars/artifacts in this manner?
>2. Concerns regarding licensing?
>3. Do we need to also grab and host the source code?
>






Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Nick Allen
+1  I strongly agree with Jon's view.   Requiring a restart would be a big
step backwards.

I think the power of the platform is that the user can act on live
streaming data in a quick, iterative fashion.  Adding enrichments, creating
triage rules, adjusting profiles are all operational activities that can be
performed at any time in response to active threats.




On Fri, Jan 13, 2017 at 10:59 AM, zeo...@gmail.com  wrote:

> Right, good conversation to bring up for sure.
>
> Just to comment on production generally only being updated during
> maintenance windows - I can tell you that my plans are to make my dev,
> test, and prod Metron a very dynamic and frequently changing environment
> which will have coordinated but frequent modifications and I strongly
> prefer not having to restart anywhere that I can.  Of course it will
> happen, but keeping it to a minimum is key.
>
> Jon
>
> On Fri, Jan 13, 2017 at 10:53 AM Nick Allen  wrote:
>
> > Makes sense, Dave.  I am totally clear on the proposal.  I just wanted to
> > ask the stupid question to bring the conversation full circle, leave no
> > stone unturned, insert favorite idiom here.
> >
> > On Fri, Jan 13, 2017 at 10:46 AM, David Lyle 
> wrote:
> >
> > > To be clear- NOBODY is suggesting replacing Zookeeper with Ambari.
> > >
> > > So, as a bit of a reset- here's what's being proposed:
> > >
> > >  - Expand ambari to manage the remaining sensor-specific configs
> > >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I
> think)
> > >to push to ambari and take an Ambari user/pw and (optionally) reason
> > >  - (Ambari can push to zookeeper, but it requires a service restart, so
> > for
> > > "live changes" you may
> > > want to do both a rest call and zookeeper update from
> > ConfigurationUtils)
> > > WAS
> > > Question remains about whether ambari can do the push to zookeeper
> > > or whether ConfigurationUtils has to push to zookeeper as
> > > well as update
> > > ambari.
> > >   - Refactor the middleware that Ryan submitted to have the API calls
> > take
> > >  an Ambari user/pw and (optionally) reason
> > >   - Refactor the management UI to pass in an Ambari user/pw and
> > > (optionally) reason
> > >   - Refactor the Stellar Management functions CONFIG_PUT to accept an
> > > Ambari user/pw and (optionally) reason
> > >
> > > -D...
> > >
> > >
> > >
> > > On Fri, Jan 13, 2017 at 10:42 AM, Ryan Merriman 
> > > wrote:
> > >
> > > > The driver for using Zookeeper is that it is asynchronous and accepts
> > > > callbacks.  Ambari would need to have that capability, otherwise we
> > have
> > > to
> > > > poll which is a deal breaker in my opinion.
> > > >
> > > > On Fri, Jan 13, 2017 at 9:28 AM, Casey Stella 
> > > wrote:
> > > >
> > > > > No, it was good to bring up, Nick.  I might have it wrong re:
> Ambari.
> > > > >
> > > > > Casey
> > > > >
> > > > > On Fri, Jan 13, 2017 at 10:27 AM, Nick Allen 
> > > wrote:
> > > > >
> > > > > > That makes sense.  I wasn't sure based on Matt's original
> > > > > > suggestion/description of Ambari, whether that was something that
> > > > Ambari
> > > > > > had also designed for or not.
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 10:14 AM, Casey Stella <
> ceste...@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Polling the Ambari server via REST (or their API if they have
> > one),
> > > > > would
> > > > > > > entail all workers hitting one server and create a single point
> > of
> > > > > > failure
> > > > > > > (the ambari server is what serves up REST).  Zookeeper's intent
> > is
> > > to
> > > > > not
> > > > > > > have a single point of failure like this and (one of its main)
> > > > > use-cases
> > > > > > is
> > > > > > > to serve up configs in a distributed environment.
> > > > > > >
> > > > > > > Casey
> > > > > > >
> > > > > > > On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen <
> n...@nickallen.org>
> > > > > wrote:
> > > > > > >
> > > > > > > > Let me ask a stupid question.  What does Zookeeper do for us
> > that
> > > > > > Ambari
> > > > > > > > cannot?  Why keep Zookeeper in the mix?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Jan 13, 2017 at 9:28 AM, David Lyle <
> > > dlyle65...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > In the main yes- I've made some changes:
> > > > > > > > >
> > > > > > > > >  - Expand ambari to manage the remaining sensor-specific
> > > configs
> > > > > > > > >  - Refactor the push calls to zookeeper (in
> > > ConfigurationUtils, I
> > > > > > > think)
> > > > > > > > >to push to ambari and take an Ambari user/pw and
> > > (optionally)
> > > > > > reason
> > > > > > > > >  - (Ambari can push to zookeeper, but it requires a service
> > > > > restart,
> > > > > > so
> > > > > > > > for
> > > > > > > > > "live changes" you may
> > > > > > > > > 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread David Lyle
Thanks Casey!

I think I had the right of it, but wanted to make sure.

I'm +1 on defaults in global with overrides in sensor-specific. At least in
the first iteration. I (like Otto) suspect we'll have a few go-arounds on
this.

-D...


On Fri, Jan 13, 2017 at 11:09 AM, Otto Fowler 
wrote:

> This is an excellent point
>
>
> On January 13, 2017 at 10:54:07, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Something else to consider here is the possibility of multiple indices
> within a given target technology.
>
> For example, if I’m indexing data from a given sensor into, say solr, I
> may want it filtered differently into two different indices. This would
> enable me to create different ‘views’ which could have different security
> settings applied in that backend. This would be useful for multi-tenant
> installs, and for differing data privilege levels within an organisation.
> You could argue that this is more a concern for filtering of the results
> coming out of an index, but currently this is a lot harder than using
> something like the ranger solr authorisation plugin to control access at an
> index by index granularity.
>
> Essentially, the indexer topology then becomes a filter and router, which
> argues for it being a separate step, before the process which actually
> writes out to each platform. It may also make sense to have a concept of a
> routing key built up by earlier enrichment to allow shuffle control in
> storm, rather than a full stellar statement for routing, to avoid overhead.
>
> Simon
>
> > On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> >
> > I am suggesting that, yes. The configs are essentially the same as yours,
> > except there is an override specified at the top level. Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each. It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy:
> both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now). Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for
> those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> >> Are you saying we support all of these variants? I realize you are
> trying
> >> to have some backwards compatibility, but this also makes it harder for
> a
> >> user to grok (for me at least).
> >>
> >> Personally I like my original example as there are fewer sub-structures,
> >> like 'writerConfig', which makes the whole thing simpler and easier to
> >> grok. But maybe others will think your proposal is just as easy to grok.
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >>
> >>> Ok, so here's what I'm thinking based on the discussion:
> >>>
> >>> - Keeping the configs that we have now (batchSize and index) as
> >> defaults
> >>> for the unspecified writer-specific case
> >>> - Adding the config Nick suggested
> >>>
> >>> *Base Case*:
> >>> {
> >>> }
> >>>
> >>> - all writers write all messages
> >>> - index named the same as the sensor for all writers
> >>> - batchSize of 1 for all writers
> >>>
> >>> *Writer-non-specific case*:
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 100
> >>> }
> >>>
> >>> - All writers write all messages
> >>> - index is named "foo", different from the sensor for all writers
> >>> - batchSize is 100 for all writers
> >>>
> >>> *Writer-specific case without filters*
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>> {
> >>> "elasticsearch" : {
> >>> "batchSize" : 100
> >>> }
> >>> }
> >>> }
> >>>
> >>> - All writers write all messages
> >>> - index is named "foo", different from the sensor for all writers
> >>> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> >>> - NOTE: I could override the index name too
> >>>
> >>> *Writer-specific case with filters*
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>> {
> >>> "elasticsearch" : {
> >>> "batchSize" : 100,
> >>> "when" : "exists(field1)"
> >>> },
> >>> "hdfs" : {
> >>> "when" : "false"
> >>> }
> >>> }
> >>> }
> >>>
> >>> - ES writer writes messages which have field1, HDFS doesn't
> >>> - index is named "foo", different from the sensor for all writers
> >>> - 100 for elasticsearch writers
> >>>
> >>> Thoughts?
> >>>
> >>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> >>> wrote:
> >>>
>  For larger installations you need to control what is indexed so you
> >> don’t
>  end up with a nasty elastic search situation and so you can mine the
> >> data
>  later for reports and training ml models.
> 
>  Thanks
>  Carolyn
> 
> 
> 
> 
>  On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread David Lyle
That's good feedback, Jon. I think that puts us at:

 - Expand ambari to manage the remaining sensor-specific configs
 - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
   to push to ambari and take an Ambari user/pw and (optionally) reason
 - We shall retain current functionality wrt live configuration changes.
Suggestion- ConfigurationUtils will push to both zookeeper and Ambari in an
atomic operation. (I suspect we can make ambari do this as well)
 - Refactor the middleware that Ryan submitted to have the API calls take
 an Ambari user/pw and (optionally) reason
 - Refactor the management UI to pass in an Ambari user/pw and (optionally)
reason
 - Refactor the Stellar Management functions CONFIG_PUT to accept an Ambari
user/pw and (optionally) reason
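
For illustration, the push-to-Ambari piece of the second bullet could look
roughly like the sketch below, which creates a new desired_config version via
Ambari's REST API. The cluster name, config type, and property names are
hypothetical placeholders, and the version note is used here to carry the
optional "reason":

```
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariConfigPush {
  public static void main(String[] args) throws Exception {
    String user = "admin", password = "admin";  // the Ambari user/pw above
    URL url = new URL("http://ambari-host:8080/api/v1/clusters/metron_cluster");

    // A new desired_config tag makes Ambari record a new config version;
    // the note carries the optional "reason".
    String body = "{\"Clusters\":{\"desired_config\":{"
        + "\"type\":\"metron-parsers-env\","
        + "\"tag\":\"version" + System.currentTimeMillis() + "\","
        + "\"properties\":{\"parser_config\":\"{ }\"},"
        + "\"service_config_version_note\":\"tightened squid filter\"}}}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("X-Requested-By", "ambari");  // required by Ambari
    conn.setRequestProperty("Authorization", "Basic " + Base64.getEncoder()
        .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8)));
    conn.setDoOutput(true);
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("Ambari responded: " + conn.getResponseCode());
  }
}
```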

-D...


On Fri, Jan 13, 2017 at 11:17 AM, Nick Allen  wrote:

> +1  I strongly agree with Jon's view.   Requiring a restart would be a big
> step backwards.
>
> I think the power of the platform is that the user can act on live
> streaming data in a quick, iterative fashion.  Adding enrichments, creating
> triage rules, adjusting profiles are all operational activities that can be
> performed at any time in response to active threats.
>
>
>
>
> On Fri, Jan 13, 2017 at 10:59 AM, zeo...@gmail.com 
> wrote:
>
> > Right, good conversation to bring up for sure.
> >
> > Just to comment on production generally only being updated during
> > maintenance windows - I can tell you that my plans are to make my dev,
> > test, and prod Metron a very dynamic and frequently changing environment
> > which will have coordinated but frequent modifications and I strongly
> > prefer not having to restart anywhere that I can.  Of course it will
> > happen, but keeping it to a minimum is key.
> >
> > Jon
> >
> > On Fri, Jan 13, 2017 at 10:53 AM Nick Allen  wrote:
> >
> > > Makes sense, Dave.  I am totally clear on the proposal.  I just wanted
> to
> > > ask the stupid question to bring the conversation full circle, leave no
> > > stone unturned, insert favorite idiom here.
> > >
> > > On Fri, Jan 13, 2017 at 10:46 AM, David Lyle 
> > wrote:
> > >
> > > > To be clear- NOBODY is suggesting replacing Zookeeper with Ambari.
> > > >
> > > > So, as a bit of a reset- here's what's being proposed:
> > > >
> > > >  - Expand ambari to manage the remaining sensor-specific configs
> > > >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I
> > think)
> > > >to push to ambari and take an Ambari user/pw and (optionally)
> reason
> > > >  - (Ambari can push to zookeeper, but it requires a service restart,
> so
> > > for
> > > > "live changes" you may
> > > > want to do both a rest call and zookeeper update from
> > > ConfigurationUtils)
> > > > WAS
> > > > Question remains about whether ambari can do the push to
> zookeeper
> > > > or whether ConfigurationUtils has to push to zookeeper
> as
> > > > well as update
> > > > ambari.
> > > >   - Refactor the middleware that Ryan submitted to have the API calls
> > > take
> > > >  an Ambari user/pw and (optionally) reason
> > > >   - Refactor the management UI to pass in an Ambari user/pw and
> > > > (optionally) reason
> > > >   - Refactor the Stellar Management functions CONFIG_PUT to accept an
> > > > Ambari user/pw and (optionally) reason
> > > >
> > > > -D...
> > > >
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 10:42 AM, Ryan Merriman  >
> > > > wrote:
> > > >
> > > > > The driver for using Zookeeper is that it is asynchronous and
> accepts
> > > > > callbacks.  Ambari would need to have that capability, otherwise we
> > > have
> > > > to
> > > > > poll which is a deal breaker in my opinion.
> > > > >
> > > > > On Fri, Jan 13, 2017 at 9:28 AM, Casey Stella 
> > > > wrote:
> > > > >
> > > > > > No, it was good to bring up, Nick.  I might have it wrong re:
> > Ambari.
> > > > > >
> > > > > > Casey
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 10:27 AM, Nick Allen  >
> > > > wrote:
> > > > > >
> > > > > > > That makes sense.  I wasn't sure based on Matt's original
> > > > > > > suggestion/description of Ambari, whether that was something
> that
> > > > > Ambari
> > > > > > > had also designed for or not.
> > > > > > >
> > > > > > > On Fri, Jan 13, 2017 at 10:14 AM, Casey Stella <
> > ceste...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Polling the Ambari server via REST (or their API if they have
> > > one),
> > > > > > would
> > > > > > > > entail all workers hitting one server and create a single
> point
> > > of
> > > > > > > failure
> > > > > > > > (the ambari server is what serves up REST).  Zookeeper's
> intent
> > > is
> > > > to
> > > > > > not
> > > > > > > > have a single point of failure like this and (one of its
> main)
> > > > > > use-cases
> > > > > > > is
> > > > > > > > to serve up configs 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
Yeah, I tend to like the first option too.  Any opposition to that from
anyone?

The points brought up are good ones and I think that it may be worth a
broader discussion of the requirements of indexing in a separate dev list
thread.  Maybe a list of desires with coherent use-cases justifying them so
we can think about how this stuff should work and where the natural
extension points should be.  After all, we need to walk the line between
engineering and overengineering for features nobody will want.

I'm not sure about the extensions to the standard fields.  I'm torn between
the notions that we should have no standard fields vs we should have a
boatload of standard fields (with most of them empty).  I exchange
positions fairly regularly on that question. ;)  It may be worth a dev list
discussion to lay out how you imagine an extension of standard fields and
how it might look as implemented in Metron.

Casey

On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson 
wrote:

> I'll second my preference for the first option. I think the ability to use
> Stellar filters to customize indexing would be a big win.
>
> I'm glad Matt brought up the point about data lake and CEP. I think this is
> a really important use case that we need to consider. Take a simple
> example... If I have data coming in from 3 different firewall vendors and 2
> different web proxy/url filtering vendors and I want to be able to analyze
> that data set, I need the data to be indexed all together (likely in HDFS)
> and to have a normalized schema such that IP address, URL, and user name
> (to take a few) can be easily queried and aggregated. I can also envision
> scenarios where I would want to index data based on attributes other than
> sensor, business unit or subsidiary for example.
>
> I've been wanting to propose extending our 7 standard fields to include
> things like URL and user. Is there community interest/support for moving in
> that direction? If so, I'll start a new thread.
>
> Thanks!
>
> -Kyle
>
> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
>
> > Ah, I see.  If overriding the default index name allows using the same
> > name for multiple sensors, then the goal can be achieved.
> > Thanks,
> > --Matt
> >
> >
> > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
> >
> > Oh, you could!  Let's say you have a syslog parser with data from
> > sources 1
> > 2 and 3.  You'd end up with one kafka queue with 3 parsers attached
> to
> > that
> > queue, each picking part the messages from source 1, 2 and 3.  They'd
> > go
> > through separate enrichment and into the indexing topology.  In the
> > indexing topology, you could specify the same index name "syslog" and
> > all
> > of the messages go into the same index for CEP querying if so
> desired.
> >
> > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
> wrote:
> >
> > > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > previous
> > > life.  It makes perfect sense to route different lines from syslog
> > through
> > > different appropriate parsers.  But a lot of what the parsers do is
> > > identify consistent subsets of metadata and annotate it – eg,
> > src_ip_addr,
> > > event timestamps, etc.  Once those metadata are annotated and
> > available
> > > with common field names, why doesn’t it make sense to index the
> > messages
> > > together, for CEP querying?  I think Splunk has illustrated this
> > model.
> > >
> > > On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
> > >
> > > yeah, I mean, honestly, I think the approach that we've taken
> for
> > > sources
> > > which aggregate different types of data is to provide filters
> at
> > the
> > > parser
> > > level and have multiple parser topologies (with different,
> > possibly
> > > mutually exclusive filters) running.  This would be a
> completely
> > > separate
> > > sensor.  Imagine a syslog data source that aggregates and you
> > want to
> > > pick
> > > apart certain pieces of messages.  This is why the initial
> > thought and
> > > architecture was one index per sensor.
> > >
> > > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley 
> > wrote:
> > >
> > > > I’m thinking that CEP (Complex Event Processing) is contrary
> > to the
> > > idea
> > > > of silo-ing data per sensor.
> > > > Now it’s true that some of those sensors are already
> > aggregating
> > > data from
> > > > multiple sources, so maybe I’m wrong here.
> > > > But it just seems to me that the “data lake” insights come
> from
> > > being able
> > > > to make decisions over the whole mass of data rather than
> just
> > > vertical
> > > > slices of it.
> > > >
> > > > On 1/12/17, 2:15 PM, "Casey Stella" 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Casey Stella
I would suggest not having Ambari replace zookeeper.  I think the proposal
is to have Ambari replace the editable store (like the JSON files on
disk).  Zookeeper would be the source of truth for the running topologies
and ambari would be sync'd to it.

Correct me if I misspeak, Dave or Matt.

Casey

On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen  wrote:

> Ambari seems like a logical choice.
>
> *>> It doesn’t natively integrate Zookeeper storage of configs, but there
> is a natural place to specify copy to/from Zookeeper for the files
> desired.*
>
> How would Ambari interact with Zookeeper in this scenario?  Would Ambari
> replace Zookeeper completely? Or would Zookeeper act as the persistence
> tier under Ambari?
>
>
>
>
> On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley  wrote:
>
> > Mike, could you try again on the image, please, making sure it is a
> simple
> > format (gif, png, or jpeg)?  It got munched, at least in my viewer.
> Thanks.
> >
> > Casey, responding to some of the questions you raised:
> >
> > I’m going to make a rather strong statement:  We already have a service
> > “to intermediate and handle config update/retrieval”.
> > Furthermore, it:
> > - Correctly handles the problems of distributed services running on
> > multi-node clusters.  (That’s a HARD problem, people, and we shouldn’t
> try
> > to reinvent the wheel.)
> > - Correctly handles Kerberos security. (That’s kinda hard too, or at
> least
> > a lot of work.)
> > - It does automatic versioning of configurations, and allows viewing,
> > comparing, and reverting historical configs
> > - It has a capable REST API for all those things.
> > It doesn’t natively integrate Zookeeper storage of configs, but there is
> a
> > natural place to specify copy to/from Zookeeper for the files desired.
> >
> > It is Ambari.  And we should commit to it, rather than try to re-create
> > such features.
> > Because it has a good REST API, it is perfectly feasible to implement
> > Stellar functions that call it.
> > GUI configuration tools can also use the Ambari APIs, or better yet be
> > integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler
> > Configuration Tool” example in the Ambari documentation, under “Using
> > Ambari Views”.)
> >
> > Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and Not
> > spending weeks and weeks of developer time over the next year reinventing
> > the wheel while getting details wrong multiple times…
> >
> > Okay, off soapbox.
> >
> > Casey asked what the config update behavior of Ambari is, and how it will
> > interact with changes made from outside Ambari.
> > The following is from my experience working with the Ambari Mpack for
> > Metron.  I am not otherwise an Ambari expert, so tomorrow I’ll get it
> > reviewed by an Ambari development engineer.
> >
> > Ambari-server runs on one node, and Ambari-agent runs on each of the
> > nodes.
> > Ambari-server has a private set of py, xml, and template files, which
> > together are used both to generate the Ambari configuration GUI, with
> > defaults, and to generate configuration files (of any needed filetype)
> for
> > the various Stack components.
> > Ambari-server also has a database where it stores the schema related to
> > these files, so even if you reach in and edit Ambari’s files, it will
> > error
> > out if the set of parameters or parameter names changes.  The historical
> > information about configuration changes is also stored in the db.
> > For each component (and in the case of Metron, for each topology), there
> > is a python file which controls the logic for these actions, among
> others:
> > - Install
> > - Start / stop / restart / status
> > - Configure
> >
> > It is actually up to this python code (which we wrote for the Metron
> > Mpack) what happens in each of these API calls.  But the current code,
> and
> > I believe this is typical of Ambari-managed components, performs a
> > “Configure” action whenever you press the “Save” button after changing a
> > component config in Ambari, and also on each Install and Start or
> Restart.
> >
> > The Configure action consists of approximately the following sequence
> (see
> > disclaimer above :-)
> > - Recreate the generated config files, using the template files and the
> > actual configuration most recently set in Ambari
> > - Note this is also under the control of python code that we wrote, and
> > this is the appropriate place to push to ZK if desired.
> > - Propagate those config files to each Ambari-agent, with a command to
> set
> > them locally
> > - The ambari-agents on each node receive the files and write them to the
> > specified locations on local storage
> >
> > Ambari-server then whines that the updated services should be restarted,
> > but does not initiate that action itself (unless of course the initiating
> > action was a Start command from the administrator).
> >
> > Make sense?  It’s all quite straightforward in concept, there’s 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
OH that's a good idea!

On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:

> I like the "Index Filtering" option based on the flexibility that it
> provides.  Should each output (HDFS, ES, etc) have its own configuration
> settings?  For example, aren't things like batching handled separately for
> HDFS versus Elasticsearch?
>
> Something along the lines of...
>
> {
>   "hdfs" : {
>     "when": "exists(field1)",
>     "batchSize": 100
>   },
>
>   "elasticsearch" : {
>     "when": "true",
>     "batchSize": 1000,
>     "index": "squid"
>   }
> }
>
>
>
>
>
>
>
>
> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  wrote:
>
> > Yeah, I tend to like the first option too.  Any opposition to that from
> > anyone?
> >
> > The points brought up are good ones and I think that it may be worth a
> > broader discussion of the requirements of indexing in a separate dev list
> > thread.  Maybe a list of desires with coherent use-cases justifying them
> so
> > we can think about how this stuff should work and where the natural
> > extension points should be.  Afterall, we need to toe the line between
> > engineering and overengineering for features nobody will want.
> >
> > I'm not sure about the extensions to the standard fields.  I'm torn
> between
> > the notions that we should have no standard fields vs we should have a
> > boatload of standard fields (with most of them empty).  I exchange
> > positions fairly regularly on that question. ;)  It may be worth a dev
> list
> > discussion to lay out how you imagine an extension of standard fields and
> > how it might look as implemented in Metron.
> >
> > Casey
> >
> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > kylerichards...@gmail.com>
> > wrote:
> >
> > > I'll second my preference for the first option. I think the ability to
> > use
> > > Stellar filters to customize indexing would be a big win.
> > >
> > > I'm glad Matt brought up the point about data lake and CEP. I think
> this
> > is
> > > a really important use case that we need to consider. Take a simple
> > > example... If I have data coming in from 3 different firewall vendors
> > and 2
> > > different web proxy/url filtering vendors and I want to be able to
> > analyze
> > > that data set, I need the data to be indexed all together (likely in
> > HDFS)
> > > and to have a normalized schema such that IP address, URL, and user
> name
> > > (to take a few) can be easily queried and aggregated. I can also
> envision
> > > scenarios where I would want to index data based on attributes other
> than
> > > sensor, business unit or subsidiary for example.
> > >
> > > I've been wanting to propose extending our 7 standard fields to include
> > > things like URL and user. Is there community interest/support for
> moving
> > in
> > > that direction? If so, I'll start a new thread.
> > >
> > > Thanks!
> > >
> > > -Kyle
> > >
> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
> > >
> > > > Ah, I see.  If overriding the default index name allows using the
> same
> > > > name for multiple sensors, then the goal can be achieved.
> > > > Thanks,
> > > > --Matt
> > > >
> > > >
> > > > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
> > > >
> > > > Oh, you could!  Let's say you have a syslog parser with data from
> > > > sources 1
> > > > 2 and 3.  You'd end up with one kafka queue with 3 parsers
> attached
> > > to
> > > > that
> > > > queue, each picking part the messages from source 1, 2 and 3.
> > They'd
> > > > go
> > > > through separate enrichment and into the indexing topology.  In
> the
> > > > indexing topology, you could specify the same index name "syslog"
> > and
> > > > all
> > > > of the messages go into the same index for CEP querying if so
> > > desired.
> > > >
> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
> > > wrote:
> > > >
> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > > > previous
> > > > > life.  It makes perfect sense to route different lines from
> > syslog
> > > > through
> > > > > different appropriate parsers.  But a lot of what the parsers
> do
> > is
> > > > > identify consistent subsets of metadata and annotate it – eg,
> > > > src_ip_addr,
> > > > > event timestamps, etc.  Once those metadata are annotated and
> > > > available
> > > > > with common field names, why doesn’t it make sense to index the
> > > > messages
> > > > > together, for CEP querying?  I think Splunk has illustrated
> this
> > > > model.
> > > > >
> > > > > On 1/12/17, 3:00 PM, "Casey Stella" 
> wrote:
> > > > >
> > > > > yeah, I mean, honestly, I think the approach that we've
> taken
> > > for
> > > > > sources
> > > > > which aggregate different types of data is to provide
> filters
> > > at
> > > > the
> > > > > parser
> > > > > level and have 

Re: [PROPOSAL] up-to-date versioned documentation

2017-01-13 Thread Otto Fowler
I think something that does what you have laid out here, no matter the
implementation details would be ideal


On January 12, 2017 at 18:05:24, Matt Foley (ma...@apache.org) wrote:

We currently have three forms of documentation, with the following
advantages and disadvantages:

|| Docs || Pro || Con ||
| CWiki | Easy to edit, no special tools required, don't have to be a developer to contribute, google and wiki search | Not versioned, no review process, distant from the code, obsolete content tends to accumulate |
| Site | Versioned and reviewed, only committers can edit, google search | Yet another arcane toolset must be learned, only web programmers feel comfortable contributing, "asf-site" branch not related to code versions, distant from the code, tends to go obsolete due to non-maintenance |
| README.md | Versioned and reviewed, only committers can edit, tied to code versions, highly local to the code being documented | Non-developers don't know about them, may be scared by github, poor scoring in google search, no high-level presentation |

Various discussion threads indicate the developer community likes
README-based docs, and it's easy to see why from the above. I propose this
extension to the README-based documentation, to address their
disadvantages:

1. Produce a script that gathers the README.md files from all code
subdirectories into a hierarchical list (a minimal sketch of such a script
appears after this list). The script would have an exclusion list for
non-user-content, which at this point would consist of [site/*,
build_utils/*]. The hierarchy would be sorted depth-first. The resulting
hierarchical list at this time (with six added README files to complete the
hierarchy) would be:

./README.md
./metron-analytics/README.md <== (need file here)
./metron-analytics/metron-maas-service/README.md
./metron-analytics/metron-profiler/README.md
./metron-analytics/metron-profiler-client/README.md
./metron-analytics/metron-statistics/README.md
./metron-deployment/README.md
./metron-deployment/amazon-ec2/README.md
./metron-deployment/packaging/README.md <== (need file here)
./metron-deployment/packaging/ambari/README.md <== (need file here)
./metron-deployment/packaging/docker/ansible-docker/README.md
./metron-deployment/packaging/docker/rpm-docker/README.md
./metron-deployment/packer-build/README.md
./metron-deployment/roles/README.md <== (need file here)
./metron-deployment/roles/kibana/README.md
./metron-deployment/roles/monit/README.md
./metron-deployment/roles/opentaxii/README.md
./metron-deployment/roles/pcap_replay/README.md
./metron-deployment/roles/sensor-test-mode/README.md
./metron-deployment/vagrant/README.md <== (need file here)
./metron-deployment/vagrant/codelab-platform/README.md
./metron-deployment/vagrant/fastcapa-test-platform/README.md
./metron-deployment/vagrant/full-dev-platform/README.md
./metron-deployment/vagrant/quick-dev-platform/README.md
./metron-platform/README.md
./metron-platform/metron-api/README.md
./metron-platform/metron-common/README.md
./metron-platform/metron-data-management/README.md
./metron-platform/metron-enrichment/README.md
./metron-platform/metron-indexing/README.md
./metron-platform/metron-management/README.md
./metron-platform/metron-parsers/README.md
./metron-platform/metron-pcap-backend/README.md
./metron-sensors/README.md <== (need file here)
./metron-sensors/fastcapa/README.md
./metron-sensors/pycapa/README.md

2. Arrange to run this script as part of the build process, and commit the
resulting hierarchy list to someplace in the versioned and branched ./site/
subdirectory.

3. Produce a "doc reader" web page that takes in this hierarchy of .md
pages, and presents a LHS doc tree of links, and a main display area for a
currently selected file. If we want to get fancy, this page would also
provide: (a) telescoping (collapse/expand) of the doc tree; (b) floating
next/prev/up/home buttons in the display area.

4. Add to this web page a pull-down menu that selects among all the
release versions of Metron, and (if not running in the Apache site) a
SNAPSHOT version for the current filesystem version (for developer
preview). Let it re-write the file paths per release version to the proper
release tag in github. This web page will therefore be version-independent.
Put it in the asf-site branch of the Apache site, as the new "docs"
sub-site from the home web page. Update the list of releases at each
release, or if we want to get fancy, teach it to read the release tags from
github.

5. As part of the release process, the release manager (a) assures the
release is tagged in github with a consistent naming convention, and (b)
submits the new hierarchy of links to google search (there's an api for
that).

6. Deprecate the use of cwiki for anything but long-lived
demonstrations/tutorials that are unlikely to go obsolete.
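
For concreteness, here is a minimal sketch of the gathering script from step
1, written in Java (a few lines of shell would work equally well); the
exclusion list and ordering follow the description above, with lexicographic
path order approximating the depth-first listing:

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class ReadmeGatherer {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    // exclusion list for non-user-content, per step 1
    List<Path> excluded =
        Arrays.asList(root.resolve("site"), root.resolve("build_utils"));

    try (Stream<Path> paths = Files.walk(root)) {
      paths.filter(p -> p.getFileName().toString().equals("README.md"))
           .filter(p -> excluded.stream().noneMatch(p::startsWith))
           .sorted()
           .forEach(System.out::println);
    }
  }
}
```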


Do folks feel this would be a good contribution to the visibility,
timeliness, and usability of our docs?
Is this an adequate solution for the current problems?

Thanks,
--Matt


Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler
I prefer option 1 with Stellar, although I'm concerned that in a real-world
scenario the number of filters and rules might be large, and some thought
will need to be given to the structure of the rule expressions for
maintainability, etc.


On January 12, 2017 at 15:52:03, Casey Stella (ceste...@gmail.com) wrote:

As of METRON-652, we
will have decoupled the indexing configuration from the enrichment
configuration. As an immediate follow-up to that, I'd like to provide the
ability to turn writers off and on via the configs. I'd like to get some
community feedback on how the functionality should work, if y'all are
amenable. :)


As of now, we have 3 possible writers which can be used in the indexing
topology:

- Solr
- Elasticsearch
- HDFS

HDFS is always used; Elasticsearch or Solr is used depending on how you
start the indexing topology.

A couple of proposals come to mind immediately:

*Index Filtering*

You would be able to specify a filter as defined by a stellar statement
(likely a reuse of the StellarFilter that exists in the Parsers) which
would allow you to indicate on a message-by-message basis whether or not to
write the message.

The semantics of this would be as follows:

- Default (i.e. unspecified) is to pass everything through (hence
backwards compatible with the current default config).
- Messages which have the associated stellar statement evaluate to true
for the writer type will be written, otherwise not.


Sample indexing config which would write out no messages to HDFS and write
out only messages containing a field called "field1":
{
  "index" : "squid"
  ,"batchSize" : 100
  ,"filters" : {
    "HDFS" : "false"
    ,"ES" : "exists(field1)"
  }
}
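
To make those semantics concrete, here is a minimal sketch of the per-writer
gating logic. The evaluate() helper is a hypothetical stand-in for the real
Stellar predicate evaluator, not Metron's actual API:

```
import java.util.Map;

public class WriterFilter {

  // Per the semantics above: no filter configured for a writer means
  // everything passes through; otherwise the Stellar predicate decides.
  public static boolean shouldWrite(Map<String, String> filters,
                                    String writerName,
                                    Map<String, Object> message) {
    String expr = (filters == null) ? null : filters.get(writerName);
    if (expr == null) {
      return true;  // unspecified: backwards-compatible default
    }
    return evaluate(expr, message);
  }

  // Hypothetical stand-in: Metron would delegate to the Stellar engine here.
  private static boolean evaluate(String expr, Map<String, Object> message) {
    if ("true".equalsIgnoreCase(expr)) return true;
    if ("false".equalsIgnoreCase(expr)) return false;
    // crude handling of the "exists(field1)" example only
    if (expr.startsWith("exists(") && expr.endsWith(")")) {
      return message.containsKey(expr.substring(7, expr.length() - 1));
    }
    return false;
  }
}
```

With the sample config above, shouldWrite(filters, "HDFS", msg) is always
false, and shouldWrite(filters, "ES", msg) is true only when msg contains
field1.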

*Index On/Off Switch*

A simpler solution would be to just provide a list of writers to write
messages. The semantics would be as follows:

- If the list is unspecified, then the default is to write all messages
for every writer in the indexing topology
- If the list is specified, then a writer will write all messages if and
only if it is named in the list.

Sample indexing config which turns off HDFS and keeps on Elasticsearch:
{
  "index" : "squid"
  ,"batchSize" : 100
  ,"writers" : [ "ES" ]
}

Thanks in advance for the feedback! Also, if you have any other, better
ideas than the ones presented here, let me know too.

Best,

Casey


Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Carolyn Duby
For larger installations you need to control what is indexed so you don’t end 
up with a nasty elastic search situation and so you can mine the data later for 
reports and training ml models.

Thanks
Carolyn




On 1/13/17, 9:40 AM, "Casey Stella"  wrote:

>OH that's a good idea!
>
>On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
>
>> I like the "Index Filtering" option based on the flexibility that it
>> provides.  Should each output (HDFS, ES, etc) have its own configuration
>> settings?  For example, aren't things like batching handled separately for
>> HDFS versus Elasticsearch?
>>
>> Something along the lines of...
>>
>> {
>>   "hdfs" : {
>>     "when": "exists(field1)",
>>     "batchSize": 100
>>   },
>>
>>   "elasticsearch" : {
>>     "when": "true",
>>     "batchSize": 1000,
>>     "index": "squid"
>>   }
>> }
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  wrote:
>>
>> > Yeah, I tend to like the first option too.  Any opposition to that from
>> > anyone?
>> >
>> > The points brought up are good ones and I think that it may be worth a
>> > broader discussion of the requirements of indexing in a separate dev list
>> > thread.  Maybe a list of desires with coherent use-cases justifying them
>> so
>> > we can think about how this stuff should work and where the natural
>> > extension points should be.  After all, we need to walk the line between
>> > engineering and overengineering for features nobody will want.
>> >
>> > I'm not sure about the extensions to the standard fields.  I'm torn
>> between
>> > the notions that we should have no standard fields vs we should have a
>> > boatload of standard fields (with most of them empty).  I exchange
>> > positions fairly regularly on that question. ;)  It may be worth a dev
>> list
>> > discussion to lay out how you imagine an extension of standard fields and
>> > how it might look as implemented in Metron.
>> >
>> > Casey
>> >
>> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
>> > kylerichards...@gmail.com>
>> > wrote:
>> >
>> > > I'll second my preference for the first option. I think the ability to
>> > use
>> > > Stellar filters to customize indexing would be a big win.
>> > >
>> > > I'm glad Matt brought up the point about data lake and CEP. I think
>> this
>> > is
>> > > a really important use case that we need to consider. Take a simple
>> > > example... If I have data coming in from 3 different firewall vendors
>> > and 2
>> > > different web proxy/url filtering vendors and I want to be able to
>> > analyze
>> > > that data set, I need the data to be indexed all together (likely in
>> > HDFS)
>> > > and to have a normalized schema such that IP address, URL, and user
>> name
>> > > (to take a few) can be easily queried and aggregated. I can also
>> envision
>> > > scenarios where I would want to index data based on attributes other
>> than
>> > > sensor, business unit or subsidiary for example.
>> > >
>> > > I've been wanting to propose extending our 7 standard fields to include
>> > > things like URL and user. Is there community interest/support for
>> moving
>> > in
>> > > that direction? If so, I'll start a new thread.
>> > >
>> > > Thanks!
>> > >
>> > > -Kyle
>> > >
>> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
>> > >
>> > > > Ah, I see.  If overriding the default index name allows using the
>> same
>> > > > name for multiple sensors, then the goal can be achieved.
>> > > > Thanks,
>> > > > --Matt
>> > > >
>> > > >
>> > > > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
>> > > >
>> > > > Oh, you could!  Let's say you have a syslog parser with data from
>> > > > sources 1
>> > > > 2 and 3.  You'd end up with one kafka queue with 3 parsers
>> attached
>> > > to
>> > > > that
>> > > > queue, each picking part the messages from source 1, 2 and 3.
>> > They'd
>> > > > go
>> > > > through separate enrichment and into the indexing topology.  In
>> the
>> > > > indexing topology, you could specify the same index name "syslog"
>> > and
>> > > > all
>> > > > of the messages go into the same index for CEP querying if so
>> > > desired.
>> > > >
>> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
>> > > wrote:
>> > > >
>> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a
>> > > > previous
>> > > > > life.  It makes perfect sense to route different lines from
>> > syslog
>> > > > through
>> > > > > different appropriate parsers.  But a lot of what the parsers
>> do
>> > is
>> > > > > identify consistent subsets of metadata and annotate it – eg,
>> > > > src_ip_addr,
>> > > > > event timestamps, etc.  Once those metadata are annotated and
>> > > > available
>> > > > > with common field names, why doesn’t it make sense to index the
>> > > > messages
>> > > > > together, for CEP querying?  I think Splunk has 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread David Lyle
The only tooling I'm aware of that Ambari isn't already using is the
Stellar stuff; is there more?

Regardless, I'd always push from Ambari to zookeeper and let other tooling
talk to Ambari (Casey's first bullet). The only wrinkle is we have to
decide if we want to support manual installation. Fwiw, I do. If we did,
we'd need to do a bit of mode selection to support both. But the happy path
would be to do stuff (human or machine) via Ambari.

-D...


On Fri, Jan 13, 2017 at 9:01 AM, Casey Stella  wrote:

> Just piling on in support for Ambari.  I really, really don't like
> reinventing wheels, especially hard ones.  I guess my questions now are
> mainly around technical feasibility.  Seems to me that we can either:
>
>- retrofit the tooling that currently manages configs to use the Ambari
>APIs as well as pushing to zookeeper
>- have a service listening to zookeeper and pushing changes to ambari to
>keep it in sync
>- Something that I may have missed
>
> Each of those has pros and cons.  Thoughts?
>
> Casey
>
> On Fri, Jan 13, 2017 at 8:53 AM, David Lyle  wrote:
>
> > I'm in complete agreement with all the points Matt made. I think the way
> > forward should be to expose ALL user-modifiable configs via Ambari and
> let
> > Ambari actively manage them. We should keep the command line tools as the
> > backend and Ambari should continue to leverage them. This will allow
> manual
> > installation/management if desired and will ensure the command line
> scripts
> > are kept up to date.
> >
> > Fully leveraging Ambari has many beneficial effects. My top four:
> >Provides proper revision control for the configurations
> >Scales easily into things like rolling|quick upgrades and Kerberos
> > support
> >Provides other applications a restful endpoint to change
> configurations
> >We get a force multiplier from the Ambari devs
> >
> > The working description Matt provided is completely consistent with my
> > understanding of how it works (derived from the Ambari docs, authoring
> > pieces of the mpack and interacting with some Ambari devs). Restarting
> > Ambari agent is the only circumstance I'm aware of outside of
> > save/start|restart that would initiate a re-write of the configs and
> cache;
> > there could be others.
> >
> > -D...
> >
> > On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley  wrote:
> >
> > > Mike, could you try again on the image, please, making sure it is a
> > simple
> > > format (gif, png, or jpeg)?  It got munched, at least in my viewer.
> > Thanks.
> > >
> > > Casey, responding to some of the questions you raised:
> > >
> > > I’m going to make a rather strong statement:  We already have a service
> > > “to intermediate and handle config update/retrieval”.
> > > Furthermore, it:
> > > - Correctly handles the problems of distributed services running on
> > > multi-node clusters.  (That’s a HARD problem, people, and we shouldn’t
> > try
> > > to reinvent the wheel.)
> > > - Correctly handles Kerberos security. (That’s kinda hard too, or at
> > least
> > > a lot of work.)
> > > - It does automatic versioning of configurations, and allows viewing,
> > > comparing, and reverting historical configs
> > > - It has a capable REST API for all those things.
> > > It doesn’t natively integrate Zookeeper storage of configs, but there
> is
> > a
> > > natural place to specify copy to/from Zookeeper for the files desired.
> > >
> > > It is Ambari.  And we should commit to it, rather than try to re-create
> > > such features.
> > > Because it has a good REST API, it is perfectly feasible to implement
> > > Stellar functions that call it.
> > > GUI configuration tools can also use the Ambari APIs, or better yet be
> > > integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler
> > > Configuration Tool” example in the Ambari documentation, under “Using
> > > Ambari Views”.)
> > >
> > > Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and
> Not
> > > spending weeks and weeks of developer time over the next year
> reinventing
> > > the wheel while getting details wrong multiple times…
> > >
> > > Okay, off soapbox.
> > >
> > > Casey asked what the config update behavior of Ambari is, and how it
> will
> > > interact with changes made from outside Ambari.
> > > The following is from my experience working with the Ambari Mpack for
> > > Metron.  I am not otherwise an Ambari expert, so tomorrow I’ll get it
> > > reviewed by an Ambari development engineer.
> > >
> > > Ambari-server runs on one node, and Ambari-agent runs on each of all
> the
> > > nodes.
> > > Ambari-server has a private set of py, xml, and template files, which
> > > together are used both to generate the Ambari configuration GUI, with
> > > defaults, and to generate configuration files (of any needed filetype)
> > for
> > > the various Stack components.
> > > Ambari-server also has a database where it stores the schema related 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread David Lyle
That's exactly correct, Casey. Basically, an expansion of what we're
currently doing with global.json, enrichment.properties and
elasticsearch.properties.
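
(Concretely, those are the files Ambari already templates out today -
global.json being the canonical example. A sketch with the usual dev
defaults, not a reference:

{
  "es.clustername": "metron",
  "es.ip": "node1",
  "es.port": "9300",
  "es.date.format": "yyyy.MM.dd.HH"
}

expanded to cover the per-sensor configs as well.)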

-D...


On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella  wrote:

> I would suggest not having Ambari replace zookeeper.  I think the proposal
> is to have Ambari replace the editable store (like the JSON files on
> disk).  Zookeeper would be the source of truth for the running topologies
> and ambari would be sync'd to it.
>
> Correct me if I misspeak, Dave or Matt.
>
> Casey
>
> On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen  wrote:
>
> > Ambari seems like a logical choice.
> >
> > *>> It doesn’t natively integrate Zookeeper storage of configs, but there
> > is a natural place to specify copy to/from Zookeeper for the files
> > desired.*
> >
> > How would Ambari interact with Zookeeper in this scenario?  Would Ambari
> > replace Zookeeper completely? Or would Zookeeper act as the persistence
> > tier under Ambari?
> >
> >
> >
> >
> > On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley  wrote:
> >
> > > Mike, could you try again on the image, please, making sure it is a
> > simple
> > > format (gif, png, or jpeg)?  It got munched, at least in my viewer.
> > Thanks.
> > >
> > > Casey, responding to some of the questions you raised:
> > >
> > > I’m going to make a rather strong statement:  We already have a service
> > > “to intermediate and handle config update/retrieval”.
> > > Furthermore, it:
> > > - Correctly handles the problems of distributed services running on
> > > multi-node clusters.  (That’s a HARD problem, people, and we shouldn’t
> > try
> > > to reinvent the wheel.)
> > > - Correctly handles Kerberos security. (That’s kinda hard too, or at
> > least
> > > a lot of work.)
> > > - It does automatic versioning of configurations, and allows viewing,
> > > comparing, and reverting historical configs
> > > - It has a capable REST API for all those things.
> > > It doesn’t natively integrate Zookeeper storage of configs, but there
> is
> > a
> > > natural place to specify copy to/from Zookeeper for the files desired.
> > >
> > > It is Ambari.  And we should commit to it, rather than try to re-create
> > > such features.
> > > Because it has a good REST API, it is perfectly feasible to implement
> > > Stellar functions that call it.
> > > GUI configuration tools can also use the Ambari APIs, or better yet be
> > > integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler
> > > Configuration Tool” example in the Ambari documentation, under “Using
> > > Ambari Views”.)
> > >
> > > Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and
> Not
> > > spending weeks and weeks of developer time over the next year
> reinventing
> > > the wheel while getting details wrong multiple times…
> > >
> > > Okay, off soapbox.
> > >
> > > Casey asked what the config update behavior of Ambari is, and how it
> will
> > > interact with changes made from outside Ambari.
> > > The following is from my experience working with the Ambari Mpack for
> > > Metron.  I am not otherwise an Ambari expert, so tomorrow I’ll get it
> > > reviewed by an Ambari development engineer.
> > >
> > > Ambari-server runs on one node, and Ambari-agent runs on each of all
> the
> > > nodes.
> > > Ambari-server has a private set of py, xml, and template files, which
> > > together are used both to generate the Ambari configuration GUI, with
> > > defaults, and to generate configuration files (of any needed filetype)
> > for
> > > the various Stack components.
> > > Ambari-server also has a database where it stores the schema related to
> > > these files, so even if you reach in and edit Ambari’s files, it will
> > error
> > > out if the set of parameters or parameter names changes.  The
> historical
> > > information about configuration changes is also stored in the db.
> > > For each component (and in the case of Metron, for each topology),
> there
> > > is a python file which controls the logic for these actions, among
> > others:
> > > - Install
> > > - Start / stop / restart / status
> > > - Configure
> > >
> > > It is actually up to this python code (which we wrote for the Metron
> > > Mpack) what happens in each of these API calls.  But the current code,
> > and
> > > I believe this is typical of Ambari-managed components, performs a
> > > “Configure” action whenever you press the “Save” button after changing
> a
> > > component config in Ambari, and also on each Install and Start or
> > Restart.
> > >
> > > The Configure action consists of approximately the following sequence
> > (see
> > > disclaimer above :-)
> > > - Recreate the generated config files, using the template files and the
> > > actual configuration most recently set in Ambari
> > > o Note this is also under the control of python code that we wrote, and
> > > this is the appropriate place to push to ZK if desired.
> > > - Propagate those config files to 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Casey Stella
So, the reason to have the push operations push to ambari and then have
ambari sync to zookeeper (btw: is this possible, do we have a hook like
that in ambari?) is to make sure that users can specify a comment about
what changed, correct?  If we pushed to zookeeper and had ambari listen
(not sure it can do that either, btw) and update itself, we wouldn't be
able to specify reasons.

Casey

On Fri, Jan 13, 2017 at 9:09 AM, David Lyle  wrote:

> The only tooling I'm aware of that Ambari isn't already using is the
> Stellar stuff, is there more?
>
> Regardless, I'd always push from Ambari to zookeeper and let other tooling
> talk to Ambari (Casey's first bullet). The only wrinkle is we have to
> decide if we want to support manual installation. Fwiw, I do. If we did,
> we'd need to do a bit of mode selection to support both. But the happy path
> would be to do stuff (human or machine) via Ambari.
>
> -D...
>
>
> On Fri, Jan 13, 2017 at 9:01 AM, Casey Stella  wrote:
>
> > Just piling on in support for Ambari.  I really, really don't like
> > reinventing wheels, especially hard ones.  I guess my questions now are
> > mainly around technical feasibility.  Seems to me that we can either:
> >
> >- retrofit the tooling that currently manages configs to use the
> Ambari
> >APIs as well as pushing to zookeeper
> >- have a service listening to zookeeper and pushing changes to ambari
> to
> >keep it in sync
> >- Something that I may have missed
> >
> > Each of those has pros and cons.  Thoughts?
> >
> > Casey
> >
> > On Fri, Jan 13, 2017 at 8:53 AM, David Lyle 
> wrote:
> >
> > > I'm in complete agreement with all the points Matt made. I think the
> way
> > > forward should be to expose ALL user-modifiable configs via Ambari and
> > let
> > > Ambari actively manage them. We should keep the command line tools as
> the
> > > backend and Ambari should continue to leverage them. This will allow
> > manual
> > > installation/management if desired and will ensure the command line
> > scripts
> > > are kept up to date.
> > >
> > > Fully leveraging Ambari has many beneficial effects. My top four:
> > >Provides proper revision control for the configurations
> > >Scales easily into things like rolling|quick upgrades and Kerberos
> > > support
> > >Provides other applications a restful endpoint to change
> > configurations
> > >We get a force multiplier from the Ambari devs
> > >
> > > The working description Matt provided is completely consistent with my
> > > understanding of how it works (derived from the Ambari docs, authoring
> > > pieces of the mpack and interacting with some Ambari devs). Restarting
> > > Ambari agent is the only circumstance I'm aware of outside of
> > > save/start|restart that would initiate a re-write of the configs and
> > cache,
> > > there could be others.
> > >
> > > -D...
> > >
> > > On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley  wrote:
> > >
> > > > Mike, could you try again on the image, please, making sure it is a
> > > simple
> > > > format (gif, png, or jpeg)?  It got munched, at least in my viewer.
> > > Thanks.
> > > >
> > > > Casey, responding to some of the questions you raised:
> > > >
> > > > I’m going to make a rather strong statement:  We already have a
> service
> > > > “to intermediate and handle config update/retrieval”.
> > > > Furthermore, it:
> > > > - Correctly handles the problems of distributed services running on
> > > > multi-node clusters.  (That’s a HARD problem, people, and we
> shouldn’t
> > > try
> > > > to reinvent the wheel.)
> > > > - Correctly handles Kerberos security. (That’s kinda hard too, or at
> > > least
> > > > a lot of work.)
> > > > - It does automatic versioning of configurations, and allows viewing,
> > > > comparing, and reverting historical configs
> > > > - It has a capable REST API for all those things.
> > > > It doesn’t natively integrate Zookeeper storage of configs, but there
> > is
> > > a
> > > > natural place to specify copy to/from Zookeeper for the files
> desired.
> > > >
> > > > It is Ambari.  And we should commit to it, rather than try to
> re-create
> > > > such features.
> > > > Because it has a good REST API, it is perfectly feasible to implement
> > > > Stellar functions that call it.
> > > > GUI configuration tools can also use the Ambari APIs, or better yet
> be
> > > > integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler
> > > > Configuration Tool” example in the Ambari documentation, under “Using
> > > > Ambari Views”.)
> > > >
> > > > Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and
> > Not
> > > > spending weeks and weeks of developer time over the next year
> > reinventing
> > > > the wheel while getting details wrong multiple times…
> > > >
> > > > Okay, off soapbox.
> > > >
> > > > Casey asked what the config update behavior of Ambari is, and how it
> > will
> > > > 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread David Lyle
In the main, yes - I've made some changes:

 - Expand ambari to manage the remaining sensor-specific configs
 - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
   to push to ambari and take an Ambari user/pw and (optionally) reason
 - (Ambari can push to zookeeper, but it requires a service restart, so for
   "live changes" you may want to do both a rest call and a zookeeper
   update from ConfigurationUtils)
   WAS
   Question remains about whether ambari can do the push to zookeeper
   or whether ConfigurationUtils has to push to zookeeper as well as
   update ambari.
 - Refactor the middleware that Ryan submitted to have the API calls take
   an Ambari user/pw and (optionally) reason
 - Refactor the management UI to pass in an Ambari user/pw and
   (optionally) reason
 - Refactor the Stellar Management functions CONFIG_PUT to accept an
   Ambari user/pw and (optionally) reason

I think we'd need to do some detailed design around how to handle what we
expect to be dynamic configs, but the main principle should (imo) be to
always know who and why and make sure that Ambari is aware and is the
static backing store for Zookeeper.
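
To make the "who and why" concrete: each Ambari config change creates a
new desired_config version, and that API carries a note field the reason
could ride in on. A rough sketch of the PUT body against
/api/v1/clusters/{cluster} - the config type and property names here are
made up, so verify the exact shape against the Ambari version in use:

{
  "Clusters" : {
    "desired_config" : {
      "type" : "metron-parsers-env",
      "tag" : "version1484311200000",
      "properties" : { "parser_config" : "..." },
      "service_config_version_note" : "raised bro parallelism for ingest spike"
    }
  }
}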

-D...


On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella  wrote:

> So, basically, your proposed changes, broken into tangible gobbets of work:
>
>- Expand ambari to manage the remaining sensor-specific configs
>- Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
>to push to ambari and take a reason
>   - Question remains about whether ambari can do the push to zookeeper
>   or whether ConfigurationUtils has to push to zookeeper as well as
> update
>   ambari.
>- Refactor the middleware that Ryan submitted to have the API calls take
>a reason
>- Refactor the management UI to pass in a reason
>- Refactor the Stellar Management functions CONFIG_PUT to accept a
> reason
>
> Just so we can evaluate it and I can ensure I haven't overlooked some
> important point.  Please tell me if Ambari cannot do the things we're
> suggesting it can do.
>
> Casey
>
> On Fri, Jan 13, 2017 at 9:15 AM, David Lyle  wrote:
>
> > That's exactly correct, Casey. Basically, an expansion of what we're
> > currently doing with global.json, enrichment.properties and
> > elasticsearch.properties.
> >
> > -D...
> >
> >
> > On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella 
> wrote:
> >
> > > I would suggest not having Ambari replace zookeeper.  I think the
> > proposal
> > > is to have Ambari replace the editable store (like the JSON files on
> > > disk).  Zookeeper would be the source of truth for the running
> topologies
> > > and ambari would be sync'd to it.
> > >
> > > Correct me if I misspeak, Dave or Matt.
> > >
> > > Casey
> > >
> > > On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen 
> wrote:
> > >
> > > > Ambari seems like a logical choice.
> > > >
> > > > *>> It doesn’t natively integrate Zookeeper storage of configs, but
> > there
> > > > is a natural place to specify copy to/from Zookeeper for the files
> > > > desired.*
> > > >
> > > > How would Ambari interact with Zookeeper in this scenario?  Would
> > Ambari
> > > > replace Zookeeper completely? Or would Zookeeper act as the
> persistence
> > > > tier under Ambari?
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley 
> wrote:
> > > >
> > > > > Mike, could you try again on the image, please, making sure it is a
> > > > simple
> > > > > format (gif, png, or jpeg)?  It got munched, at least in my viewer.
> > > > Thanks.
> > > > >
> > > > > Casey, responding to some of the questions you raised:
> > > > >
> > > > > I’m going to make a rather strong statement:  We already have a
> > service
> > > > > “to intermediate and handle config update/retrieval”.
> > > > > Furthermore, it:
> > > > > - Correctly handles the problems of distributed services running on
> > > > > multi-node clusters.  (That’s a HARD problem, people, and we
> > shouldn’t
> > > > try
> > > > > to reinvent the wheel.)
> > > > > - Correctly handles Kerberos security. (That’s kinda hard too, or
> at
> > > > least
> > > > > a lot of work.)
> > > > > - It does automatic versioning of configurations, and allows
> viewing,
> > > > > comparing, and reverting historical configs
> > > > > - It has a capable REST API for all those things.
> > > > > It doesn’t natively integrate Zookeeper storage of configs, but
> there
> > > is
> > > > a
> > > > > natural place to specify copy to/from Zookeeper for the files
> > desired.
> > > > >
> > > > > It is Ambari.  And we should commit to it, rather than try to
> > re-create
> > > > > such features.
> > > > > Because it has a good REST API, it is perfectly feasible to
> implement
> > > > > Stellar functions that call it.
> > > > > GUI configuration tools can also use the Ambari APIs, or better yet
> > be
> > > > > integrated in an “Ambari View”. (Eg, see the 

[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-13 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@JonZeolla Yes it is!  Whoops, my bad.  I guess my JIRA search-fu isn't as 
good as I thought.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Nick Allen
Let me ask a stupid question.  What does Zookeeper do for us that Ambari
cannot?  Why keep Zookeeper in the mix?



On Fri, Jan 13, 2017 at 9:28 AM, David Lyle  wrote:

> In the main yes- I've made some changes:
>
>  - Expand ambari to manage the remaining sensor-specific configs
>  - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
>to push to ambari and take an Ambari user/pw and (optionally) reason
>  - (Ambari can push to zookeeper, but it requires a service restart, so for
> "live changes" you may
> want do both a rest call and zookeeper update from ConfigurationUtils)
> WAS
> Question remains about whether ambari can do the push to zookeeper
> or whetheror whether ConfigurationUtils has to push to zookeeper as
> well as update
> ambari.
>   - Refactor the middleware that Ryan submitted to have the API calls take
>  an Ambari user/pw and (optionally) reason
>   - Refactor the management UI to pass in an Ambari user/pw and
> (optionally) reason
>   - Refactor the Stellar Management functions CONFIG_PUT to accept an
> Ambari user/pw and (optionally) reason
>
> I think we'd need to do some detailed design around how to handle what we
> expect to be dynamic configs, but the main principle should (imo) be to
> always know who and why and make sure that Ambari is aware and is the
> static backing store for Zookeeper.
>
> -D...
>
>
> On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella  wrote:
>
> > So, basically, your proposed changes, broken into tangible gobbets of
> work:
> >
> >- Expand ambari to manage the remaining sensor-specific configs
> >- Refactor the push calls to zookeeper (in ConfigurationUtils, I
> think)
> >to push to ambari and take a reason
> >   - Question remains about whether ambari can do the push to
> zookeeper
> >   or whether ConfigurationUtils has to push to zookeeper as well as
> > update
> >   ambari.
> >- Refactor the middleware that Ryan submitted to have the API calls
> take
> >a reason
> >- Refactor the management UI to pass in a reason
> >- Refactor the Stellar Management functions CONFIG_PUT to accept a
> > reason
> >
> > Just so we can evaluate it and I can ensure I haven't overlooked some
> > important point.  Please tell me if Ambari cannot do the things we're
> > suggesting it can do.
> >
> > Casey
> >
> > On Fri, Jan 13, 2017 at 9:15 AM, David Lyle 
> wrote:
> >
> > > That's exactly correct, Casey. Basically, an expansion of what we're
> > > currently doing with global.json, enrichment.properties and
> > > elasticsearch.properties.
> > >
> > > -D...
> > >
> > >
> > > On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella 
> > wrote:
> > >
> > > > I would suggest not having Ambari replace zookeeper.  I think the
> > > proposal
> > > > is to have Ambari replace the editable store (like the JSON files on
> > > > disk).  Zookeeper would be the source of truth for the running
> > topologies
> > > > and ambari would be sync'd to it.
> > > >
> > > > Correct me if I misspeak, Dave or Matt.
> > > >
> > > > Casey
> > > >
> > > > On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen 
> > wrote:
> > > >
> > > > > Ambari seems like a logical choice.
> > > > >
> > > > > *>> It doesn’t natively integrate Zookeeper storage of configs, but
> > > there
> > > > > is a natural place to specify copy to/from Zookeeper for the files
> > > > > desired.*
> > > > >
> > > > > How would Ambari interact with Zookeeper in this scenario?  Would
> > > Ambari
> > > > > replace Zookeeper completely? Or would Zookeeper act as the
> > persistence
> > > > > tier under Ambari?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley 
> > wrote:
> > > > >
> > > > > > Mike, could you try again on the image, please, making sure it
> is a
> > > > > simple
> > > > > > format (gif, png, or jpeg)?  It got munched, at least in my
> viewer.
> > > > > Thanks.
> > > > > >
> > > > > > Casey, responding to some of the questions you raised:
> > > > > >
> > > > > > I’m going to make a rather strong statement:  We already have a
> > > service
> > > > > > “to intermediate and handle config update/retrieval”.
> > > > > > Furthermore, it:
> > > > > > - Correctly handles the problems of distributed services running
> on
> > > > > > multi-node clusters.  (That’s a HARD problem, people, and we
> > > shouldn’t
> > > > > try
> > > > > > to reinvent the wheel.)
> > > > > > - Correctly handles Kerberos security. (That’s kinda hard too, or
> > at
> > > > > least
> > > > > > a lot of work.)
> > > > > > - It does automatic versioning of configurations, and allows
> > viewing,
> > > > > > comparing, and reverting historical configs
> > > > > > - It has a capable REST API for all those things.
> > > > > > It doesn’t natively integrate Zookeeper storage of configs, but
> > there
> > > > is
> > > > > a
> > > > > > natural 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com
Darn it Nick, you beat me to the punch.  =)  YES, please.  I think I
discussed this a while back in my ES tuning conversations, but that's
_super_ important.  I have this documented here under Elasticsearch > On
Installation > 4.

Also, I'm a huge fan of option one.  Here's how that would pan out in my
environment almost immediately:

I typically don't want to store POST data, so as it currently sits that
means I don't write the details of POSTs at all.  However it may make sense
for me to collect the POST data off the wire and pass it through Metron,
and once it gets through enrichment/threat intel, and if one of the IPs is
noteworthy, only then do I store it; otherwise it gets tossed.  That gives me
a nice mix of privacy/security for my user population but also the
information I need to respond to possible incidents.  This could look like
holding onto POSTs being used to manipulate web shells (known bad sources,
known compromised hosts (as a very short term IR information gathering
procedure)), or users POSTing their creds to a plaintext phishing site.
Happy to port this discussion to a separate thread.

Regarding fields - I'm for slightly more standardization, without going
overboard.  This could be a long discussion, but in summary my opinion is
that user is a very sane field to
add, and URL is slightly less so.  Again, I'm thinking about this in the
context of what data I have going into my cluster (and that I know others
are sending to theirs), which may contrast with a more general infosec
population.

Jon

On Fri, Jan 13, 2017 at 9:40 AM Casey Stella  wrote:

> OH that's a good idea!
>
> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
>
> > I like the "Index Filtering" option based on the flexibility that it
> > provides.  Should each output (HDFS, ES, etc) have its own configuration
> > settings?  For example, aren't things like batching handled separately
> for
> > HDFS versus Elasticsearch?
> >
> > Something along the lines of...
> >
> > {
> >   "hdfs" : {
> > "when": "exists(field1)",
> > "batchSize": 100
> >   },
> >
> >   "elasticsearch" : {
> > "when": "true",
> > "batchSize": 1000,
> > "index": "squid"
> >   }
> > }
> >
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> wrote:
> >
> > > Yeah, I tend to like the first option too.  Any opposition to that from
> > > anyone?
> > >
> > > The points brought up are good ones and I think that it may be worth a
> > > broader discussion of the requirements of indexing in a separate dev
> list
> > > thread.  Maybe a list of desires with coherent use-cases justifying
> them
> > so
> > > we can think about how this stuff should work and where the natural
> > > extension points should be.  After all, we need to toe the line between
> > > engineering and overengineering for features nobody will want.
> > >
> > > I'm not sure about the extensions to the standard fields.  I'm torn
> > between
> > > the notions that we should have no standard fields vs we should have a
> > > boatload of standard fields (with most of them empty).  I exchange
> > > positions fairly regularly on that question. ;)  It may be worth a dev
> > list
> > > discussion to lay out how you imagine an extension of standard fields
> and
> > > how it might look as implemented in Metron.
> > >
> > > Casey
> > >
> > > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > > kylerichards...@gmail.com>
> > > wrote:
> > >
> > > > I'll second my preference for the first option. I think the ability
> to
> > > use
> > > > Stellar filters to customize indexing would be a big win.
> > > >
> > > > I'm glad Matt brought up the point about data lake and CEP. I think
> > this
> > > is
> > > > a really important use case that we need to consider. Take a simple
> > > > example... If I have data coming in from 3 different firewall vendors
> > > and 2
> > > > different web proxy/url filtering vendors and I want to be able to
> > > analyze
> > > > that data set, I need the data to be indexed all together (likely in
> > > HDFS)
> > > > and to have a normalized schema such that IP address, URL, and user
> > name
> > > > (to take a few) can be easily queried and aggregated. I can also
> > envision
> > > > scenarios where I would want to index data based on attributes other
> > than
> > > > sensor, business unit or subsidiary for example.
> > > >
> > > > I've been wanting to propose extending our 7 standard fields to
> include
> > > > things like URL and user. Is there community interest/support for
> > moving
> > > in
> > > > that direction? If so, I'll start a new thread.
> > > >
> > > > Thanks!
> > > >
> > > > -Kyle
> > > >
> > > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley 
> wrote:
> > > >
> > > > > Ah, I see.  If overriding the default index name 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Casey Stella
Just piling on in support for Ambari.  I really, really don't like
reinventing wheels, especially hard ones.  I guess my questions now are
mainly around technical feasibility.  Seems to me that we can either:

   - retrofit the tooling that currently manages configs to use the Ambari
   APIs as well as pushing to zookeeper
   - have a service listening to zookeeper and pushing changes to ambari to
   keep it in sync
   - Something that I may have missed

Each of those has pros and cons.  Thoughts?

Casey

On Fri, Jan 13, 2017 at 8:53 AM, David Lyle  wrote:

> I'm in complete agreement with all the points Matt made. I think the way
> forward should be to expose ALL user-modifiable configs via Ambari and let
> Ambari actively manage them. We should keep the command line tools as the
> backend and Ambari should continue to leverage them. This will allow manual
> installation/management if desired and will ensure the command line scripts
> are kept up to date.
>
> Fully leveraging Ambari has many beneficial effects. My top four:
>Provides proper revision control for the configurations
>Scales easily into things like rolling|quick upgrades and Kerberos
> support
>Provides other applications a restful endpoint to change configurations
>We get a force multiplier from the Ambari devs
>
> The working description Matt provided is completely consistent with my
> understanding of how it works (derived from the Ambari docs, authoring
> pieces of the mpack and interacting with some Ambari devs). Restarting
> Ambari agent is the only circumstance I'm aware of outside of
> save/start|restart that would initiate a re-write of the configs and cache;
> there could be others.
>
> -D...
>
> On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley  wrote:
>
> > Mike, could you try again on the image, please, making sure it is a
> simple
> > format (gif, png, or jpeg)?  It got munched, at least in my viewer.
> Thanks.
> >
> > Casey, responding to some of the questions you raised:
> >
> > I’m going to make a rather strong statement:  We already have a service
> > “to intermediate and handle config update/retrieval”.
> > Furthermore, it:
> > - Correctly handles the problems of distributed services running on
> > multi-node clusters.  (That’s a HARD problem, people, and we shouldn’t
> try
> > to reinvent the wheel.)
> > - Correctly handles Kerberos security. (That’s kinda hard too, or at
> least
> > a lot of work.)
> > - It does automatic versioning of configurations, and allows viewing,
> > comparing, and reverting historical configs
> > - It has a capable REST API for all those things.
> > It doesn’t natively integrate Zookeeper storage of configs, but there is
> a
> > natural place to specify copy to/from Zookeeper for the files desired.
> >
> > It is Ambari.  And we should commit to it, rather than try to re-create
> > such features.
> > Because it has a good REST API, it is perfectly feasible to implement
> > Stellar functions that call it.
> > GUI configuration tools can also use the Ambari APIs, or better yet be
> > integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler
> > Configuration Tool” example in the Ambari documentation, under “Using
> > Ambari Views”.)
> >
> > Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and Not
> > spending weeks and weeks of developer time over the next year reinventing
> > the wheel while getting details wrong multiple times…
> >
> > Okay, off soapbox.
> >
> > Casey asked what the config update behavior of Ambari is, and how it will
> > interact with changes made from outside Ambari.
> > The following is from my experience working with the Ambari Mpack for
> > Metron.  I am not otherwise an Ambari expert, so tomorrow I’ll get it
> > reviewed by an Ambari development engineer.
> >
> > Ambari-server runs on one node, and Ambari-agent runs on each of all the
> > nodes.
> > Ambari-server has a private set of py, xml, and template files, which
> > together are used both to generate the Ambari configuration GUI, with
> > defaults, and to generate configuration files (of any needed filetype)
> for
> > the various Stack components.
> > Ambari-server also has a database where it stores the schema related to
> > these files, so even if you reach in and edit Ambari’s files, it will
> error
> > out if the set of parameters or parameter names changes.  The historical
> > information about configuration changes is also stored in the db.
> > For each component (and in the case of Metron, for each topology), there
> > is a python file which controls the logic for these actions, among
> others:
> > - Install
> > - Start / stop / restart / status
> > - Configure
> >
> > It is actually up to this python code (which we wrote for the Metron
> > Mpack) what happens in each of these API calls.  But the current code,
> and
> > I believe this is typical of Ambari-managed components, performs a
> > “Configure” action whenever you press the 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Casey Stella
So, basically, your proposed changes, broken into tangible gobbets of work:

   - Expand ambari to manage the remaining sensor-specific configs
   - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
   to push to ambari and take a reason
  - Question remains about whether ambari can do the push to zookeeper
  or whether ConfigurationUtils has to push to zookeeper as well as update
  ambari.
   - Refactor the middleware that Ryan submitted to have the API calls take
   a reason
   - Refactor the management UI to pass in a reason
   - Refactor the Stellar Management functions CONFIG_PUT to accept a reason

Just so we can evaluate it and I can ensure I haven't overlooked some
important point.  Please tell me if Ambari cannot do the things we're
suggesting it can do.
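
For the last bullet, the change is small in shape - today's management
function is CONFIG_PUT(type, config[, sensor]), so the proposal would just
hang the extra arguments off the end. A hypothetical call from the Stellar
REPL, with the added arguments invented for illustration:

CONFIG_PUT('PARSER', config, 'bro', 'jdoe', 'tightened timestamp parsing')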

Casey

On Fri, Jan 13, 2017 at 9:15 AM, David Lyle  wrote:

> That's exactly correct, Casey. Basically, an expansion of what we're
> currently doing with global.json, enrichment.properties and
> elasticsearch.properties.
>
> -D...
>
>
> On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella  wrote:
>
> > I would suggest not having Ambari replace zookeeper.  I think the
> proposal
> > is to have Ambari replace the editable store (like the JSON files on
> > disk).  Zookeeper would be the source of truth for the running topologies
> > and ambari would be sync'd to it.
> >
> > Correct me if I misspeak, Dave or Matt.
> >
> > Casey
> >
> > On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen  wrote:
> >
> > > Ambari seems like a logical choice.
> > >
> > > *>> It doesn’t natively integrate Zookeeper storage of configs, but
> there
> > > is a natural place to specify copy to/from Zookeeper for the files
> > > desired.*
> > >
> > > How would Ambari interact with Zookeeper in this scenario?  Would
> Ambari
> > > replace Zookeeper completely? Or would Zookeeper act as the persistence
> > > tier under Ambari?
> > >
> > >
> > >
> > >
> > > On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley  wrote:
> > >
> > > > Mike, could you try again on the image, please, making sure it is a
> > > simple
> > > > format (gif, png, or jpeg)?  It got munched, at least in my viewer.
> > > Thanks.
> > > >
> > > > Casey, responding to some of the questions you raised:
> > > >
> > > > I’m going to make a rather strong statement:  We already have a
> service
> > > > “to intermediate and handle config update/retrieval”.
> > > > Furthermore, it:
> > > > - Correctly handles the problems of distributed services running on
> > > > multi-node clusters.  (That’s a HARD problem, people, and we
> shouldn’t
> > > try
> > > > to reinvent the wheel.)
> > > > - Correctly handles Kerberos security. (That’s kinda hard too, or at
> > > least
> > > > a lot of work.)
> > > > - It does automatic versioning of configurations, and allows viewing,
> > > > comparing, and reverting historical configs
> > > > - It has a capable REST API for all those things.
> > > > It doesn’t natively integrate Zookeeper storage of configs, but there
> > is
> > > a
> > > > natural place to specify copy to/from Zookeeper for the files
> desired.
> > > >
> > > > It is Ambari.  And we should commit to it, rather than try to
> re-create
> > > > such features.
> > > > Because it has a good REST API, it is perfectly feasible to implement
> > > > Stellar functions that call it.
> > > > GUI configuration tools can also use the Ambari APIs, or better yet
> be
> > > > integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler
> > > > Configuration Tool” example in the Ambari documentation, under “Using
> > > > Ambari Views”.)
> > > >
> > > > Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and
> > Not
> > > > spending weeks and weeks of developer time over the next year
> > reinventing
> > > > the wheel while getting details wrong multiple times…
> > > >
> > > > Okay, off soapbox.
> > > >
> > > > Casey asked what the config update behavior of Ambari is, and how it
> > will
> > > > interact with changes made from outside Ambari.
> > > > The following is from my experience working with the Ambari Mpack for
> > > > Metron.  I am not otherwise an Ambari expert, so tomorrow I’ll get it
> > > > reviewed by an Ambari development engineer.
> > > >
> > > > Ambari-server runs on one node, and Ambari-agent runs on each of all
> > the
> > > > nodes.
> > > > Ambari-server has a private set of py, xml, and template files, which
> > > > together are used both to generate the Ambari configuration GUI, with
> > > > defaults, and to generate configuration files (of any needed
> filetype)
> > > for
> > > > the various Stack components.
> > > > Ambari-server also has a database where it stores the schema related
> to
> > > > these files, so even if you reach in and edit Ambari’s files, it will
> > > error
> > > > out if the set of parameters or parameter names changes.  The
> > historical
> > > > information about configuration changes is 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen
I like the "Index Filtering" option based on the flexibility that it
provides.  Should each output (HDFS, ES, etc.) have its own configuration
settings?  For example, aren't things like batching handled separately for
HDFS versus Elasticsearch?

Something along the lines of...

{
  "hdfs" : {
"when": "exists(field1)",
"batchSize": 100
  },

  "elasticsearch" : {
"when": "true",
"batchSize": 1000,
"index": "squid"
  }
}
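
And turning a writer off entirely falls out as a degenerate filter - a
sketch:

{
  "hdfs" : {
    "when": "false"
  },

  "elasticsearch" : {
    "when": "true",
    "batchSize": 1000
  }
}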








On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  wrote:

> Yeah, I tend to like the first option too.  Any opposition to that from
> anyone?
>
> The points brought up are good ones and I think that it may be worth a
> broader discussion of the requirements of indexing in a separate dev list
> thread.  Maybe a list of desires with coherent use-cases justifying them so
> we can think about how this stuff should work and where the natural
> extension points should be.  After all, we need to toe the line between
> engineering and overengineering for features nobody will want.
>
> I'm not sure about the extensions to the standard fields.  I'm torn between
> the notions that we should have no standard fields vs we should have a
> boatload of standard fields (with most of them empty).  I exchange
> positions fairly regularly on that question. ;)  It may be worth a dev list
> discussion to lay out how you imagine an extension of standard fields and
> how it might look as implemented in Metron.
>
> Casey
>
> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> kylerichards...@gmail.com>
> wrote:
>
> > I'll second my preference for the first option. I think the ability to
> use
> > Stellar filters to customize indexing would be a big win.
> >
> > I'm glad Matt brought up the point about data lake and CEP. I think this
> is
> > a really important use case that we need to consider. Take a simple
> > example... If I have data coming in from 3 different firewall vendors
> and 2
> > different web proxy/url filtering vendors and I want to be able to
> analyze
> > that data set, I need the data to be indexed all together (likely in
> HDFS)
> > and to have a normalized schema such that IP address, URL, and user name
> > (to take a few) can be easily queried and aggregated. I can also envision
> > scenarios where I would want to index data based on attributes other than
> > sensor, business unit or subsidiary for example.
> >
> > I've been wanting to propose extending our 7 standard fields to include
> > things like URL and user. Is there community interest/support for moving
> in
> > that direction? If so, I'll start a new thread.
> >
> > Thanks!
> >
> > -Kyle
> >
> > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
> >
> > > Ah, I see.  If overriding the default index name allows using the same
> > > name for multiple sensors, then the goal can be achieved.
> > > Thanks,
> > > --Matt
> > >
> > >
> > > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
> > >
> > > Oh, you could!  Let's say you have a syslog parser with data from
> > > sources 1
> > > 2 and 3.  You'd end up with one kafka queue with 3 parsers attached
> > to
> > > that
> > > queue, each picking apart the messages from source 1, 2 and 3.
> They'd
> > > go
> > > through separate enrichment and into the indexing topology.  In the
> > > indexing topology, you could specify the same index name "syslog"
> and
> > > all
> > > of the messages go into the same index for CEP querying if so
> > desired.
> > >
> > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
> > wrote:
> > >
> > > > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > > previous
> > > > life.  It makes perfect sense to route different lines from
> syslog
> > > through
> > > > different appropriate parsers.  But a lot of what the parsers do
> is
> > > > identify consistent subsets of metadata and annotate it – eg,
> > > src_ip_addr,
> > > > event timestamps, etc.  Once those metadata are annotated and
> > > available
> > > > with common field names, why doesn’t it make sense to index the
> > > messages
> > > > together, for CEP querying?  I think Splunk has illustrated this
> > > model.
> > > >
> > > > On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
> > > >
> > > > yeah, I mean, honestly, I think the approach that we've taken
> > for
> > > > sources
> > > > which aggregate different types of data is to provide filters
> > at
> > > the
> > > > parser
> > > > level and have multiple parser topologies (with different,
> > > possibly
> > > > mutually exclusive filters) running.  This would be a
> > completely
> > > > separate
> > > > sensor.  Imagine a syslog data source that aggregates and you
> > > want to
> > > > pick
> > > > apart certain pieces of messages.  This is why the initial
> > > thought and
> > > > 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
One thing that I thought of that I very strenuously do not like in Nick's
proposal is that if a writer config is not specified then it is turned off
(I think; if I misunderstood let me know). In the situation where we have a
new sensor, right now if there is no index config and no enrichment
config, the data still passes through to the index using defaults. In this
new scheme it would not. This changes the default semantics for the system
and I think it changes it for the worse.

I would strongly prefer an on-by-default indexing config as we have now.
On Fri, Jan 13, 2017 at 17:13 Casey Stella  wrote:

> One thing that I really like about Nick's suggestion is that it allows
> writer-specific configs in a clear and simple way.  It is more complex for
> the default case (all writers write to indices named the same thing with a
> fixed batch size), which I do not like, but maybe it's worth the compromise
> to make it less complex for the advanced case.
>
> Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning to
> lean your way.
>
> On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
> wrote:
>
> I like the suggestions you made, Nick.  The only thing I would add is that
> it's also nice to see an explicit when(false), as people newer to the
> platform may not know where to expect configs for the different writers.
> Being able to do it either way, which I think is already assumed in your
> model, would make sense.  I would just suggest that, if we support but are
> disabling a writer, the platform inserts a default when(false) to be
> explicit.
>
> Jon
>
> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella  wrote:
>
> > Let me noodle on this over the weekend.  Your syntax is looking less
> > onerous to me and I like the following statement from Otto: "In the end,
> > each write destination ‘type’ will need its own configuration.  This is
> > an extension point."
> >
> > I may come around to your way of thinking.
> >
> > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler 
> > wrote:
> >
> > > In the end, each write destination ‘type’ will need its own
> > > configuration.  This is an extension point.
> > > {
> > > HDFS:{
> > > outputAdapters:[
> > > {name: avro,
> > > settings:{
> > > avro stuff….
> > > when:{
> > > },
> > > {
> > >  name: sequence file,
> > >  …..
> > >
> > > or some such.
> > >
> > >
> > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org)
> wrote:
> > >
> > > I will add also that instead of global overrides, like index, we should
> > use
> > > configuration key names that are more appropriate to the output.
> > >
> > > For example, does 'index' really make sense for HDFS? Or would 'path'
> be
> > > more appropriate?
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 1
> > > },
> > > 'hdfs': {
> > > 'path': '/foo/bar/...',
> > > 'batchSize': 100
> > > }
> > > }
> > >
> > > Ok, I've said my piece. Thanks for the effort in summarizing all this,
> > > Casey.
> > >
> > >
> > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen 
> wrote:
> > >
> > > > Nick's concerns about my suggestion were that it was overly complex
> and
> > > >> hard to grok and that we could dispense with backwards compatibility
> > and
> > > >> make people do a bit more work on the default case for the benefits
> > of a
> > > >> simpler advanced case. (Nick, make sure I don't misstate your
> > position)
> > > >
> > > >
> > > > I will add that in my mind, the majority case would be a user
> > > > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > > think
> > > > in the majority case, the user would accept whatever the default
> batch
> > > size
> > > > is.
> > > >
> > > > Here are alternatives suggestions for all the examples that you
> > provided
> > > > previously.
> > > >
> > > > Base Case
> > > >
> > > > - The user must always specify the 'outputs' for clarity.
> > > > - Uses default index name, batch size and when = true.
> > > >
> > > > {
> > > > 'elasticsearch': {},
> > > > 'hdfs': {}
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>Writer-non-specific Case
> > > >
> > > > - There are no global overrides, as in Casey's proposal.
> > > > - Easier to grok IMO.
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>Writer-specific case without filters
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 1
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread David Lyle
I'm in complete agreement with all the points Matt made. I think the way
forward should be to expose ALL user-modifiable configs via Ambari and let
Ambari actively manage them. We should keep the command line tools as the
backend and Ambari should continue to leverage them. This will allow manual
installation/management if desired and will ensure the command line scripts
are kept up to date.
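
For concreteness, the CLI backend in question is the config loader script;
a sketch of the kind of call Ambari would keep driving (flags per the
Metron README of this era - double-check against your install):

$METRON_HOME/bin/zk_load_configs.sh -m PUSH -z node1:2181 \
  -i $METRON_HOME/config/zookeeper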

Fully leveraging Ambari has many beneficial effects. My top four:
   Provides proper revision control for the configurations
   Scales easily into things like rolling|quick upgrades and Kerberos
support
   Provides other applications a restful endpoint to change configurations
   We get a force multiplier from the Ambari devs

The working description Matt provided is completely consistent with my
understanding of how it works (derived from the Ambari docs, authoring
pieces of the mpack and interacting with some Ambari devs). Restarting
Ambari agent is the only circumstance I'm aware of outside of
save/start|restart that would initiate a re-write of the configs and cache;
there could be others.

-D...

On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley  wrote:

> Mike, could you try again on the image, please, making sure it is a simple
> format (gif, png, or jpeg)?  It got munched, at least in my viewer.  Thanks.
>
> Casey, responding to some of the questions you raised:
>
> I’m going to make a rather strong statement:  We already have a service
> “to intermediate and handle config update/retrieval”.
> Furthermore, it:
> - Correctly handles the problems of distributed services running on
> multi-node clusters.  (That’s a HARD problem, people, and we shouldn’t try
> to reinvent the wheel.)
> - Correctly handles Kerberos security. (That’s kinda hard too, or at least
> a lot of work.)
> - It does automatic versioning of configurations, and allows viewing,
> comparing, and reverting historical configs
> - It has a capable REST API for all those things.
> It doesn’t natively integrate Zookeeper storage of configs, but there is a
> natural place to specify copy to/from Zookeeper for the files desired.
>
> It is Ambari.  And we should commit to it, rather than try to re-create
> such features.
> Because it has a good REST API, it is perfectly feasible to implement
> Stellar functions that call it.
> GUI configuration tools can also use the Ambari APIs, or better yet be
> integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler
> Configuration Tool” example in the Ambari documentation, under “Using
> Ambari Views”.)
>
> Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and Not
> spending weeks and weeks of developer time over the next year reinventing
> the wheel while getting details wrong multiple times…
>
> Okay, off soapbox.
>
> Casey asked what the config update behavior of Ambari is, and how it will
> interact with changes made from outside Ambari.
> The following is from my experience working with the Ambari Mpack for
> Metron.  I am not otherwise an Ambari expert, so tomorrow I’ll get it
> reviewed by an Ambari development engineer.
>
> Ambari-server runs on one node, and Ambari-agent runs on each of all the
> nodes.
> Ambari-server has a private set of py, xml, and template files, which
> together are used both to generate the Ambari configuration GUI, with
> defaults, and to generate configuration files (of any needed filetype) for
> the various Stack components.
> Ambari-server also has a database where it stores the schema related to
> these files, so even if you reach in and edit Ambari’s files, it will error
> out if the set of parameters or parameter names changes.  The historical
> information about configuration changes is also stored in the db.
> For each component (and in the case of Metron, for each topology), there
> is a python file which controls the logic for these actions, among others:
> - Install
> - Start / stop / restart / status
> - Configure
>
> It is actually up to this python code (which we wrote for the Metron
> Mpack) what happens in each of these API calls.  But the current code, and
> I believe this is typical of Ambari-managed components, performs a
> “Configure” action whenever you press the “Save” button after changing a
> component config in Ambari, and also on each Install and Start or Restart.
>
> The Configure action consists of approximately the following sequence (see
> disclaimer above :-)
> - Recreate the generated config files, using the template files and the
> actual configuration most recently set in Ambari
> o Note this is also under the control of python code that we wrote, and
> this is the appropriate place to push to ZK if desired.
> - Propagate those config files to each Ambari-agent, with a command to set
> them locally
> - The ambari-agents on each node receive the files and write them to the
> specified locations on local storage
>
> Ambari-server then whines that the updated services should be restarted,
> but does not 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
Ok, so here's what I'm thinking based on the discussion:

   - Keeping the configs that we have now (batchSize and index) as defaults
   for the unspecified writer-specific case
   - Adding the config Nick suggested

*Base Case*:
{
}

   - all writers write all messages
   - index named the same as the sensor for all writers
   - batchSize of 1 for all writers

*Writer-non-specific case*:
{
  "index" : "foo"
 ,"batchSize" : 100
}

   - All writers write all messages
   - index is named "foo", different from the sensor for all writers
   - batchSize is 100 for all writers

*Writer-specific case without filters*
{
  "index" : "foo"
 ,"batchSize" : 1
 , "writerConfig" :
   {
  "elasticsearch" : {
   "batchSize" : 100
 }
   }
}

   - All writers write all messages
   - index is named "foo", different from the sensor for all writers
   - batchSize is 1 for HDFS and 100 for elasticsearch writers
   - NOTE: I could override the index name too

*Writer-specific case with filters*
{
  "index" : "foo"
 ,"batchSize" : 1
 , "writerConfig" :
   {
  "elasticsearch" : {
   "batchSize" : 100,
   "when" : "exists(field1)"
 },
  "hdfs" : {
 "when" : "false"
  }
   }
}

   - ES writer writes messages which have field1, HDFS doesn't
   - index is named "foo", different from the sensor for all writers
   - batchSize is 100 for elasticsearch writers

Thoughts?
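
To make the fallback semantics concrete, here is a rough Java sketch of how
a writer could resolve its effective settings.  This is an illustration
only; the class and method names are hypothetical, not the actual Metron
API:

import java.util.Map;

// Hypothetical sketch: a writer-specific value wins, the top-level value is
// the shared default, and the base-case defaults apply when neither is given.
public class IndexingConfigSketch {

  private final Map<String, Object> topLevel;                   // e.g. { "index": "foo", "batchSize": 1 }
  private final Map<String, Map<String, Object>> writerConfig;  // e.g. { "elasticsearch": { "batchSize": 100 } }

  public IndexingConfigSketch(Map<String, Object> topLevel,
                              Map<String, Map<String, Object>> writerConfig) {
    this.topLevel = topLevel;
    this.writerConfig = writerConfig;
  }

  private Object resolve(String writer, String key, Object baseCase) {
    Map<String, Object> specific = writerConfig.get(writer);
    if (specific != null && specific.containsKey(key)) {
      return specific.get(key);                   // writer-specific override
    }
    return topLevel.getOrDefault(key, baseCase);  // shared default, then base case
  }

  public int getBatchSize(String writer) {
    return ((Number) resolve(writer, "batchSize", 1)).intValue();  // base case: 1
  }

  public String getIndex(String writer, String sensor) {
    return (String) resolve(writer, "index", sensor);  // base case: sensor name
  }

  public String getWhen(String writer) {
    return (String) resolve(writer, "when", "true");   // base case: write everything
  }
}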

On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby  wrote:

> For larger installations you need to control what is indexed so you don’t
> end up with a nasty elastic search situation and so you can mine the data
> later for reports and training ml models.
>
> Thanks
> Carolyn
>
>
>
>
> On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
>
> >OH that's a good idea!
> >
> >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> >
> >> I like the "Index Filtering" option based on the flexibility that it
> >> provides.  Should each output (HDFS, ES, etc) have its own configuration
> >> settings?  For example, aren't things like batching handled separately
> for
> >> HDFS versus Elasticsearch?
> >>
> >> Something along the lines of...
> >>
> >> {
> >>   "hdfs" : {
> >> "when": "exists(field1)",
> >> "batchSize": 100
> >>   },
> >>
> >>   "elasticsearch" : {
> >> "when": "true",
> >> "batchSize": 1000,
> >> "index": "squid"
> >>   }
> >> }
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> wrote:
> >>
> >> > Yeah, I tend to like the first option too.  Any opposition to that
> from
> >> > anyone?
> >> >
> >> > The points brought up are good ones and I think that it may be worth a
> >> > broader discussion of the requirements of indexing in a separate dev
> list
> >> > thread.  Maybe a list of desires with coherent use-cases justifying
> them
> >> so
> >> > we can think about how this stuff should work and where the natural
> >> > extension points should be.  After all, we need to toe the line between
> >> > engineering and overengineering for features nobody will want.
> >> >
> >> > I'm not sure about the extensions to the standard fields.  I'm torn
> >> between
> >> > the notions that we should have no standard fields vs we should have a
> >> > boatload of standard fields (with most of them empty).  I exchange
> >> > positions fairly regularly on that question. ;)  It may be worth a dev
> >> list
> >> > discussion to lay out how you imagine an extension of standard fields
> and
> >> > how it might look as implemented in Metron.
> >> >
> >> > Casey
> >> >
> >> > Casey
> >> >
> >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> >> > kylerichards...@gmail.com>
> >> > wrote:
> >> >
> >> > > I'll second my preference for the first option. I think the ability
> to
> >> > use
> >> > > Stellar filters to customize indexing would be a big win.
> >> > >
> >> > > I'm glad Matt brought up the point about data lake and CEP. I think
> >> this
> >> > is
> >> > > a really important use case that we need to consider. Take a simple
> >> > > example... If I have data coming in from 3 different firewall
> vendors
> >> > and 2
> >> > > different web proxy/url filtering vendors and I want to be able to
> >> > analyze
> >> > > that data set, I need the data to be indexed all together (likely in
> >> > HDFS)
> >> > > and to have a normalized schema such that IP address, URL, and user
> >> name
> >> > > (to take a few) can be easily queried and aggregated. I can also
> >> envision
> >> > > scenarios where I would want to index data based on attributes other
> >> than
> >> > > sensor, business unit or subsidiary for example.
> >> > >
> >> > > I've been wanting to propose extending our 7 standard fields to
> include
> >> > > things like URL and user. Is there community 

Re: [PROPOSAL] up-to-date versioned documentation

2017-01-13 Thread Nick Allen
+1 I think it is sorely needed.

If we can come up with a really slick solution like Spark, then great. I am
also not against a half-baked solution that can later evolve into something
else.  For example, create an index README.md that links together all the
existing READMEs and run Pandoc on it.  Not ideal, but way better than what
we have.
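
As a rough sketch of the gathering step itself (an illustration only, not an
actual build script; it assumes the exclusion list from Matt's proposal
below), something like this would emit a markdown index that Pandoc could
consume:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: walk the repo, skip the exclusion list, and print one markdown
// link per README.md found.
public class GatherReadmes {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    List<String> excluded = Arrays.asList("site", "build_utils");
    try (Stream<Path> paths = Files.walk(root)) {
      List<Path> readmes = paths
          .filter(p -> p.getFileName() != null
                       && p.getFileName().toString().equals("README.md"))
          .filter(p -> excluded.stream()
                       .noneMatch(e -> p.startsWith(root.resolve(e))))
          .sorted()  // lexicographic order approximates the depth-first listing
          .collect(Collectors.toList());
      for (Path p : readmes) {
        Path rel = root.relativize(p);
        System.out.println("* [" + rel + "](" + rel + ")");
      }
    }
  }
}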



On Fri, Jan 13, 2017 at 9:53 AM, Otto Fowler 
wrote:

> I think something that does what you have laid out here, no matter the
> implementation details, would be ideal.
>
>
> On January 12, 2017 at 18:05:24, Matt Foley (ma...@apache.org) wrote:
>
> We currently have three forms of documentation, with the following
> advantages and disadvantages:
>
> || Docs || Pro || Con ||
> | CWiki |
> Easy to edit, no special tools required, don't have to be a developer to
> contribute, google and wiki search |
> Not versioned, no review process, distant from the code, obsolete content
> tends to accumulate |
> | Site |
> Versioned and reviewed, only committers can edit, google search |
> Yet another arcane toolset must be learned, only web programmers feel
> comfortable contributing, "asf-site" branch not related to code versions,
> distant from the code, tends to go obsolete due to non-maintenance |
> | README.md |
> Versioned and reviewed, only committers can edit, tied to code versions,
> highly local to the code being documented |
> Non-developers don't know about them, may be scared by github, poor scoring
> in google search, no high-level presentation |
>
> Various discussion threads indicate the developer community likes
> README-based docs, and it's easy to see why from the above. I propose this
> extension to the README-based documentation, to address their
> disadvantages:
>
> 1. Produce a script that gathers the README.md files from all code
> subdirectories into a hierarchical list. The script would have an exclusion
> list for non-user-content, which at this point would consist of [site/*,
> build_utils/*]. The hierarchy would be sorted depth-first. The resulting
> hierarchical list at this time (with six added README files to complete the
> hierarchy) would be:
>
> ./README.md
> ./metron-analytics/README.md <== (need file here)
> ./metron-analytics/metron-maas-service/README.md
> ./metron-analytics/metron-profiler/README.md
> ./metron-analytics/metron-profiler-client/README.md
> ./metron-analytics/metron-statistics/README.md
> ./metron-deployment/README.md
> ./metron-deployment/amazon-ec2/README.md
> ./metron-deployment/packaging/README.md <== (need file here)
> ./metron-deployment/packaging/ambari/README.md <== (need file here)
> ./metron-deployment/packaging/docker/ansible-docker/README.md
> ./metron-deployment/packaging/docker/rpm-docker/README.md
> ./metron-deployment/packer-build/README.md
> ./metron-deployment/roles/ <== (need file here)
> ./metron-deployment/roles/kibana/README.md
> ./metron-deployment/roles/monit/README.md
> ./metron-deployment/roles/opentaxii/README.md
> ./metron-deployment/roles/pcap_replay/README.md
> ./metron-deployment/roles/sensor-test-mode/README.md
> ./metron-deployment/vagrant/README.md <== (need file here)
> ./metron-deployment/vagrant/codelab-platform/README.md
> ./metron-deployment/vagrant/fastcapa-test-platform/README.md
> ./metron-deployment/vagrant/full-dev-platform/README.md
> ./metron-deployment/vagrant/quick-dev-platform/README.md
> ./metron-platform/README.md
> ./metron-platform/metron-api/README.md
> ./metron-platform/metron-common/README.md
> ./metron-platform/metron-data-management/README.md
> ./metron-platform/metron-enrichment/README.md
> ./metron-platform/metron-indexing/README.md
> ./metron-platform/metron-management/README.md
> ./metron-platform/metron-parsers/README.md
> ./metron-platform/metron-pcap-backend/README.md
> ./metron-sensors/README.md <== (need file here)
> ./metron-sensors/fastcapa/README.md
> ./metron-sensors/pycapa/README.md
>
> 2. Arrange to run this script as part of the build process, and commit the
> resulting hierarchy list to someplace in the versioned and branched ./site/
> subdirectory.
>
> 3. Produce a "doc reader" web page that takes in this hierarchy of .md
> pages, and presents a LHS doc tree of links, and a main display area for a
> currently selected file. If we want to get fancy, this page would also
> provide: (a) telescoping (collapse/expand) of the doc tree; (b) floating
> next/prev/up/home buttons in the display area.
>
> 4. Add to this web page a pull-down menu that selects among all the
> release versions of Metron, and (if not running in the Apache site) a
> SNAPSHOT version for the current filesystem version (for developer
> preview). Let it re-write the file paths per release version to the proper
> release tag in github. This web page will therefore be version-independent.
> Put it in the asf-site branch of the Apache site, as the new "docs"
> sub-site from the home web page. Update the list of releases at each
> release, or if we want to get fancy, 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Casey Stella
Polling the Ambari server via REST (or their API if they have one) would
entail all workers hitting one server and would create a single point of
failure (the Ambari server is what serves up REST).  Zookeeper's intent is
to not have a single point of failure like this, and one of its main
use-cases is to serve up configs in a distributed environment.

Casey

On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen  wrote:

> Let me ask a stupid question.  What does Zookeeper do for us that Ambari
> cannot?  Why keep Zookeeper in the mix?
>
>
>
> On Fri, Jan 13, 2017 at 9:28 AM, David Lyle  wrote:
>
> > In the main yes- I've made some changes:
> >
> >  - Expand ambari to manage the remaining sensor-specific configs
> >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
> >to push to ambari and take an Ambari user/pw and (optionally) reason
> >  - (Ambari can push to zookeeper, but it requires a service restart, so
> for
> > "live changes" you may
> > want to do both a rest call and zookeeper update from
> ConfigurationUtils)
> > WAS
> > Question remains about whether ambari can do the push to zookeeper
> > or whether ConfigurationUtils has to push to zookeeper as
> > well as update
> > ambari.
> >   - Refactor the middleware that Ryan submitted to have the API calls
> take
> >  an Ambari user/pw and (optionally) reason
> >   - Refactor the management UI to pass in an Ambari user/pw and
> > (optionally) reason
> >   - Refactor the Stellar Management functions CONFIG_PUT to accept an
> > Ambari user/pw and (optionally) reason
> >
> > I think we'd need to do some detailed design around how to handle what we
> > expect to be dynamic configs, but the main principle should (imo) be to
> > always know who and why and make sure that Ambari is aware and is the
> > static backing store for Zookeeper.
> >
> > -D...
> >
> >
> > On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella 
> wrote:
> >
> > > So, basically, your proposed changes, broken into tangible gobbets of
> > work:
> > >
> > >- Expand ambari to manage the remaining sensor-specific configs
> > >- Refactor the push calls to zookeeper (in ConfigurationUtils, I
> > think)
> > >to push to ambari and take a reason
> > >   - Question remains about whether ambari can do the push to
> > zookeeper
> > >   or whether ConfigurationUtils has to push to zookeeper as well as
> > > update
> > >   ambari.
> > >- Refactor the middleware that Ryan submitted to have the API calls
> > take
> > >a reason
> > >- Refactor the management UI to pass in a reason
> > >- Refactor the Stellar Management functions CONFIG_PUT to accept a
> > > reason
> > >
> > > Just so we can evaluate it and I can ensure I haven't overlooked some
> > > important point.  Please tell me if Ambari cannot do the things we're
> > > suggesting it can do.
> > >
> > > Casey
> > >
> > > On Fri, Jan 13, 2017 at 9:15 AM, David Lyle 
> > wrote:
> > >
> > > > That's exactly correct, Casey. Basically, an expansion of what we're
> > > > currently doing with global.json, enrichment.properties and
> > > > elasticsearch.properties.
> > > >
> > > > -D...
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella 
> > > wrote:
> > > >
> > > > > I would suggest not having Ambari replace zookeeper.  I think the
> > > > proposal
> > > > > is to have Ambari replace the editable store (like the JSON files
> on
> > > > > disk).  Zookeeper would be the source of truth for the running
> > > topologies
> > > > > and ambari would be sync'd to it.
> > > > >
> > > > > Correct if I misspeak, dave or matt.
> > > > >
> > > > > Casey
> > > > >
> > > > > On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen 
> > > wrote:
> > > > >
> > > > > > Ambari seems like a logical choice.
> > > > > >
> > > > > > *>> It doesn’t natively integrate Zookeeper storage of configs,
> but
> > > > there
> > > > > > is a natural place to specify copy to/from Zookeeper for the
> files
> > > > > > desired.*
> > > > > >
> > > > > > How would Ambari interact with Zookeeper in this scenario?  Would
> > > > Ambari
> > > > > > replace Zookeeper completely? Or would Zookeeper act as the
> > > persistence
> > > > > > tier under Ambari?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Jan 12, 2017 at 9:24 PM, Matt Foley 
> > > wrote:
> > > > > >
> > > > > > > Mike, could you try again on the image, please, making sure it
> > is a
> > > > > > simple
> > > > > > > format (gif, png, or jpeg)?  It got munched, at least in my
> > viewer.
> > > > > > Thanks.
> > > > > > >
> > > > > > > Casey, responding to some of the questions you raised:
> > > > > > >
> > > > > > > I’m going to make a rather strong statement:  We already have a
> > > > service
> > > > > > > “to intermediate and handle config update/retrieval”.
> > > > > > > Furthermore, it:
> 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler
Like most things, we are best off trying something and iterating.  I just
think we should be aware from the beginning (have tests, etc.) of how it
works when there are many filters.


On January 13, 2017 at 10:11:47, Casey Stella (ceste...@gmail.com) wrote:

I imagined one Stellar statement and if you wanted an "or" in there, you
could put it there.  I was also planning on doing the JSON trick of
accepting either a string or a list of strings to let you do multiline, e.g.
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]

Think that's a bad idea?

Casey

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens?
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. After all, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case that we 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler
We also need to account for the complexity of the statements.


On January 13, 2017 at 10:27:51, Otto Fowler (ottobackwa...@gmail.com)
wrote:

Like most things, we are best off trying something and iterating.  I just
think we should be aware from the beginning (have tests, etc.) of how it
works when there are many filters.


On January 13, 2017 at 10:11:47, Casey Stella (ceste...@gmail.com) wrote:

I imagined one Stellar statement and if you wanted an "or" in there, you
could put it there.  I was also planning on doing the JSON trick of
accepting either a string or a list of strings to let you do multiline, e.g.
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]

Think that's a bad idea?

Casey

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens?
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. After all, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
I imagined one Stellar statement and if you wanted an "or" in there, you
could put it there.  I was also planning on doing the JSON trick of
accepting either a string or a list of strings to let you do multiline, e.g.
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]

Think that's a bad idea?

Casey
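
A minimal Jackson-based sketch of that string-or-list coercion (the helper
name here is made up, not the actual Metron code) might look like:

import com.fasterxml.jackson.databind.JsonNode;

// Sketch: accept "when" as either a single string or an array of strings,
// joining array elements with a space into one Stellar expression.
public class WhenCoercion {
  public static String asWhenExpression(JsonNode whenNode) {
    if (whenNode == null || whenNode.isNull()) {
      return "true";              // default: write everything
    }
    if (whenNode.isTextual()) {
      return whenNode.asText();   // plain string form
    }
    if (whenNode.isArray()) {
      StringBuilder sb = new StringBuilder();
      for (JsonNode part : whenNode) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(part.asText());
      }
      return sb.toString();       // e.g. "exists(field1) or exists(field2)"
    }
    throw new IllegalArgumentException("when must be a string or a list of strings");
  }
}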

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens?
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you
> don’t
> > end up with a nasty elastic search situation and so you can mine the
> data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate
> dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. After all, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case that we need to consider. Take a
> simple
> > >> > > example... If I have data coming in from 3 different firewall
> > vendors
> > >> > and 2
> > >> > > different web proxy/url filtering vendors and I want to be able
> to
> > >> > analyze
> > >> > > that 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Carolyn Duby
ZooKeeper is more efficient if you want to push an update to the topologies
without requiring a restart.  Is this useful going forward?  I think it is for
development, but in production environments you would generally only be
updating during a maintenance window, so requiring a restart is not horrible.

Outside of configuration sharing, ZooKeeper is essential for coordinating
clustered solutions: for example, leader election in an HA cluster or
distributing worker assignments.
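
As one concrete illustration of that coordination point, here is a minimal
leader-election sketch using Apache Curator's LeaderLatch recipe (the
connect string, path, and id are made-up placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch: each worker starts a latch on the same path; ZooKeeper picks one
// leader, and await() blocks until this process is elected.
public class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    try (LeaderLatch latch = new LeaderLatch(client, "/metron/leader", "worker-1")) {
      latch.start();
      latch.await();
      System.out.println("elected leader; doing coordinated work");
    } finally {
      client.close();
    }
  }
}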

Thanks
Carolyn



On 1/13/17, 10:14 AM, "Casey Stella"  wrote:

>Polling the Ambari server via REST (or their API if they have one), would
>entail all workers hitting one server and create a single point of failure
>(the ambari server is what serves up REST).  Zookeeper's intent is to not
>have a single point of failure like this and (one of its main) use-cases is
>to serve up configs in a distributed environment.
>
>Casey
>
>On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen  wrote:
>
>> Let me ask a stupid question.  What does Zookeeper do for us that Ambari
>> cannot?  Why keep Zookeeper in the mix?
>>
>>
>>
>> On Fri, Jan 13, 2017 at 9:28 AM, David Lyle  wrote:
>>
>> > In the main yes- I've made some changes:
>> >
>> >  - Expand ambari to manage the remaining sensor-specific configs
>> >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
>> >to push to ambari and take an Ambari user/pw and (optionally) reason
>> >  - (Ambari can push to zookeeper, but it requires a service restart, so
>> for
>> > "live changes" you may
>> > want to do both a rest call and zookeeper update from
>> ConfigurationUtils)
>> > WAS
>> > Question remains about whether ambari can do the push to zookeeper
>> > or whether ConfigurationUtils has to push to zookeeper as
>> > well as update
>> > ambari.
>> >   - Refactor the middleware that Ryan submitted to have the API calls
>> take
>> >  an Ambari user/pw and (optionally) reason
>> >   - Refactor the management UI to pass in an Ambari user/pw and
>> > (optionally) reason
>> >   - Refactor the Stellar Management functions CONFIG_PUT to accept an
>> > Ambari user/pw and (optionally) reason
>> >
>> > I think we'd need to do some detailed design around how to handle what we
>> > expect to be dynamic configs, but the main principle should (imo) be to
>> > always know who and why and make sure that Ambari is aware and is the
>> > static backing store for Zookeeper.
>> >
>> > -D...
>> >
>> >
>> > On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella 
>> wrote:
>> >
>> > > So, basically, your proposed changes, broken into tangible gobbets of
>> > work:
>> > >
>> > >- Expand ambari to manage the remaining sensor-specific configs
>> > >- Refactor the push calls to zookeeper (in ConfigurationUtils, I
>> > think)
>> > >to push to ambari and take a reason
>> > >   - Question remains about whether ambari can do the push to
>> > zookeeper
>> > >   or whether ConfigurationUtils has to push to zookeeper as well as
>> > > update
>> > >   ambari.
>> > >- Refactor the middleware that Ryan submitted to have the API calls
>> > take
>> > >a reason
>> > >- Refactor the management UI to pass in a reason
>> > >- Refactor the Stellar Management functions CONFIG_PUT to accept a
>> > > reason
>> > >
>> > > Just so we can evaluate it and I can ensure I haven't overlooked some
>> > > important point.  Please tell me if Ambari cannot do the things we're
>> > > suggesting it can do.
>> > >
>> > > Casey
>> > >
>> > > On Fri, Jan 13, 2017 at 9:15 AM, David Lyle 
>> > wrote:
>> > >
>> > > > That's exactly correct, Casey. Basically, an expansion of what we're
>> > > > currently doing with global.json, enrichment.properties and
>> > > > elasticsearch.properties.
>> > > >
>> > > > -D...
>> > > >
>> > > >
>> > > > On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella 
>> > > wrote:
>> > > >
>> > > > > I would suggest not having Ambari replace zookeeper.  I think the
>> > > > proposal
>> > > > > is to have Ambari replace the editable store (like the JSON files
>> on
>> > > > > disk).  Zookeeper would be the source of truth for the running
>> > > topologies
>> > > > > and ambari would be sync'd to it.
>> > > > >
>> > > > > Correct if I misspeak, dave or matt.
>> > > > >
>> > > > > Casey
>> > > > >
>> > > > > On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen 
>> > > wrote:
>> > > > >
>> > > > > > Ambari seems like a logical choice.
>> > > > > >
>> > > > > > *>> It doesn’t natively integrate Zookeeper storage of configs,
>> but
>> > > > there
>> > > > > > is a natural place to specify copy to/from Zookeeper for the
>> files
>> > > > > > desired.*
>> > > > > >
>> > > > > > How would Ambari interact with Zookeeper in this scenario?  Would
>> > > > Ambari
>> > > > > > replace Zookeeper completely? Or would Zookeeper act as the
>> > 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Casey Stella
No, it was good to bring up, Nick.  I might have it wrong re: Ambari.

Casey

On Fri, Jan 13, 2017 at 10:27 AM, Nick Allen  wrote:

> That makes sense.  I wasn't sure based on Matt's original
> suggestion/description of Ambari, whether that was something that Ambari
> had also been designed for or not.
>
> On Fri, Jan 13, 2017 at 10:14 AM, Casey Stella  wrote:
>
> > Polling the Ambari server via REST (or their API if they have one), would
> > entail all workers hitting one server and create a single point of
> failure
> > (the ambari server is what serves up REST).  Zookeeper's intent is to not
> > have a single point of failure like this and (one of its main) use-cases
> is
> > to serve up configs in a distributed environment.
> >
> > Casey
> >
> > On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen  wrote:
> >
> > > Let me ask a stupid question.  What does Zookeeper do for us that
> Ambari
> > > cannot?  Why keep Zookeeper in the mix?
> > >
> > >
> > >
> > > On Fri, Jan 13, 2017 at 9:28 AM, David Lyle 
> > wrote:
> > >
> > > > In the main yes- I've made some changes:
> > > >
> > > >  - Expand ambari to manage the remaining sensor-specific configs
> > > >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I
> > think)
> > > >to push to ambari and take an Ambari user/pw and (optionally)
> reason
> > > >  - (Ambari can push to zookeeper, but it requires a service restart,
> so
> > > for
> > > > "live changes" you may
> > > > want to do both a rest call and zookeeper update from
> > > ConfigurationUtils)
> > > > WAS
> > > > Question remains about whether ambari can do the push to
> zookeeper
> > > > or whether ConfigurationUtils has to push to zookeeper
> as
> > > > well as update
> > > > ambari.
> > > >   - Refactor the middleware that Ryan submitted to have the API calls
> > > take
> > > >  an Ambari user/pw and (optionally) reason
> > > >   - Refactor the management UI to pass in an Ambari user/pw and
> > > > (optionally) reason
> > > >   - Refactor the Stellar Management functions CONFIG_PUT to accept an
> > > > Ambari user/pw and (optionally) reason
> > > >
> > > > I think we'd need to do some detailed design around how to handle
> what
> > we
> > > > expect to be dynamic configs, but the main principle should (imo) be
> to
> > > > always know who and why and make sure that Ambari is aware and is the
> > > > static backing store for Zookeeper.
> > > >
> > > > -D...
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella 
> > > wrote:
> > > >
> > > > > So, basically, your proposed changes, broken into tangible gobbets
> of
> > > > work:
> > > > >
> > > > >- Expand ambari to manage the remaining sensor-specific configs
> > > > >- Refactor the push calls to zookeeper (in ConfigurationUtils, I
> > > > think)
> > > > >to push to ambari and take a reason
> > > > >   - Question remains about whether ambari can do the push to
> > > > zookeeper
> > > > >   or whether ConfigurationUtils has to push to zookeeper as
> well
> > as
> > > > > update
> > > > >   ambari.
> > > > >- Refactor the middleware that Ryan submitted to have the API
> > calls
> > > > take
> > > > >a reason
> > > > >- Refactor the management UI to pass in a reason
> > > > >- Refactor the Stellar Management functions CONFIG_PUT to
> accept a
> > > > > reason
> > > > >
> > > > > Just so we can evaluate it and I can ensure I haven't overlooked
> some
> > > > > important point.  Please tell me if Ambari cannot do the things
> we're
> > > > > suggesting it can do.
> > > > >
> > > > > Casey
> > > > >
> > > > > On Fri, Jan 13, 2017 at 9:15 AM, David Lyle 
> > > > wrote:
> > > > >
> > > > > > That's exactly correct, Casey. Basically, an expansion of what
> > we're
> > > > > > currently doing with global.json, enrichment.properties and
> > > > > > elasticsearch.properties.
> > > > > >
> > > > > > -D...
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella <
> ceste...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > I would suggest not having Ambari replace zookeeper.  I think
> the
> > > > > > proposal
> > > > > > > is to have Ambari replace the editable store (like the JSON
> files
> > > on
> > > > > > > disk).  Zookeeper would be the source of truth for the running
> > > > > topologies
> > > > > > > and ambari would be sync'd to it.
> > > > > > >
> > > > > > > Correct if I misspeak, dave or matt.
> > > > > > >
> > > > > > > Casey
> > > > > > >
> > > > > > > On Fri, Jan 13, 2017 at 9:09 AM, Nick Allen <
> n...@nickallen.org>
> > > > > wrote:
> > > > > > >
> > > > > > > > Ambari seems like a logical choice.
> > > > > > > >
> > > > > > > > *>> It doesn’t natively integrate Zookeeper storage of
> configs,
> > > but
> > > > > > there
> > > > > > > > is a natural place to specify copy to/from Zookeeper for the
> > > 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen
Are you saying we support all of these variants?  I realize you are trying
to have some backwards compatibility, but this also makes it harder for a
user to grok (for me at least).

Personally I like my original example as there are fewer sub-structures,
like 'writerConfig', which makes the whole thing simpler and easier to
grok.  But maybe others will think your proposal is just as easy to grok.



On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:

> Ok, so here's what I'm thinking based on the discussion:
>
>- Keeping the configs that we have now (batchSize and index) as defaults
>for the unspecified writer-specific case
>- Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
>- all writers write all messages
>- index named the same as the sensor for all writers
>- batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
>   "index" : "foo"
>  ,"batchSize" : 100
> }
>
>- All writers write all messages
>- index is named "foo", different from the sensor for all writers
>- batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
>   "index" : "foo"
>  ,"batchSize" : 1
>  , "writerConfig" :
>{
>   "elasticsearch" : {
>"batchSize" : 100
>  }
>}
> }
>
>- All writers write all messages
>- index is named "foo", different from the sensor for all writers
>- batchSize is 1 for HDFS and 100 for elasticsearch writers
>- NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
>   "index" : "foo"
>  ,"batchSize" : 1
>  , "writerConfig" :
>{
>   "elasticsearch" : {
>"batchSize" : 100,
>"when" : "exists(field1)"
>  },
>   "hdfs" : {
>  "when" : "false"
>   }
>}
> }
>
>- ES writer writes messages which have field1, HDFS doesn't
>- index is named "foo", different from the sensor for all writers
>- 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides.  Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings?  For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >>   "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >>   },
> > >>
> > >>   "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >>   }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too.  Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread.  Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be.  After all, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields.  I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty).  I exchange
> > >> > positions fairly regularly on that question. ;)  It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Ryan Merriman
The driver for using Zookeeper is that it is asynchronous and accepts
callbacks.  Ambari would need to have that capability; otherwise we have to
poll, which is a deal breaker in my opinion.
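
To make the callback point concrete, here is a minimal sketch of the
push-style notification ZooKeeper already gives us, using Apache Curator's
NodeCache recipe (the connect string and path are made-up placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch: workers register a listener and get called back when the config
// node changes; no polling of a central REST server is involved.
public class ConfigWatchSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    NodeCache cache = new NodeCache(client, "/metron/sensors/bro");
    cache.getListenable().addListener(() -> {
      if (cache.getCurrentData() != null) {
        String config = new String(cache.getCurrentData().getData());
        System.out.println("config changed: " + config);
        // re-apply the running topology's configuration here
      }
    });
    cache.start(true);             // true primes the cache before returning
    Thread.sleep(Long.MAX_VALUE);  // keep the worker alive for the demo
  }
}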

On Fri, Jan 13, 2017 at 9:28 AM, Casey Stella  wrote:

> No, it was good to bring up, Nick.  I might have it wrong re: Ambari.
>
> Casey
>
> On Fri, Jan 13, 2017 at 10:27 AM, Nick Allen  wrote:
>
> > That makes sense.  I wasn't sure based on Matt's original
> > suggestion/description of Ambari, whether that was something that Ambari
> > had also been designed for or not.
> >
> > On Fri, Jan 13, 2017 at 10:14 AM, Casey Stella 
> wrote:
> >
> > > Polling the Ambari server via REST (or their API if they have one),
> would
> > > entail all workers hitting one server and create a single point of
> > failure
> > > (the ambari server is what serves up REST).  Zookeeper's intent is to
> not
> > > have a single point of failure like this and (one of its main)
> use-cases
> > is
> > > to serve up configs in a distributed environment.
> > >
> > > Casey
> > >
> > > On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen 
> wrote:
> > >
> > > > Let me ask a stupid question.  What does Zookeeper do for us that
> > Ambari
> > > > cannot?  Why keep Zookeeper in the mix?
> > > >
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 9:28 AM, David Lyle 
> > > wrote:
> > > >
> > > > > In the main yes- I've made some changes:
> > > > >
> > > > >  - Expand ambari to manage the remaining sensor-specific configs
> > > > >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I
> > > think)
> > > > >to push to ambari and take an Ambari user/pw and (optionally)
> > reason
> > > > >  - (Ambari can push to zookeeper, but it requires a service
> restart,
> > so
> > > > for
> > > > > "live changes" you may
> > > > > want to do both a rest call and zookeeper update from
> > > > ConfigurationUtils)
> > > > > WAS
> > > > > Question remains about whether ambari can do the push to
> > zookeeper
> > > > > or whether ConfigurationUtils has to push to
> zookeeper
> > as
> > > > > well as update
> > > > > ambari.
> > > > >   - Refactor the middleware that Ryan submitted to have the API
> calls
> > > > take
> > > > >  an Ambari user/pw and (optionally) reason
> > > > >   - Refactor the management UI to pass in an Ambari user/pw and
> > > > > (optionally) reason
> > > > >   - Refactor the Stellar Management functions CONFIG_PUT to accept
> an
> > > > > Ambari user/pw and (optionally) reason
> > > > >
> > > > > I think we'd need to do some detailed design around how to handle
> > what
> > > we
> > > > > expect to be dynamic configs, but the main principle should (imo)
> be
> > to
> > > > > always know who and why and make sure that Ambari is aware and is
> the
> > > > > static backing store for Zookeeper.
> > > > >
> > > > > -D...
> > > > >
> > > > >
> > > > > On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella 
> > > > wrote:
> > > > >
> > > > > > So, basically, your proposed changes, broken into tangible
> gobbets
> > of
> > > > > work:
> > > > > >
> > > > > >- Expand ambari to manage the remaining sensor-specific
> configs
> > > > > >- Refactor the push calls to zookeeper (in
> ConfigurationUtils, I
> > > > > think)
> > > > > >to push to ambari and take a reason
> > > > > >   - Question remains about whether ambari can do the push to
> > > > > zookeeper
> > > > > >   or whether ConfigurationUtils has to push to zookeeper as
> > well
> > > as
> > > > > > update
> > > > > >   ambari.
> > > > > >- Refactor the middleware that Ryan submitted to have the API
> > > calls
> > > > > take
> > > > > >a reason
> > > > > >- Refactor the management UI to pass in a reason
> > > > > >- Refactor the Stellar Management functions CONFIG_PUT to
> > accept a
> > > > > > reason
> > > > > >
> > > > > > Just so we can evaluate it and I can ensure I haven't overlooked
> > some
> > > > > > important point.  Please tell me if Ambari cannot do the things
> > we're
> > > > > > suggesting it can do.
> > > > > >
> > > > > > Casey
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 9:15 AM, David Lyle <
> dlyle65...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > That's exactly correct, Casey. Basically, an expansion of what
> > > we're
> > > > > > > currently doing with global.json, enrichment.properties and
> > > > > > > elasticsearch.properties.
> > > > > > >
> > > > > > > -D...
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 13, 2017 at 9:12 AM, Casey Stella <
> > ceste...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > I would suggest not having Ambari replace zookeeper.  I think
> > the
> > > > > > > proposal
> > > > > > > > is to have Ambari replace the editable store (like the JSON
> > files
> > > > on
> > > > > > > > disk).  Zookeeper would be the source of truth for the
> running
> 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen
Yep, that makes sense, Casey.  I understand multiline is still just the
same "when" statement.  I was more responding to Otto's concern about
dealing with 50 whens.

In regards to multiline, I don't know if adding that is worth the potential
confusion.  I prefer very simple configs that are stupid simple to grok.  I
don't have a strong opinion on multiline though, so could go either way.




On Fri, Jan 13, 2017 at 10:38 AM, Casey Stella  wrote:

> Nick, yep, that's the example I showed.  I'm just suggesting that that
> "when" use the multiline JSON trick here: a single "when"
> statement with a couple of "or"'s.
> So:
> "when" : [ "exists(field1) or"
>  , "exists(field2) or"
>  , "exists(field3)"
>  ]
> would resolve to "exists(field1) or exists(field2) or exists(field3)", a
> single Stellar statement behind the scenes, because the array is joined
> with a space into a single string.
>
> On Fri, Jan 13, 2017 at 10:34 AM, Nick Allen  wrote:
>
> > I was thinking there would only be one 'when' for each output.  So if we
> > have Elasticsearch and HDFS, you would have only 2 'when's.  Each when
> > statement could be as simple or complex as you need.
> >
> > On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
> > wrote:
> >
> > > How does it look with 50 whens?
> > >
> > >
> > > On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com)
> > wrote:
> > >
> > > Ok, so here's what I'm thinking based on the discussion:
> > >
> > > - Keeping the configs that we have now (batchSize and index) as
> defaults
> > > for the unspecified writer-specific case
> > > - Adding the config Nick suggested
> > >
> > > *Base Case*:
> > > {
> > > }
> > >
> > > - all writers write all messages
> > > - index named the same as the sensor for all writers
> > > - batchSize of 1 for all writers
> > >
> > > *Writer-non-specific case*:
> > > {
> > > "index" : "foo"
> > > ,"batchSize" : 100
> > > }
> > >
> > > - All writers write all messages
> > > - index is named "foo", different from the sensor for all writers
> > > - batchSize is 100 for all writers
> > >
> > > *Writer-specific case without filters*
> > > {
> > > "index" : "foo"
> > > ,"batchSize" : 1
> > > , "writerConfig" :
> > > {
> > > "elasticsearch" : {
> > > "batchSize" : 100
> > > }
> > > }
> > > }
> > >
> > > - All writers write all messages
> > > - index is named "foo", different from the sensor for all writers
> > > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > - NOTE: I could override the index name too
> > >
> > > *Writer-specific case with filters*
> > > {
> > > "index" : "foo"
> > > ,"batchSize" : 1
> > > , "writerConfig" :
> > > {
> > > "elasticsearch" : {
> > > "batchSize" : 100,
> > > "when" : "exists(field1)"
> > > },
> > > "hdfs" : {
> > > "when" : "false"
> > > }
> > > }
> > > }
> > >
> > > - ES writer writes messages which have field1, HDFS doesn't
> > > - index is named "foo", different from the sensor for all writers
> > > - 100 for elasticsearch writers
> > >
> > > Thoughts?
> > >
> > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > > wrote:
> > >
> > > > For larger installations you need to control what is indexed so you
> > don’t
> > > > end up with a nasty elastic search situation and so you can mine the
> > data
> > > > later for reports and training ml models.
> > > >
> > > > Thanks
> > > > Carolyn
> > > >
> > > >
> > > >
> > > >
> > > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > > >
> > > > >OH that's a good idea!
> > > > >
> > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> > wrote:
> > > > >
> > > > >> I like the "Index Filtering" option based on the flexibility that
> it
> > > > >> provides. Should each output (HDFS, ES, etc) have its own
> > > configuration
> > > > >> settings? For example, aren't things like batching handled
> > separately
> > > > for
> > > > >> HDFS versus Elasticsearch?
> > > > >>
> > > > >> Something along the lines of...
> > > > >>
> > > > >> {
> > > > >> "hdfs" : {
> > > > >> "when": "exists(field1)",
> > > > >> "batchSize": 100
> > > > >> },
> > > > >>
> > > > >> "elasticsearch" : {
> > > > >> "when": "true",
> > > > >> "batchSize": 1000,
> > > > >> "index": "squid"
> > > > >> }
> > > > >> }
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  >
> > > > wrote:
> > > > >>
> > > > >> > Yeah, I tend to like the first option too. Any opposition to
> that
> > > > from
> > > > >> > anyone?
> > > > >> >
> > > > >> > The points brought up are good ones and I think that it may be
> > worth
> > > a
> > > > >> > broader discussion of the requirements of indexing in a separate
> > dev
> > > > list
> > > > >> > thread. Maybe a list of desires with coherent use-cases
> justifying
> > > > them
> > > > >> 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
I am suggesting that, yes.  The configs are essentially the same as yours,
except there is an override specified at the top level.  Without that, in
order to specify both HDFS and ES have batch sizes of 100, you have to
explicitly configure each.  It's less that I'm trying to have backwards
compatibility and more that I'm trying to make the majority case easy: both
writers write everything to a specified index name with a specified batch
size (which is what we have now).  Beyond that, I want to allow for
specifying an override for the config on a writer-by-writer basis for those
who need it.

On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:

> Are you saying we support all of these variants?  I realize you are trying
> to have some backwards compatibility, but this also makes it harder for a
> user to grok (for me at least).
>
> Personally I like my original example as there are fewer sub-structures,
> like 'writerConfig', which makes the whole thing simpler and easier to
> grok.  But maybe others will think your proposal is just as easy to grok.
>
>
>
> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:
>
> > Ok, so here's what I'm thinking based on the discussion:
> >
> >- Keeping the configs that we have now (batchSize and index) as
> defaults
> >for the unspecified writer-specific case
> >- Adding the config Nick suggested
> >
> > *Base Case*:
> > {
> > }
> >
> >- all writers write all messages
> >- index named the same as the sensor for all writers
> >- batchSize of 1 for all writers
> >
> > *Writer-non-specific case*:
> > {
> >   "index" : "foo"
> >  ,"batchSize" : 100
> > }
> >
> >- All writers write all messages
> >- index is named "foo", different from the sensor for all writers
> >- batchSize is 100 for all writers
> >
> > *Writer-specific case without filters*
> > {
> >   "index" : "foo"
> >  ,"batchSize" : 1
> >  , "writerConfig" :
> >{
> >   "elasticsearch" : {
> >"batchSize" : 100
> >  }
> >}
> > }
> >
> >- All writers write all messages
> >- index is named "foo", different from the sensor for all writers
> >- batchSize is 1 for HDFS and 100 for elasticsearch writers
> >- NOTE: I could override the index name too
> >
> > *Writer-specific case with filters*
> > {
> >   "index" : "foo"
> >  ,"batchSize" : 1
> >  , "writerConfig" :
> >{
> >   "elasticsearch" : {
> >"batchSize" : 100,
> >"when" : "exists(field1)"
> >  },
> >   "hdfs" : {
> >  "when" : "false"
> >   }
> >}
> > }
> >
> >- ES writer writes messages which have field1, HDFS doesn't
> >- index is named "foo", different from the sensor for all writers
> >- 100 for elasticsearch writers
> >
> > Thoughts?
> >
> > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > wrote:
> >
> > > For larger installations you need to control what is indexed so you
> don’t
> > > end up with a nasty elastic search situation and so you can mine the
> data
> > > later for reports and training ml models.
> > >
> > > Thanks
> > > Carolyn
> > >
> > >
> > >
> > >
> > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > >
> > > >OH that's a good idea!
> > > >
> > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> wrote:
> > > >
> > > >> I like the "Index Filtering" option based on the flexibility that it
> > > >> provides.  Should each output (HDFS, ES, etc) have its own
> > configuration
> > > >> settings?  For example, aren't things like batching handled
> separately
> > > for
> > > >> HDFS versus Elasticsearch?
> > > >>
> > > >> Something along the lines of...
> > > >>
> > > >> {
> > > >>   "hdfs" : {
> > > >> "when": "exists(field1)",
> > > >> "batchSize": 100
> > > >>   },
> > > >>
> > > >>   "elasticsearch" : {
> > > >> "when": "true",
> > > >> "batchSize": 1000,
> > > >> "index": "squid"
> > > >>   }
> > > >> }
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > > wrote:
> > > >>
> > > >> > Yeah, I tend to like the first option too.  Any opposition to that
> > > from
> > > >> > anyone?
> > > >> >
> > > >> > The points brought up are good ones and I think that it may be
> > worth a
> > > >> > broader discussion of the requirements of indexing in a separate
> dev
> > > list
> > > >> > thread.  Maybe a list of desires with coherent use-cases
> justifying
> > > them
> > > >> so
> > > >> > we can think about how this stuff should work and where the
> natural
> > > >> > extension points should be.  Afterall, we need to toe the line
> > between
> > > >> > engineering and overengineering for features nobody will want.
> > > >> >
> > > >> > I'm not 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread Nick Allen
Makes sense, Dave.  I am totally clear on the proposal.  I just wanted to
ask the stupid question to bring the conversation full circle, leave no
stone unturned, insert favorite idiom here.

On Fri, Jan 13, 2017 at 10:46 AM, David Lyle  wrote:

> To be clear- NOBODY is suggesting replacing Zookeeper with Ambari.
>
> So, as a bit of a reset- here's what's being proposed:
>
>  - Expand ambari to manage the remaining sensor-specific configs
>  - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
>to push to ambari and take an Ambari user/pw and (optionally) reason
>  - (Ambari can push to zookeeper, but it requires a service restart, so for
> "live changes" you may
> want to do both a rest call and zookeeper update from ConfigurationUtils)
> WAS
> Question remains about whether ambari can do the push to zookeeper
> or whether ConfigurationUtils has to push to zookeeper as
> well as update
> ambari.
>   - Refactor the middleware that Ryan submitted to have the API calls take
>  an Ambari user/pw and (optionally) reason
>   - Refactor the management UI to pass in an Ambari user/pw and
> (optionally) reason
>   - Refactor the Stellar Management functions CONFIG_PUT to accept an
> Ambari user/pw and (optionally) reason
>
> -D...
>
>
>
> On Fri, Jan 13, 2017 at 10:42 AM, Ryan Merriman 
> wrote:
>
> > The driver for using Zookeeper is that it is asynchronous and accepts
> > callbacks.  Ambari would need to have that capability, otherwise we have
> to
> > poll which is a deal breaker in my opinion.
> >
> > On Fri, Jan 13, 2017 at 9:28 AM, Casey Stella 
> wrote:
> >
> > > No, it was good to bring up, Nick.  I might have it wrong re: Ambari.
> > >
> > > Casey
> > >
> > > On Fri, Jan 13, 2017 at 10:27 AM, Nick Allen 
> wrote:
> > >
> > > > That makes sense.  I wasn't sure based on Matt's original
> > > > suggestion/description of Ambari, whether that was something that
> > Ambari
> > > > had also designed for or not.
> > > >
> > > > On Fri, Jan 13, 2017 at 10:14 AM, Casey Stella 
> > > wrote:
> > > >
> > > > > Polling the Ambari server via REST (or their API if they have one),
> > > would
> > > > > entail all workers hitting one server and create a single point of
> > > > failure
> > > > > (the ambari server is what serves up REST).  Zookeeper's intent is
> to
> > > not
> > > > > have a single point of failure like this and (one of its main)
> > > use-cases
> > > > is
> > > > > to serve up configs in a distributed environment.
> > > > >
> > > > > Casey
> > > > >
> > > > > On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen 
> > > wrote:
> > > > >
> > > > > > Let me ask a stupid question.  What does Zookeeper do for us that
> > > > Ambari
> > > > > > cannot?  Why keep Zookeeper in the mix?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 9:28 AM, David Lyle <
> dlyle65...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > In the main yes- I've made some changes:
> > > > > > >
> > > > > > >  - Expand ambari to manage the remaining sensor-specific
> configs
> > > > > > >  - Refactor the push calls to zookeeper (in
> ConfigurationUtils, I
> > > > > think)
> > > > > > >to push to ambari and take an Ambari user/pw and
> (optionally)
> > > > reason
> > > > > > >  - (Ambari can push to zookeeper, but it requires a service
> > > restart,
> > > > so
> > > > > > for
> > > > > > > "live changes" you may
> > > > > > > want to do both a rest call and zookeeper update from
> > > > > > ConfigurationUtils)
> > > > > > > WAS
> > > > > > > Question remains about whether ambari can do the push to
> > > > zookeeper
> > > > > > > or whether ConfigurationUtils has to push to
> > > zookeeper
> > > > as
> > > > > > > well as update
> > > > > > > ambari.
> > > > > > >   - Refactor the middleware that Ryan submitted to have the API
> > > calls
> > > > > > take
> > > > > > >  an Ambari user/pw and (optionally) reason
> > > > > > >   - Refactor the management UI to pass in an Ambari user/pw and
> > > > > > > (optionally) reason
> > > > > > >   - Refactor the Stellar Management functions CONFIG_PUT to
> > accept
> > > an
> > > > > > > Ambari user/pw and (optionally) reason
> > > > > > >
> > > > > > > I think we'd need to do some detailed design around how to
> handle
> > > > what
> > > > > we
> > > > > > > expect to be dynamic configs, but the main principle should
> (imo)
> > > be
> > > > to
> > > > > > > always know who and why and make sure that Ambari is aware and
> is
> > > the
> > > > > > > static backing store for Zookeeper.
> > > > > > >
> > > > > > > -D...
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella <
> > ceste...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > So, basically, your proposed changes, broken into tangible
> > > gobbets
> > > > of
> > > > > > > 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler
This is an excellent point


On January 13, 2017 at 10:54:07, Simon Elliston Ball 
(si...@simonellistonball.com) wrote:

Something else to consider here is the possibility of multiple indices within 
a given target technology.  

For example, if I’m indexing data from a given sensor into, say solr, I may 
want it filtered differently into two different indices. This would enable me 
to create different ‘views’ which could have different security settings 
applied in that backend. This would be useful for multi-tenant installs, and 
for differing data privilege levels within an organisation. You could argue 
that this is more a concern for filtering of the results coming out of an 
index, but currently this is a lot harder than using something like the ranger 
solr authorisation plugin to control access at an index by index granularity.  

Essentially, the indexer topology then becomes a filter and router, which 
argues for it being a separate step, before the process which actually writes 
out to each platform. It may also make sense to have a concept of a routing key 
built up by earlier enrichment to allow shuffle control in storm, rather than a 
full stellar statement for routing, to avoid overhead.  

Simon  

> On 13 Jan 2017, at 07:44, Casey Stella  wrote:  
>  
> I am suggesting that, yes. The configs are essentially the same as yours,  
> except there is an override specified at the top level. Without that, in  
> order to specify both HDFS and ES have batch sizes of 100, you have to  
> explicitly configure each. It's less that I'm trying to have backwards  
> compatibility and more that I'm trying to make the majority case easy: both  
> writers write everything to a specified index name with a specified batch  
> size (which is what we have now). Beyond that, I want to allow for  
> specifying an override for the config on a writer-by-writer basis for those  
> who need it.  
>  
> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:  
>  
>> Are you saying we support all of these variants? I realize you are trying  
>> to have some backwards compatibility, but this also makes it harder for a  
>> user to grok (for me at least).  
>>  
>> Personally I like my original example as there are fewer sub-structures,  
>> like 'writerConfig', which makes the whole thing simpler and easier to  
>> grok. But maybe others will think your proposal is just as easy to grok.  
>>  
>>  
>>  
>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:  
>>  
>>> Ok, so here's what I'm thinking based on the discussion:  
>>>  
>>> - Keeping the configs that we have now (batchSize and index) as  
>> defaults  
>>> for the unspecified writer-specific case  
>>> - Adding the config Nick suggested  
>>>  
>>> *Base Case*:  
>>> {  
>>> }  
>>>  
>>> - all writers write all messages  
>>> - index named the same as the sensor for all writers  
>>> - batchSize of 1 for all writers  
>>>  
>>> *Writer-non-specific case*:  
>>> {  
>>> "index" : "foo"  
>>> ,"batchSize" : 100  
>>> }  
>>>  
>>> - All writers write all messages  
>>> - index is named "foo", different from the sensor for all writers  
>>> - batchSize is 100 for all writers  
>>>  
>>> *Writer-specific case without filters*  
>>> {  
>>> "index" : "foo"  
>>> ,"batchSize" : 1  
>>> , "writerConfig" :  
>>> {  
>>> "elasticsearch" : {  
>>> "batchSize" : 100  
>>> }  
>>> }  
>>> }  
>>>  
>>> - All writers write all messages  
>>> - index is named "foo", different from the sensor for all writers  
>>> - batchSize is 1 for HDFS and 100 for elasticsearch writers  
>>> - NOTE: I could override the index name too  
>>>  
>>> *Writer-specific case with filters*  
>>> {  
>>> "index" : "foo"  
>>> ,"batchSize" : 1  
>>> , "writerConfig" :  
>>> {  
>>> "elasticsearch" : {  
>>> "batchSize" : 100,  
>>> "when" : "exists(field1)"  
>>> },  
>>> "hdfs" : {  
>>> "when" : "false"  
>>> }  
>>> }  
>>> }  
>>>  
>>> - ES writer writes messages which have field1, HDFS doesn't  
>>> - index is named "foo", different from the sensor for all writers  
>>> - 100 for elasticsearch writers  
>>>  
>>> Thoughts?  
>>>  
>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby   
>>> wrote:  
>>>  
 For larger installations you need to control what is indexed so you  
>> don’t  
 end up with a nasty elastic search situation and so you can mine the  
>> data  
 later for reports and training ml models.  
  
 Thanks  
 Carolyn  
  
  
  
  
 On 1/13/17, 9:40 AM, "Casey Stella"  wrote:  
  
> OH that's a good idea!  
>  
> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen   
>> wrote:  
>  
>> I like the "Index Filtering" option based on the flexibility that it  
>> provides. Should each output (HDFS, ES, etc) have its own  
>>> configuration  
>> settings? For example, aren't things like 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen
I was thinking there would only be one 'when' for each output.  So if we
have Elasticsearch and HDFS, you would have only 2 'when's.  Each when
statement could be as simple or complex as you need.

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens?
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be worth
> a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. Afterall, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should have
> a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case that we need to consider. Take a
> simple
> > >> > > example... If I have data coming in from 3 different firewall
> > vendors
> > >> > and 2
> > >> > > different web proxy/url filtering vendors and I want to be able to
> > >> > analyze
> > >> > > that data set, I need the data to be indexed all together (likely
> in
> > >> > HDFS)
> > >> > > and to have a normalized schema such that IP address, URL, and
> user
> > 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
Nick, yep, that's the example I showed.  I'm just suggesting that the "when"
use the multiline JSON trick here
.  A single "when"
statement with a couple of "or"'s.
So:
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]
would resolve to "exists(field1) or exists(field2) or exists(field3)", a
single Stellar statement behind the scenes, because the array is joined with
spaces into a single string.
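
In other words (a minimal sketch, assuming the config loader simply joins
array-valued fields with a space before handing them to Stellar):

import java.util.Arrays;
import java.util.List;

public class WhenJoinSketch {
  public static void main(String[] args) {
    // The "multiline JSON trick": an array-valued "when" is joined on
    // spaces into one Stellar expression before parsing.
    List<String> when = Arrays.asList(
        "exists(field1) or",
        "exists(field2) or",
        "exists(field3)");
    String stellar = String.join(" ", when);
    // Prints: exists(field1) or exists(field2) or exists(field3)
    System.out.println(stellar);
  }
}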

On Fri, Jan 13, 2017 at 10:34 AM, Nick Allen  wrote:

> I was thinking there would only be one 'when' for each output.  So if we
> have Elasticsearch and HDFS, you would have only 2 'when's.  Each when
> statement could be as simple or complex as you need.
>
> On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
> wrote:
>
> > How does it look with 50 whens?
> >
> >
> > On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com)
> wrote:
> >
> > Ok, so here's what I'm thinking based on the discussion:
> >
> > - Keeping the configs that we have now (batchSize and index) as defaults
> > for the unspecified writer-specific case
> > - Adding the config Nick suggested
> >
> > *Base Case*:
> > {
> > }
> >
> > - all writers write all messages
> > - index named the same as the sensor for all writers
> > - batchSize of 1 for all writers
> >
> > *Writer-non-specific case*:
> > {
> > "index" : "foo"
> > ,"batchSize" : 100
> > }
> >
> > - All writers write all messages
> > - index is named "foo", different from the sensor for all writers
> > - batchSize is 100 for all writers
> >
> > *Writer-specific case without filters*
> > {
> > "index" : "foo"
> > ,"batchSize" : 1
> > , "writerConfig" :
> > {
> > "elasticsearch" : {
> > "batchSize" : 100
> > }
> > }
> > }
> >
> > - All writers write all messages
> > - index is named "foo", different from the sensor for all writers
> > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > - NOTE: I could override the index name too
> >
> > *Writer-specific case with filters*
> > {
> > "index" : "foo"
> > ,"batchSize" : 1
> > , "writerConfig" :
> > {
> > "elasticsearch" : {
> > "batchSize" : 100,
> > "when" : "exists(field1)"
> > },
> > "hdfs" : {
> > "when" : "false"
> > }
> > }
> > }
> >
> > - ES writer writes messages which have field1, HDFS doesn't
> > - index is named "foo", different from the sensor for all writers
> > - 100 for elasticsearch writers
> >
> > Thoughts?
> >
> > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > wrote:
> >
> > > For larger installations you need to control what is indexed so you
> don’t
> > > end up with a nasty elastic search situation and so you can mine the
> data
> > > later for reports and training ml models.
> > >
> > > Thanks
> > > Carolyn
> > >
> > >
> > >
> > >
> > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > >
> > > >OH that's a good idea!
> > > >
> > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> wrote:
> > > >
> > > >> I like the "Index Filtering" option based on the flexibility that it
> > > >> provides. Should each output (HDFS, ES, etc) have its own
> > configuration
> > > >> settings? For example, aren't things like batching handled
> separately
> > > for
> > > >> HDFS versus Elasticsearch?
> > > >>
> > > >> Something along the lines of...
> > > >>
> > > >> {
> > > >> "hdfs" : {
> > > >> "when": "exists(field1)",
> > > >> "batchSize": 100
> > > >> },
> > > >>
> > > >> "elasticsearch" : {
> > > >> "when": "true",
> > > >> "batchSize": 1000,
> > > >> "index": "squid"
> > > >> }
> > > >> }
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > > wrote:
> > > >>
> > > >> > Yeah, I tend to like the first option too. Any opposition to that
> > > from
> > > >> > anyone?
> > > >> >
> > > >> > The points brought up are good ones and I think that it may be
> worth
> > a
> > > >> > broader discussion of the requirements of indexing in a separate
> dev
> > > list
> > > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > > them
> > > >> so
> > > >> > we can think about how this stuff should work and where the
> natural
> > > >> > extension points should be. Afterall, we need to toe the line
> > between
> > > >> > engineering and overengineering for features nobody will want.
> > > >> >
> > > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > > >> between
> > > >> > the notions that we should have no standard fields vs we should
> have
> > a
> > > >> > boatload of standard fields (with most of them empty). I exchange
> > > >> > positions fairly regularly on that question. ;) It may be worth a
> > dev
> > > >> list
> > > >> > discussion to lay out how you imagine an extension of standard
> > fields
> > > and
> > > >> > how it might look as implemented in Metron.
> > > >> >
> > > >> > Casey
> > > 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread David Lyle
To be clear- NOBODY is suggesting replacing Zookeeper with Ambari.

So, as a bit of a reset- here's what's being proposed:

 - Expand ambari to manage the remaining sensor-specific configs
 - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
   to push to ambari and take an Ambari user/pw and (optionally) reason
 - (Ambari can push to zookeeper, but it requires a service restart, so for
"live changes" you may
want to do both a rest call and zookeeper update from ConfigurationUtils)
WAS
Question remains about whether ambari can do the push to zookeeper
or whether ConfigurationUtils has to push to zookeeper as
well as update
ambari.
  - Refactor the middleware that Ryan submitted to have the API calls take
 an Ambari user/pw and (optionally) reason
  - Refactor the management UI to pass in an Ambari user/pw and
(optionally) reason
  - Refactor the Stellar Management functions CONFIG_PUT to accept an
Ambari user/pw and (optionally) reason
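
To make the ConfigurationUtils piece concrete, a rough sketch of the
proposed flow (the Curator calls are real API; the Ambari update is a
hypothetical placeholder, since the exact REST endpoint and payload would
be part of the detailed design):

import org.apache.curator.framework.CuratorFramework;

public class ConfigPushSketch {

  public static void pushConfig(CuratorFramework client, String zkPath,
                                byte[] config, String ambariUser,
                                String ambariPassword, String reason)
      throws Exception {
    // 1. Record the change in Ambari first, so it stays the static backing
    //    store and we always know who changed what and why.
    recordInAmbari(config, ambariUser, ambariPassword, reason);

    // 2. Push the same bytes straight to Zookeeper so running topologies
    //    pick up the change without a service restart.
    if (client.checkExists().forPath(zkPath) == null) {
      client.create().creatingParentsIfNeeded().forPath(zkPath, config);
    } else {
      client.setData().forPath(zkPath, config);
    }
  }

  private static void recordInAmbari(byte[] config, String user, String pw,
                                     String reason) {
    // Placeholder for the Ambari REST call; intentionally left abstract.
  }
}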

-D...



On Fri, Jan 13, 2017 at 10:42 AM, Ryan Merriman  wrote:

> The driver for using Zookeeper is that it is asynchronous and accepts
> callbacks.  Ambari would need to have that capability, otherwise we have to
> poll which is a deal breaker in my opinion.
>
> On Fri, Jan 13, 2017 at 9:28 AM, Casey Stella  wrote:
>
> > No, it was good to bring up, Nick.  I might have it wrong re: Ambari.
> >
> > Casey
> >
> > On Fri, Jan 13, 2017 at 10:27 AM, Nick Allen  wrote:
> >
> > > That makes sense.  I wasn't sure based on Matt's original
> > > suggestion/description of Ambari, whether that was something that
> Ambari
> > > had also designed for or not.
> > >
> > > On Fri, Jan 13, 2017 at 10:14 AM, Casey Stella 
> > wrote:
> > >
> > > > Polling the Ambari server via REST (or their API if they have one),
> > would
> > > > entail all workers hitting one server and create a single point of
> > > failure
> > > > (the ambari server is what serves up REST).  Zookeeper's intent is to
> > not
> > > > have a single point of failure like this and (one of its main)
> > use-cases
> > > is
> > > > to serve up configs in a distributed environment.
> > > >
> > > > Casey
> > > >
> > > > On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen 
> > wrote:
> > > >
> > > > > Let me ask a stupid question.  What does Zookeeper do for us that
> > > Ambari
> > > > > cannot?  Why keep Zookeeper in the mix?
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 13, 2017 at 9:28 AM, David Lyle 
> > > > wrote:
> > > > >
> > > > > > In the main yes- I've made some changes:
> > > > > >
> > > > > >  - Expand ambari to manage the remaining sensor-specific configs
> > > > > >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I
> > > > think)
> > > > > >to push to ambari and take an Ambari user/pw and (optionally)
> > > reason
> > > > > >  - (Ambari can push to zookeeper, but it requires a service
> > restart,
> > > so
> > > > > for
> > > > > > "live changes" you may
> > > > > > want to do both a rest call and zookeeper update from
> > > > > ConfigurationUtils)
> > > > > > WAS
> > > > > > Question remains about whether ambari can do the push to
> > > zookeeper
> > > > > > or whether ConfigurationUtils has to push to
> > zookeeper
> > > as
> > > > > > well as update
> > > > > > ambari.
> > > > > >   - Refactor the middleware that Ryan submitted to have the API
> > calls
> > > > > take
> > > > > >  an Ambari user/pw and (optionally) reason
> > > > > >   - Refactor the management UI to pass in an Ambari user/pw and
> > > > > > (optionally) reason
> > > > > >   - Refactor the Stellar Management functions CONFIG_PUT to
> accept
> > an
> > > > > > Ambari user/pw and (optionally) reason
> > > > > >
> > > > > > I think we'd need to do some detailed design around how to handle
> > > what
> > > > we
> > > > > > expect to be dynamic configs, but the main principle should (imo)
> > be
> > > to
> > > > > > always know who and why and make sure that Ambari is aware and is
> > the
> > > > > > static backing store for Zookeeper.
> > > > > >
> > > > > > -D...
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 9:19 AM, Casey Stella <
> ceste...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > So, basically, your proposed changes, broken into tangible
> > gobbets
> > > of
> > > > > > work:
> > > > > > >
> > > > > > >- Expand ambari to manage the remaining sensor-specific
> > configs
> > > > > > >- Refactor the push calls to zookeeper (in
> > ConfigurationUtils, I
> > > > > > think)
> > > > > > >to push to ambari and take a reason
> > > > > > >   - Question remains about whether ambari can do the push
> to
> > > > > > zookeeper
> > > > > > >   or whether ConfigurationUtils has to push to zookeeper as
> > > well
> > > > as
> > > > > > > update
> > > > > > >   ambari.
> > > > > > >- 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Simon Elliston Ball
Something else to consider here is the possibility of multiple indices within 
a given target technology.

For example, if I’m indexing data from a given sensor into, say solr, I may 
want it filtered differently into two different indices. This would enable me 
to create different ‘views’ which could have different security settings 
applied in that backend. This would be useful for multi-tenant installs, and 
for differing data privilege levels within an organisation. You could argue 
that this is more a concern for filtering of the results coming out of an 
index, but currently this is a lot harder than using something like the ranger 
solr authorisation plugin to control access at an index by index granularity. 

Essentially, the indexer topology then becomes a filter and router, which 
argues for it being a separate step, before the process which actually writes 
out to each platform. It may also make sense to have a concept of a routing key 
built up by earlier enrichment to allow shuffle control in storm, rather than a 
full stellar statement for routing, to avoid overhead.
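
As a hedged sketch of that router idea (component, stream, and field names
are invented for illustration, and the packages assume Storm 1.x; this is
not existing Metron code):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class RouterBoltSketch extends BaseRichBolt {
  private OutputCollector collector;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  @Override
  public void execute(Tuple tuple) {
    String message = tuple.getStringByField("message");
    // The routing key was built up by earlier enrichment, so routing here
    // is a cheap field read rather than a full Stellar evaluation.
    String routingKey = tuple.getStringByField("routingKey");
    if (matchesSolrFilter(message)) {
      collector.emit("solr", tuple, new Values(routingKey, message));
    }
    if (matchesHdfsFilter(message)) {
      collector.emit("hdfs", tuple, new Values(routingKey, message));
    }
    collector.ack(tuple);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("solr", new Fields("routingKey", "message"));
    declarer.declareStream("hdfs", new Fields("routingKey", "message"));
  }

  // Placeholder filters; in practice these would be the per-index filters
  // (or views) discussed above.
  private boolean matchesSolrFilter(String message) { return true; }
  private boolean matchesHdfsFilter(String message) { return true; }
}

Downstream writer bolts would then subscribe with a fields grouping on
"routingKey" (e.g. fieldsGrouping("indexRouter", "solr", new
Fields("routingKey"))) so the shuffle is controlled by the key.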

Simon

> On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> 
> I am suggesting that, yes.  The configs are essentially the same as yours,
> except there is an override specified at the top level.  Without that, in
> order to specify both HDFS and ES have batch sizes of 100, you have to
> explicitly configure each.  It's less that I'm trying to have backwards
> compatibility and more that I'm trying to make the majority case easy: both
> writers write everything to a specified index name with a specified batch
> size (which is what we have now).  Beyond that, I want to allow for
> specifying an override for the config on a writer-by-writer basis for those
> who need it.
> 
> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> 
>> Are you saying we support all of these variants?  I realize you are trying
>> to have some backwards compatibility, but this also makes it harder for a
>> user to grok (for me at least).
>> 
>> Personally I like my original example as there are fewer sub-structures,
>> like 'writerConfig', which makes the whole thing simpler and easier to
>> grok.  But maybe others will think your proposal is just as easy to grok.
>> 
>> 
>> 
>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:
>> 
>>> Ok, so here's what I'm thinking based on the discussion:
>>> 
>>>   - Keeping the configs that we have now (batchSize and index) as
>> defaults
>>>   for the unspecified writer-specific case
>>>   - Adding the config Nick suggested
>>> 
>>> *Base Case*:
>>> {
>>> }
>>> 
>>>   - all writers write all messages
>>>   - index named the same as the sensor for all writers
>>>   - batchSize of 1 for all writers
>>> 
>>> *Writer-non-specific case*:
>>> {
>>>  "index" : "foo"
>>> ,"batchSize" : 100
>>> }
>>> 
>>>   - All writers write all messages
>>>   - index is named "foo", different from the sensor for all writers
>>>   - batchSize is 100 for all writers
>>> 
>>> *Writer-specific case without filters*
>>> {
>>>  "index" : "foo"
>>> ,"batchSize" : 1
>>> , "writerConfig" :
>>>   {
>>>  "elasticsearch" : {
>>>   "batchSize" : 100
>>> }
>>>   }
>>> }
>>> 
>>>   - All writers write all messages
>>>   - index is named "foo", different from the sensor for all writers
>>>   - batchSize is 1 for HDFS and 100 for elasticsearch writers
>>>   - NOTE: I could override the index name too
>>> 
>>> *Writer-specific case with filters*
>>> {
>>>  "index" : "foo"
>>> ,"batchSize" : 1
>>> , "writerConfig" :
>>>   {
>>>  "elasticsearch" : {
>>>   "batchSize" : 100,
>>>   "when" : "exists(field1)"
>>> },
>>>  "hdfs" : {
>>> "when" : "false"
>>>  }
>>>   }
>>> }
>>> 
>>>   - ES writer writes messages which have field1, HDFS doesn't
>>>   - index is named "foo", different from the sensor for all writers
>>>   - 100 for elasticsearch writers
>>> 
>>> Thoughts?
>>> 
>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
>>> wrote:
>>> 
 For larger installations you need to control what is indexed so you
>> don’t
 end up with a nasty elastic search situation and so you can mine the
>> data
 later for reports and training ml models.
 
 Thanks
 Carolyn
 
 
 
 
 On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
 
> OH that's a good idea!
> 
> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
>> wrote:
> 
>> I like the "Index Filtering" option based on the flexibility that it
>> provides.  Should each output (HDFS, ES, etc) have its own
>>> configuration
>> settings?  For example, aren't things like batching handled
>> separately
 for
>> HDFS versus 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread David Lyle
Casey,

Can you give me a level set of what your thinking is now? I think it's
global control of all index types + overrides on a per-type basis. Fwiw,
I'm totally for that, but I want to make sure I'm not imposing my
preconceived notions on your consensus-driven ones.

-D

On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella  wrote:

> I am suggesting that, yes.  The configs are essentially the same as yours,
> except there is an override specified at the top level.  Without that, in
> order to specify both HDFS and ES have batch sizes of 100, you have to
> explicitly configure each.  It's less that I'm trying to have backwards
> compatibility and more that I'm trying to make the majority case easy: both
> writers write everything to a specified index name with a specified batch
> size (which is what we have now).  Beyond that, I want to allow for
> specifying an override for the config on a writer-by-writer basis for those
> who need it.
>
> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
>
> > Are you saying we support all of these variants?  I realize you are
> trying
> > to have some backwards compatibility, but this also makes it harder for a
> > user to grok (for me at least).
> >
> > Personally I like my original example as there are fewer sub-structures,
> > like 'writerConfig', which makes the whole thing simpler and easier to
> > grok.  But maybe others will think your proposal is just as easy to grok.
> >
> >
> >
> > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >
> > > Ok, so here's what I'm thinking based on the discussion:
> > >
> > >- Keeping the configs that we have now (batchSize and index) as
> > defaults
> > >for the unspecified writer-specific case
> > >- Adding the config Nick suggested
> > >
> > > *Base Case*:
> > > {
> > > }
> > >
> > >- all writers write all messages
> > >- index named the same as the sensor for all writers
> > >- batchSize of 1 for all writers
> > >
> > > *Writer-non-specific case*:
> > > {
> > >   "index" : "foo"
> > >  ,"batchSize" : 100
> > > }
> > >
> > >- All writers write all messages
> > >- index is named "foo", different from the sensor for all writers
> > >- batchSize is 100 for all writers
> > >
> > > *Writer-specific case without filters*
> > > {
> > >   "index" : "foo"
> > >  ,"batchSize" : 1
> > >  , "writerConfig" :
> > >{
> > >   "elasticsearch" : {
> > >"batchSize" : 100
> > >  }
> > >}
> > > }
> > >
> > >- All writers write all messages
> > >- index is named "foo", different from the sensor for all writers
> > >- batchSize is 1 for HDFS and 100 for elasticsearch writers
> > >- NOTE: I could override the index name too
> > >
> > > *Writer-specific case with filters*
> > > {
> > >   "index" : "foo"
> > >  ,"batchSize" : 1
> > >  , "writerConfig" :
> > >{
> > >   "elasticsearch" : {
> > >"batchSize" : 100,
> > >"when" : "exists(field1)"
> > >  },
> > >   "hdfs" : {
> > >  "when" : "false"
> > >   }
> > >}
> > > }
> > >
> > >- ES writer writes messages which have field1, HDFS doesn't
> > >- index is named "foo", different from the sensor for all writers
> > >- 100 for elasticsearch writers
> > >
> > > Thoughts?
> > >
> > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > > wrote:
> > >
> > > > For larger installations you need to control what is indexed so you
> > don’t
> > > > end up with a nasty elastic search situation and so you can mine the
> > data
> > > > later for reports and training ml models.
> > > >
> > > > Thanks
> > > > Carolyn
> > > >
> > > >
> > > >
> > > >
> > > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > > >
> > > > >OH that's a good idea!
> > > > >
> > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> > wrote:
> > > > >
> > > > >> I like the "Index Filtering" option based on the flexibility that
> it
> > > > >> provides.  Should each output (HDFS, ES, etc) have its own
> > > configuration
> > > > >> settings?  For example, aren't things like batching handled
> > separately
> > > > for
> > > > >> HDFS versus Elasticsearch?
> > > > >>
> > > > >> Something along the lines of...
> > > > >>
> > > > >> {
> > > > >>   "hdfs" : {
> > > > >> "when": "exists(field1)",
> > > > >> "batchSize": 100
> > > > >>   },
> > > > >>
> > > > >>   "elasticsearch" : {
> > > > >> "when": "true",
> > > > >> "batchSize": 1000,
> > > > >> "index": "squid"
> > > > >>   }
> > > > >> }
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  >
> > > > wrote:
> > > > >>
> > > > >> > Yeah, I 

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-13 Thread zeo...@gmail.com
Right, good conversation to bring up for sure.

Just to comment on production generally only being updated during
maintenance windows - I can tell you that my plans are to make my dev,
test, and prod Metron a very dynamic and frequently changing environment
with coordinated but frequent modifications, and I strongly prefer to
avoid restarts wherever I can.  Of course it will
happen, but keeping it to a minimum is key.
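
For what it's worth, the restart-free part is exactly what the Zookeeper
watch mechanism buys us today; a minimal Curator sketch (the connect string
and znode path are illustrative only):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConfigWatcherSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "node1:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Watch a config znode; the listener fires asynchronously on change,
    // so workers reload config without any service restart.
    final NodeCache cache = new NodeCache(client, "/metron/topology/global");
    cache.getListenable().addListener(() -> {
      if (cache.getCurrentData() != null) {
        byte[] updated = cache.getCurrentData().getData();
        System.out.println("Reloaded config: " + new String(updated, "UTF-8"));
      }
    });
    cache.start();

    Thread.sleep(Long.MAX_VALUE); // keep watching
  }
}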

Jon

On Fri, Jan 13, 2017 at 10:53 AM Nick Allen  wrote:

> Makes sense, Dave.  I am totally clear on the proposal.  I just wanted to
> ask the stupid question to bring the conversation full circle, leave no
> stone unturned, insert favorite idiom here.
>
> On Fri, Jan 13, 2017 at 10:46 AM, David Lyle  wrote:
>
> > To be clear- NOBODY is suggesting replacing Zookeeper with Ambari.
> >
> > So, as a bit of a reset- here's what's being proposed:
> >
> >  - Expand ambari to manage the remaining sensor-specific configs
> >  - Refactor the push calls to zookeeper (in ConfigurationUtils, I think)
> >to push to ambari and take an Ambari user/pw and (optionally) reason
> >  - (Ambari can push to zookeeper, but it requires a service restart, so
> for
> > "live changes" you may
> > want to do both a rest call and zookeeper update from
> ConfigurationUtils)
> > WAS
> > Question remains about whether ambari can do the push to zookeeper
> > or whether ConfigurationUtils has to push to zookeeper as
> > well as update
> > ambari.
> >   - Refactor the middleware that Ryan submitted to have the API calls
> take
> >  an Ambari user/pw and (optionally) reason
> >   - Refactor the management UI to pass in an Ambari user/pw and
> > (optionally) reason
> >   - Refactor the Stellar Management functions CONFIG_PUT to accept an
> > Ambari user/pw and (optionally) reason
> >
> > -D...
> >
> >
> >
> > On Fri, Jan 13, 2017 at 10:42 AM, Ryan Merriman 
> > wrote:
> >
> > > The driver for using Zookeeper is that it is asynchronous and accepts
> > > callbacks.  Ambari would need to have that capability, otherwise we
> have
> > to
> > > poll which is a deal breaker in my opinion.
> > >
> > > On Fri, Jan 13, 2017 at 9:28 AM, Casey Stella 
> > wrote:
> > >
> > > > No, it was good to bring up, Nick.  I might have it wrong re: Ambari.
> > > >
> > > > Casey
> > > >
> > > > On Fri, Jan 13, 2017 at 10:27 AM, Nick Allen 
> > wrote:
> > > >
> > > > > That makes sense.  I wasn't sure based on Matt's original
> > > > > suggestion/description of Ambari, whether that was something that
> > > Ambari
> > > > > had also designed for or not.
> > > > >
> > > > > On Fri, Jan 13, 2017 at 10:14 AM, Casey Stella  >
> > > > wrote:
> > > > >
> > > > > > Polling the Ambari server via REST (or their API if they have
> one),
> > > > would
> > > > > > entail all workers hitting one server and create a single point
> of
> > > > > failure
> > > > > > (the ambari server is what serves up REST).  Zookeeper's intent
> is
> > to
> > > > not
> > > > > > have a single point of failure like this and (one of its main)
> > > > use-cases
> > > > > is
> > > > > > to serve up configs in a distributed environment.
> > > > > >
> > > > > > Casey
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 9:55 AM, Nick Allen 
> > > > wrote:
> > > > > >
> > > > > > > Let me ask a stupid question.  What does Zookeeper do for us
> that
> > > > > Ambari
> > > > > > > cannot?  Why keep Zookeeper in the mix?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 13, 2017 at 9:28 AM, David Lyle <
> > dlyle65...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > In the main yes- I've made some changes:
> > > > > > > >
> > > > > > > >  - Expand ambari to manage the remaining sensor-specific
> > configs
> > > > > > > >  - Refactor the push calls to zookeeper (in
> > ConfigurationUtils, I
> > > > > > think)
> > > > > > > >to push to ambari and take an Ambari user/pw and
> > (optionally)
> > > > > reason
> > > > > > > >  - (Ambari can push to zookeeper, but it requires a service
> > > > restart,
> > > > > so
> > > > > > > for
> > > > > > > > "live changes" you may
> > > > > > > > want to do both a rest call and zookeeper update from
> > > > > > > ConfigurationUtils)
> > > > > > > > WAS
> > > > > > > > Question remains about whether ambari can do the push to
> > > > > zookeeper
> > > > > > > > or whether ConfigurationUtils has to push to
> > > > zookeeper
> > > > > as
> > > > > > > > well as update
> > > > > > > > ambari.
> > > > > > > >   - Refactor the middleware that Ryan submitted to have the
> API
> > > > calls
> > > > > > > take
> > > > > > > >  an Ambari user/pw and (optionally) reason
> > > > > > > >   - Refactor the management UI to pass in an Ambari user/pw
> and
> > > > > > > > (optionally) reason
> > > > > > > >   - Refactor the 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
Dave,
For the benefit of posterity and people who might not be as deeply
entangled in the system as we have been, I'll recap things and hopefully
answer your question in the process.

Historically the index configuration is split between the enrichment
configs and the global configs.

   - The global config controls settings that apply to all sensors.
   Historically this has been stuff like index connection strings, etc.
   - The sensor-specific configs control things that vary by sensor.

As of METRON-652 (currently in review), we moved the sensor-specific
indexing configs out of the enrichment configs into their own files.  The
proposal here is to increase the granularity of the sensor-specific files
to make them support index writer-specific configs.  Right now the indexing
topology has 2 fixed writers: ES/Solr and HDFS.

The proposed configuration would allow you to either specify a blanket
sensor-level config for the index name and batchSize and/or override at the
writer level, thereby supporting a couple of use-cases:

   - Turning off certain index writers (e.g. HDFS)
   - Filtering the messages written to certain index writers
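
To make those two use-cases concrete, here is a hedged sketch of the
per-writer "when" semantics (the types are illustrative stand-ins; in
Metron the predicate would be a compiled Stellar expression and the
writers the real ES/HDFS ones):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class WriterFilterSketch {

  // Each writer gets its own "when" predicate; a message is handed to a
  // writer only if its predicate passes.
  public static void route(Map<String, Object> message,
                           Map<String, Predicate<Map<String, Object>>> whenByWriter) {
    whenByWriter.forEach((writer, when) -> {
      if (when.test(message)) {
        System.out.println("writing to " + writer); // stand-in for writer.write(...)
      }
    });
  }

  public static void main(String[] args) {
    Map<String, Predicate<Map<String, Object>>> when = new HashMap<>();
    when.put("elasticsearch", m -> m.containsKey("field1")); // "exists(field1)"
    when.put("hdfs", m -> false);                            // "false" turns HDFS off

    Map<String, Object> message = new HashMap<>();
    message.put("field1", "x");
    route(message, when); // writes only to elasticsearch
  }
}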

The two competing configs between Nick and me are as follows:

   - I want to make sure we keep the old sensor-specific defaults with
   writer-specific overrides available
   - Nick thought we could simplify the permutations by making the indexing
   config only the writer-level configs.

My concern about Nick's suggestion was that the default and majority
case, specifying the index and the batchSize for all writers (the one we
support now), would require more configuration.

Nick's concerns about my suggestion were that it was overly complex and
hard to grok, and that we could dispense with backwards compatibility and
make people do a bit more work in the default case for the benefit of a
simpler advanced case. (Nick, make sure I don't misstate your position.)

Casey


On Fri, Jan 13, 2017 at 10:54 AM, David Lyle  wrote:

> Casey,
>
> Can you give me a level set of what your thinking is now? I think it's
> global control of all index types + overrides on a per-type basis. Fwiw,
> I'm totally for that, but I want to make sure I'm not imposing my
> pre-concieved notions on your consensus-driven ones.
>
> -D
>
> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella  wrote:
>
> > I am suggesting that, yes.  The configs are essentially the same as
> yours,
> > except there is an override specified at the top level.  Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each.  It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy:
> both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now).  Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for
> those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> > > Are you saying we support all of these variants?  I realize you are
> > trying
> > > to have some backwards compatibility, but this also makes it harder
> for a
> > > user to grok (for me at least).
> > >
> > > Personally I like my original example as there are fewer
> sub-structures,
> > > like 'writerConfig', which makes the whole thing simpler and easier to
> > > grok.  But maybe others will think your proposal is just as easy to
> grok.
> > >
> > >
> > >
> > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> > wrote:
> > >
> > > > Ok, so here's what I'm thinking based on the discussion:
> > > >
> > > >- Keeping the configs that we have now (batchSize and index) as
> > > defaults
> > > >for the unspecified writer-specific case
> > > >- Adding the config Nick suggested
> > > >
> > > > *Base Case*:
> > > > {
> > > > }
> > > >
> > > >- all writers write all messages
> > > >- index named the same as the sensor for all writers
> > > >- batchSize of 1 for all writers
> > > >
> > > > *Writer-non-specific case*:
> > > > {
> > > >   "index" : "foo"
> > > >  ,"batchSize" : 100
> > > > }
> > > >
> > > >- All writers write all messages
> > > >- index is named "foo", different from the sensor for all writers
> > > >- batchSize is 100 for all writers
> > > >
> > > > *Writer-specific case without filters*
> > > > {
> > > >   "index" : "foo"
> > > >  ,"batchSize" : 1
> > > >  , "writerConfig" :
> > > >{
> > > >   "elasticsearch" : {
> > > >"batchSize" : 100
> > > >  }
> > > >}
> > > > }
> > > >
> > > >- All writers write all messages
> > > >- index is named "foo", different from the sensor for all writers
> > > >- batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > >- NOTE: I could override the index 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella
Simon,

Great thought.  I had considered it, but didn't want to bite off all that
as part of a PR.  I thought baby-steps for the moment would be best.
Perhaps this deserves its own JIRA and discussion?

On Fri, Jan 13, 2017 at 10:53 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> Something else to consider here is the possibility of multiple indices
> within a given target technology.
>
> For example, if I’m indexing data from a given sensor into, say solr, I
> may want it filtered differently into two different indices. This would
> enable me to create different ‘views’ which could have different security
> settings applied in that backend. This would be useful for multi-tenant
> installs, and for differing data privilege levels within an organisation.
> You could argue that this is more a concern for filtering of the results
> coming out of an index, but currently this is a lot harder than using
> something like the ranger solr authorisation plugin to control access at an
> index by index granularity.
>
> Essentially, the indexer topology then becomes a filter and router, which
> argues for it being a separate step, before the process which actually
> writes out to each platform. It may also make sense to have a concept of a
> routing key built up by earlier enrichment to allow shuffle control in
> storm, rather than a full stellar statement for routing, to avoid overhead.
>
> Simon
>
> > On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> >
> > I am suggesting that, yes.  The configs are essentially the same as
> yours,
> > except there is an override specified at the top level.  Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each.  It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy:
> both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now).  Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for
> those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> >> Are you saying we support all of these variants?  I realize you are
> trying
> >> to have some backwards compatibility, but this also makes it harder for
> a
> >> user to grok (for me at least).
> >>
> >> Personally I like my original example as there are fewer sub-structures,
> >> like 'writerConfig', which makes the whole thing simpler and easier to
> >> grok.  But maybe others will think your proposal is just as easy to
> grok.
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >>
> >>> Ok, so here's what I'm thinking based on the discussion:
> >>>
> >>>   - Keeping the configs that we have now (batchSize and index) as
> >> defaults
> >>>   for the unspecified writer-specific case
> >>>   - Adding the config Nick suggested
> >>>
> >>> *Base Case*:
> >>> {
> >>> }
> >>>
> >>>   - all writers write all messages
> >>>   - index named the same as the sensor for all writers
> >>>   - batchSize of 1 for all writers
> >>>
> >>> *Writer-non-specific case*:
> >>> {
> >>>  "index" : "foo"
> >>> ,"batchSize" : 100
> >>> }
> >>>
> >>>   - All writers write all messages
> >>>   - index is named "foo", different from the sensor for all writers
> >>>   - batchSize is 100 for all writers
> >>>
> >>> *Writer-specific case without filters*
> >>> {
> >>>  "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>>   {
> >>>  "elasticsearch" : {
> >>>   "batchSize" : 100
> >>> }
> >>>   }
> >>> }
> >>>
> >>>   - All writers write all messages
> >>>   - index is named "foo", different from the sensor for all writers
> >>>   - batchSize is 1 for HDFS and 100 for elasticsearch writers
> >>>   - NOTE: I could override the index name too
> >>>
> >>> *Writer-specific case with filters*
> >>> {
> >>>  "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>>   {
> >>>  "elasticsearch" : {
> >>>   "batchSize" : 100,
> >>>   "when" : "exists(field1)"
> >>> },
> >>>  "hdfs" : {
> >>> "when" : "false"
> >>>  }
> >>>   }
> >>> }
> >>>
> >>>   - ES writer writes messages which have field1, HDFS doesn't
> >>>   - index is named "foo", different from the sensor for all writers
> >>>   - 100 for elasticsearch writers
> >>>
> >>> Thoughts?
> >>>
> >>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> >>> wrote:
> >>>
>  For larger installations you need to control what is indexed so you
> >> don’t
>  end up with a nasty elastic search situation and so you can mine the
> >> data
>  later for reports and training ml models.
> 
>  Thanks

Re: [DISCUSS] Hosting Kraken maven artifacts in incubator-metron git repo

2017-01-13 Thread Billie Rinaldi
No, we can't host artifacts in a git repo, or on a website. It would be
like distributing a release that hasn't been voted upon.

Regarding message threading, in Gmail adding a [tag] to the subject does
not create a new thread. So the change is not visible in my mailbox unless
the rest of the subject is changed as well.

On Mon, Jan 9, 2017 at 1:00 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> This is a question primarily for the mentors.
>
> *Background*
> metron-common currently depends on the openSOC github repo for hosting
> kraken artifacts. The original reason for this was that these jars are not
> hosted in Maven Central, and they were not reliably available in the Kraken
> repo. https://issues.apache.org/jira/browse/METRON-650 is tracking work
> around copying these artifacts to the Metron repo.
>
> Kraken source on openSOC - https://github.com/OpenSOC/kraken
> Kraken maven repo on openSOC -
> https://github.com/OpenSOC/kraken/tree/mvn-repo
>
> *Ask*
> Create a new branch in incubator-metron to host any necessary maven
> artifacts. This branch would simply be incubator-metron/mvn-repo. This is
> similar to how we've hosted the asf-site.
>
> *Concerns/Questions*
>
>1. Can we host these jars/artifacts in this manner?
>2. Concerns regarding licensing?
>3. Do we need to also grab and host the source code?
>