Re: Metron Alert UI and zero-down time Elasticsearch re-index

2018-01-03 Thread James Sirota
Hi Ali, I am not sure I understand what you are trying to do.  Are you trying 
to change the name on the old index, add it to the alias, and then re-index and 
give the new index the name of the old index? 

01.01.2018, 22:30, "Ali Nazemian" :
> Hi All,
>
> We are using an older version of Metron Alert-UI (Received in Oct 2017)
> which sends search queries to ES directly without using Metron Rest API. We
> wanted to run a zero-downtime ES reindex process by using ES aliasing.
> However, I am not sure how it will impact the search part of Alert-UI
> because we need to change it to refer to the alias instead of the old index
> name. Please advise how it can be covered in the older version of Metron
> Alert-UI.
>
> Regards,
> Ali

--- 
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org
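[Editor's note] The zero-downtime pattern under discussion relies on Elasticsearch's `_aliases` API: searches go through an alias, the new index is populated, and the alias is moved from the old index to the new one in a single atomic request. A minimal sketch of building that request body (the alias and index names here are hypothetical, and the HTTP call itself is omitted):

```java
// Sketch of the atomic alias swap used for zero-downtime reindexing.
// "metron_alerts" is a hypothetical alias; "alerts_v1"/"alerts_v2" are
// hypothetical index names. The resulting JSON body would be POSTed to
// the Elasticsearch /_aliases endpoint.
public class AliasSwap {

  // Builds the _aliases request body that removes the alias from the old
  // index and adds it to the new one in a single atomic action list.
  public static String swapBody(String alias, String oldIndex, String newIndex) {
    return String.format(
        "{\"actions\":["
            + "{\"remove\":{\"index\":\"%s\",\"alias\":\"%s\"}},"
            + "{\"add\":{\"index\":\"%s\",\"alias\":\"%s\"}}"
            + "]}",
        oldIndex, alias, newIndex, alias);
  }

  public static void main(String[] args) {
    System.out.println(swapBody("metron_alerts", "alerts_v1", "alerts_v2"));
  }
}
```

Because both actions execute atomically, queries against the alias never see an empty or partial index; the catch Ali raises is that the Alerts UI must query the alias name rather than the concrete index name.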


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread James Sirota
I just went through these pull requests as well and also agree this is good 
work.  I think it's a good first pass.  I would be careful with trying to boil 
the ocean here.  I think for the initial use case I would only support loading 
the bloom filters from HDFS.  If people want to pre-process the CSV file of 
domains using awk or sed this should be out of scope of this work.  It's easy 
enough to do out of band and I would not include any of these functions at all. 
I also think that the config could be considerably simplified.  I think 
value_filter should be removed (since I believe that preprocessing should be 
done by the user outside of this process).  I also have a question about the 
init, update, and merge configurations.  Would I ever initialize to anything 
but an empty bloom filter?  For the state update would I ever do anything other 
than add to the bloom filter?  For the state merge would I ever do anything 
other than merge the states?  If the answer to these is 'no', then this should 
simply be hard coded and not externalized into config values. 
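[Editor's note] James's point about hard-coding can be illustrated with a toy bloom filter: the init, update, and merge steps each have exactly one sensible behavior, so exposing them as config buys nothing. A self-contained sketch (this is illustrative only, not Metron's actual implementation):

```java
import java.util.BitSet;

// Toy bloom filter illustrating why init/update/merge need no configuration:
// init is always "empty", update is always "add", merge is always a union.
public class ToyBloomFilter {
  private final BitSet bits;
  private final int size;
  private final int hashes;

  // init: always start from an empty bit set.
  public ToyBloomFilter(int size, int hashes) {
    this.bits = new BitSet(size);
    this.size = size;
    this.hashes = hashes;
  }

  // update: always just add the item.
  public void add(String item) {
    for (int i = 0; i < hashes; i++) {
      bits.set(index(item, i));
    }
  }

  // May return false positives, never false negatives.
  public boolean mightContain(String item) {
    for (int i = 0; i < hashes; i++) {
      if (!bits.get(index(item, i))) {
        return false;
      }
    }
    return true;
  }

  // merge: always the union (bitwise OR) of the two filters' bit sets.
  // Assumes both filters share the same size and hash count.
  public void merge(ToyBloomFilter other) {
    bits.or(other.bits);
  }

  // Simple double hashing derived from the String hash code.
  private int index(String item, int i) {
    int h = item.hashCode();
    int combined = h + i * ((h >>> 16) | 1);
    return Math.floorMod(combined, size);
  }
}
```

With behavior fixed like this, the only values worth externalizing are sizing parameters (expected insertions, false-positive rate), which is consistent with keeping init/update/merge out of the config entirely.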

03.01.2018, 14:20, "Michael Miklavcic" :
> I just finished stepping through the typosquatting use case README in your
> merge branch. This is really, really good work Casey. I see most of our
> previous documentation issues addressed up front, e.g. special variables
> are cited, all new fields explained, side effects documented. The use case
> doc brings it all together soup-to-nuts and I think all the pieces make
> sense in a mostly self-contained way. I can't think of anything I had to
> sit and think about for more than a few seconds. I'll be making my way
> through your individual PR's in more detail, but my first impressions are
> that this is excellent.
>
> On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>>  I'm liking this design and growth strategy, Casey. I also think Nick and
>>  Otto have some valid points. I always find there's a natural tension
>>  between too little, just enough, and boiling the ocean and these discuss
>>  threads really help drive what the short and long term visions should look
>>  like.
>>
>>  On the subject of repositories and strategies, I agree that pluggable
>>  repos and strategies for modifying them would be useful. For the first
>>  pass, I'd really like to see HDFS with the proposed set of Stellar
>>  functions. This gives us a lot of bang for our buck - we can capitalize on
>>  a set of powerful features around existence checking earlier without having
>>  to worry about later interface changes impacting users. With the primary
>>  interface coming through the JSON config, we are building a nice facade
>>  that protects users from later implementation abstractions and
>>  improvements, all while providing a stable enough interface on which we can
>>  develop UI features as desired. I'd be interested to hear more about what
>>  features could be provided by a repository as time goes by. Federation,
>>  permissions, governance, metadata management, perhaps?
>>
>>  I also had some concern over duplicating existing Unix features. I think
>>  where I'm at has been largely addressed by Casey's comments on 1) scaling,
>>  2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches
>>  - 1 which is config-based and the other a composable set of functions gives
>>  us the ability to provide a core set of features that can later be easily
>>  expanded by users as the need arises. Here again I think the prescribed
>>  approach provides a strong first pass that we can then expand on without
>>  concern of future improvements becoming a hassle for end users.
>>
>>  Best,
>>  Mike
>>
>>  On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
>>  si...@simonellistonball.com> wrote:
>>
>>>  There is some really cool stuff happening here, if only I’d been allowed
>>>  to see the lists over Christmas... :)
>>>
>>>  A few thoughts...
>>>
>>>  I like Otto’s generalisation of the problem to include specific local
>>>  stellar objects in a cache loaded from a store (HDFS seems a natural, but
>>>  not only place, maybe even a web service / local microservicey object
>>>  provider!?) That said, I suspect that’s a good platform optimisation
>>>  approach. Should we look at this as a separate piece of work given it
>>>  extends beyond the scope of the summarisation concept and ultimately use it
>>>  as a back-end to feed the summarising engine proposed here for the
>>>  enrichment loader?
>>>
>>>  On the more specific use case, one thing I would comment on is the
>>>  configuration approach. The iteration loop (state_{init|update|merge})
>>>  should be consistent with the way we handle things like the profiler
>>>  config, since it’s the same approach to data handling.
>>>
>>>  The other thing that seems to have crept in here is the interface to
>>>  something like Spark, which again, I am really very very keen on seeing
>>>  happen. 

[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/884
  
@merrimanr  Fixed that bug and was able to add a good amount of unit tests 
around it.  Also figured out a way to unit test the Aesh-driven StellarShell 
class.


---


[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/884
  
The problem is on the UI side of things; for both Zeppelin and the CLI.  
When I get the result back from Stellar, I was using 
`ConversionUtils.convert(value, String.class)` to get me a result that I can 
display.  `ConversionUtils` just gives you the first item back if you ask it to 
convert a list to a String.  Oops.
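[Editor's note] A hedged sketch of the kind of fix involved (the class and method names here are illustrative, not the actual PR change): rather than routing every result through a scalar type converter, display logic can special-case collections and fall back to `String.valueOf`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative fix for the display bug: a list result must be rendered
// element by element instead of being handed to a scalar type converter
// (which returned only the first element).
public class ResultRenderer {

  public static String render(Object value) {
    if (value instanceof List) {
      List<?> list = (List<?>) value;
      return list.stream()
          .map(ResultRenderer::render)   // render elements recursively
          .collect(Collectors.joining(", ", "[", "]"));
    }
    return String.valueOf(value);        // scalars, nulls, everything else
  }
}
```

With this approach, `MAP([ 'foo', 'bar'], (x) -> TO_UPPER(x))` would display as `[FOO, BAR]` rather than just the first element.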


---


[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/884
  
> @merrimanr: From what I can tell the problem is only in how the result is 
displayed. For example, the expression 'BAR' in MAP([ 'foo', 'bar'], (x) -> 
TO_UPPER(x) ) returns true as expected.

BUG!  Good find.  The value getting returned is always the first.  I'll fix 
that.




---


[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/884
  
> @merrimanr: Is there a way to pass in the zookeeper url in Zeppelin?

No, I did not implement that in this PR.  As a next step I was going to do 
whatever needs done to get the management functions working in Zeppelin.  That 
would include adding a Zookeeper URL.





---


[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/884
  
Thanks for digging into it @merrimanr .  You uncovered some good stuff.

> I found that the functions available in Zeppelin are a subset of what's 
available in the Stellar shell. The missing functions include IS_EMAIL, 
ENRICHMENT*, GEO*, STATS* and many others. Is this expected?

Yes, since we only added `stellar-common` to the interpreter, only the 
functions defined in that library are available.

I just updated the README to clarify this point.  I also added instructions 
for adding additional libraries to gain access to more Stellar functions.



---


[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/metron/pull/884
  
I also noticed that functions returning a list only display the first item 
in both the shell and Zeppelin output.

For example, the expression `MAP([ 'foo', 'bar'], (x) -> TO_UPPER(x) )` 
returns `FOO` when I would expect it to return `['FOO', 'BAR']`.  

From what I can tell the problem is only in how the result is displayed.  
For example, the expression `'BAR' in MAP([ 'foo', 'bar'], (x) -> TO_UPPER(x) 
)` returns `true` as expected.


---


[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/metron/pull/884
  
I think this is an excellent start.  So far I have only reviewed it from a 
user perspective, and it's working well.  

I've spun it up in full dev (not sure that's even necessary) and installed 
this via the instructions in the README.  I was able to run most of the 
examples in the Stellar README.

I found that the functions available in Zeppelin are a subset of what's 
available in the Stellar shell.  The missing functions include IS_EMAIL, 
ENRICHMENT*, GEO*, STATS* and many others.  Is this expected?

Is there a way to pass in the zookeeper url in Zeppelin?



---


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Michael Miklavcic
I just finished stepping through the typosquatting use case README in your
merge branch. This is really, really good work Casey. I see most of our
previous documentation issues addressed up front, e.g. special variables
are cited, all new fields explained, side effects documented. The use case
doc brings it all together soup-to-nuts and I think all the pieces make
sense in a mostly self-contained way. I can't think of anything I had to
sit and think about for more than a few seconds. I'll be making my way
through your individual PR's in more detail, but my first impressions are
that this is excellent.

On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> I'm liking this design and growth strategy, Casey. I also think Nick and
> Otto have some valid points. I always find there's a natural tension
> between too little, just enough, and boiling the ocean and these discuss
> threads really help drive what the short and long term visions should look
> like.
>
> On the subject of repositories and strategies, I agree that pluggable
> repos and strategies for modifying them would be useful. For the first
> pass, I'd really like to see HDFS with the proposed set of Stellar
> functions. This gives us a lot of bang for our buck - we can capitalize on
> a set of powerful features around existence checking earlier without having
> to worry about later interface changes impacting users. With the primary
> interface coming through the JSON config, we are building a nice facade
> that protects users from later implementation abstractions and
> improvements, all while providing a stable enough interface on which we can
> develop UI features as desired. I'd be interested to hear more about what
> features could be provided by a repository as time goes by. Federation,
> permissions, governance, metadata management, perhaps?
>
> I also had some concern over duplicating existing Unix features. I think
> where I'm at has been largely addressed by Casey's comments on 1) scaling,
> 2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches
> - 1 which is config-based and the other a composable set of functions gives
> us the ability to provide a core set of features that can later be easily
> expanded by users as the need arises. Here again I think the prescribed
> approach provides a strong first pass that we can then expand on without
> concern of future improvements becoming a hassle for end users.
>
> Best,
> Mike
>
> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
>> There is some really cool stuff happening here, if only I’d been allowed
>> to see the lists over Christmas... :)
>>
>> A few thoughts...
>>
>> I like Otto’s generalisation of the problem to include specific local
>> stellar objects in a cache loaded from a store (HDFS seems a natural, but
>> not only place, maybe even a web service / local microservicey object
>> provider!?) That said, I suspect that’s a good platform optimisation
>> approach. Should we look at this as a separate piece of work given it
>> extends beyond the scope of the summarisation concept and ultimately use it
>> as a back-end to feed the summarising engine proposed here for the
>> enrichment loader?
>>
>> On the more specific use case, one thing I would comment on is the
>> configuration approach. The iteration loop (state_{init|update|merge})
>> should be consistent with the way we handle things like the profiler
>> config, since it’s the same approach to data handling.
>>
>> The other thing that seems to have crept in here is the interface to
>> something like Spark, which again, I am really very very keen on seeing
>> happen. That said, not sure how that would happen in this context, unless
>> you’re talking about pushing to something like livy for example (eminently
>> sensible for things like cross instance caching and faster RPC-ish access
>> to an existing spark context which seem to be what Casey is driving at with
>> the spark piece).
>>
>> To address the question of text manipulation in Stellar / metron
>> enrichment ingest etc, we already have this outside of the context of the
>> issues here. I would argue that yes, we don’t want too many paths for this,
>> and that maybe our parser approach might be heavily related to text-based
>> ingest. I would say the scope worth dealing with here though is not really
>> text manipulation, but summarisation, which is not well served by existing
>> CLI tools like awk / sed and friends.
>>
>> Simon
>>
>> > On 3 Jan 2018, at 15:48, Nick Allen  wrote:
>> >
>> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
>> >> this will impact performance
>> >
>> > What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
>> > seems really high, unless I am not understanding something.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella 
>> wrote:

[GitHub] metron pull request #887: METRON-1388 update public web site to point at 0.4...

2018-01-03 Thread mattf-horton
GitHub user mattf-horton opened a pull request:

https://github.com/apache/metron/pull/887

METRON-1388 update public web site to point at 0.4.2 new release

## Contributor Comments
Update the public web site to point at new release 0.4.2 (to be pushed 
simultaneously with the announcement of the release, currently propagating from 
https://dist.apache.org/repos/dist/release/metron/0.4.2/ )


## Pull Request Checklist

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
- [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?

### For code changes:  N/A

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in 
which it is rendered by building and verifying the site-book? 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mattf-horton/metron METRON-1388

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/metron/pull/887.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #887


commit d012f45a6f335407c29155a8823b3d249ec30c62
Author: mattf-horton 
Date:   2017-12-19T10:32:59Z

METRON-1373 RAT failure for metron-interface/metron-alerts

commit fe7df464cd851584c886fe61ea099cf17563279f
Author: mattf-horton 
Date:   2018-01-03T20:05:56Z

METRON-1388 update public web site to point at 0.4.2 new release




---


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Michael Miklavcic
I'm liking this design and growth strategy, Casey. I also think Nick and
Otto have some valid points. I always find there's a natural tension
between too little, just enough, and boiling the ocean and these discuss
threads really help drive what the short and long term visions should look
like.

On the subject of repositories and strategies, I agree that pluggable repos
and strategies for modifying them would be useful. For the first pass, I'd
really like to see HDFS with the proposed set of Stellar functions. This
gives us a lot of bang for our buck - we can capitalize on a set of
powerful features around existence checking earlier without having to worry
about later interface changes impacting users. With the primary interface
coming through the JSON config, we are building a nice facade that protects
users from later implementation abstractions and improvements, all while
providing a stable enough interface on which we can develop UI features as
desired. I'd be interested to hear more about what features could be
provided by a repository as time goes by. Federation, permissions,
governance, metadata management, perhaps?

I also had some concern over duplicating existing Unix features. I think
where I'm at has been largely addressed by Casey's comments on 1) scaling,
2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches
- 1 which is config-based and the other a composable set of functions gives
us the ability to provide a core set of features that can later be easily
expanded by users as the need arises. Here again I think the prescribed
approach provides a strong first pass that we can then expand on without
concern of future improvements becoming a hassle for end users.

Best,
Mike

On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> There is some really cool stuff happening here, if only I’d been allowed
> to see the lists over Christmas... :)
>
> A few thoughts...
>
> I like Otto’s generalisation of the problem to include specific local
> stellar objects in a cache loaded from a store (HDFS seems a natural, but
> not only place, maybe even a web service / local microservicey object
> provider!?) That said, I suspect that’s a good platform optimisation
> approach. Should we look at this as a separate piece of work given it
> extends beyond the scope of the summarisation concept and ultimately use it
> as a back-end to feed the summarising engine proposed here for the
> enrichment loader?
>
> On the more specific use case, one thing I would comment on is the
> configuration approach. The iteration loop (state_{init|update|merge})
> should be consistent with the way we handle things like the profiler
> config, since it’s the same approach to data handling.
>
> The other thing that seems to have crept in here is the interface to
> something like Spark, which again, I am really very very keen on seeing
> happen. That said, not sure how that would happen in this context, unless
> you’re talking about pushing to something like livy for example (eminently
> sensible for things like cross instance caching and faster RPC-ish access
> to an existing spark context which seem to be what Casey is driving at with
> the spark piece).
>
> To address the question of text manipulation in Stellar / metron
> enrichment ingest etc, we already have this outside of the context of the
> issues here. I would argue that yes, we don’t want too many paths for this,
> and that maybe our parser approach might be heavily related to text-based
> ingest. I would say the scope worth dealing with here though is not really
> text manipulation, but summarisation, which is not well served by existing
> CLI tools like awk / sed and friends.
>
> Simon
>
> > On 3 Jan 2018, at 15:48, Nick Allen  wrote:
> >
> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
> >> this will impact performance
> >
> > What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> > seems really high, unless I am not understanding something.
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella 
> wrote:
> >
> >> Thanks for the feedback, Nick.
> >>
> >> Regarding "IMHO, I'd rather not reinvent the wheel for text
> manipulation."
> >>
> >> I would argue that we are not reinventing the wheel for text
> manipulation
> >> as the extractor config exists already and we are doing a similar thing
> in
> >> the flatfile loader (in fact, the code is reused and merely extended).
> >> Transformation operations are already supported in our codebase in the
> >> extractor config, this PR has just added some hooks for stateful
> >> operations.
> >>
> >> Furthermore, we will need a configuration object to pass to the REST
> call
> >> if we are ever to create a UI around importing data into hbase or
> creating
> >> these summary objects.
> >>
> >> Regarding your example:
> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed 

[GitHub] metron issue #873: METRON-1367 Stellar should have some instrumentation of f...

2018-01-03 Thread ottobackwards
Github user ottobackwards commented on the issue:

https://github.com/apache/metron/pull/873
  
This has been submitted to Commons Lang : 
https://github.com/apache/commons-lang/pull/311
The version there is refactored and also Java 7 compatible (so I had to 
lose Optional and lambdas).

I will port over the changes later, don't let it stop your review ;)



---


[GitHub] metron issue #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread ottobackwards
Github user ottobackwards commented on the issue:

https://github.com/apache/metron/pull/884
  
This is quite a bit to go over. Can I trade you a similarly large review 
task?


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159496039
  
--- Diff: metron-stellar/stellar-common/README.md ---
@@ -1377,7 +1377,7 @@ IS_EMAIL
 To run the Stellar Shell directly from the Metron source code, run a 
command like the following.  Ensure that Metron has already been built and 
installed with `mvn clean install -DskipTests`.
 ```
 $ mvn exec:java \
-   -Dexec.mainClass="org.apache.metron.stellar.common.shell.StellarShell" \
+   -Dexec.mainClass="org.apache.metron.stellar.common.shell.cli.StellarShell" \
--- End diff --

`StellarShell`, the main driver class for the CLI-based REPL, was moved to 
its own package since it is only used by the CLI-based REPL.   This separates 
it from the other core classes that are used by **both** the Zeppelin and 
CLI-based REPLs.


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159496645
  
--- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/shell/DefaultStellarShellExecutor.java ---
@@ -0,0 +1,398 @@
+/*
+ *
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+package org.apache.metron.stellar.common.shell;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.google.common.collect.Maps;
+import org.apache.commons.collections.map.UnmodifiableMap;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.curator.framework.CuratorFramework;
+import org.apache.metron.stellar.common.StellarProcessor;
+import org.apache.metron.stellar.common.configuration.ConfigurationsUtils;
+import org.apache.metron.stellar.common.shell.StellarExecutionListeners.FunctionDefinedListener;
+import org.apache.metron.stellar.common.shell.StellarExecutionListeners.SpecialDefinedListener;
+import org.apache.metron.stellar.common.shell.StellarExecutionListeners.VariableDefinedListener;
+import org.apache.metron.stellar.common.shell.specials.AssignmentCommand;
+import org.apache.metron.stellar.common.shell.specials.Comment;
+import org.apache.metron.stellar.common.shell.specials.DocCommand;
+import org.apache.metron.stellar.common.shell.specials.MagicDefineGlobal;
+import org.apache.metron.stellar.common.shell.specials.MagicListFunctions;
+import org.apache.metron.stellar.common.shell.specials.MagicListGlobals;
+import org.apache.metron.stellar.common.shell.specials.MagicListVariables;
+import org.apache.metron.stellar.common.shell.specials.MagicUndefineGlobal;
+import org.apache.metron.stellar.common.shell.specials.QuitCommand;
+import org.apache.metron.stellar.common.shell.specials.SpecialCommand;
+import org.apache.metron.stellar.common.utils.JSONUtils;
+import org.apache.metron.stellar.dsl.Context;
+import org.apache.metron.stellar.dsl.MapVariableResolver;
+import org.apache.metron.stellar.dsl.StellarFunctionInfo;
+import org.apache.metron.stellar.dsl.StellarFunctions;
+import org.apache.metron.stellar.dsl.VariableResolver;
+import org.apache.metron.stellar.dsl.functions.resolver.FunctionResolver;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.ByteArrayInputStream;
+import java.lang.invoke.MethodHandles;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.Properties;
+
+import static org.apache.metron.stellar.common.configuration.ConfigurationsUtils.readGlobalConfigBytesFromZookeeper;
+import static org.apache.metron.stellar.common.shell.StellarShellResult.noop;
+import static org.apache.metron.stellar.common.shell.StellarShellResult.error;
+import static org.apache.metron.stellar.common.shell.StellarShellResult.success;
+import static org.apache.metron.stellar.dsl.Context.Capabilities.GLOBAL_CONFIG;
+import static org.apache.metron.stellar.dsl.Context.Capabilities.STELLAR_CONFIG;
+import static org.apache.metron.stellar.dsl.Context.Capabilities.ZOOKEEPER_CLIENT;
+
+/**
+ * Default implementation of a StellarShellExecutor.
+ */
+public class DefaultStellarShellExecutor implements StellarShellExecutor {
+
+  private static final Logger LOG = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+  public static final String SHELL_VARIABLES = "shellVariables";
+
+  /**
+   * The variables known by Stellar.
+   */
+  private Map<String, Object> variables;
+
+  /**
+   * The function resolver.
+   */
+  private FunctionResolver functionResolver;
+
+  /**
+   * A Zookeeper client. Only defined if given a valid Zookeeper URL.
+   */
+  private Optional<CuratorFramework> zkClient;
+
+  /**
+   * A registry of all special 

[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159498766
  
--- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/shell/specials/DocCommand.java ---
@@ -0,0 +1,105 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+package org.apache.metron.stellar.common.shell.specials;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.metron.stellar.common.shell.StellarShellExecutor;
+import org.apache.metron.stellar.common.shell.StellarResult;
+import org.apache.metron.stellar.dsl.StellarFunctionInfo;
+
+import java.util.Optional;
+import java.util.Spliterator;
+import java.util.function.Function;
+import java.util.stream.StreamSupport;
+
+import static org.apache.metron.stellar.common.shell.StellarResult.error;
+import static org.apache.metron.stellar.common.shell.StellarResult.success;
+
+/**
+ * A special command that allows a user to request doc string
+ * about a Stellar function.
+ *
+ * For example `?TO_STRING` will output the docs for the function 
`TO_STRING`
+ */
+public class DocCommand implements SpecialCommand {
--- End diff --

Doc strings are not part of the core language.


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159499691
  
--- Diff: metron-stellar/stellar-zeppelin/src/main/java/org/apache/metron/stellar/zeppelin/StellarInterpreter.java ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.metron.stellar.zeppelin;
+
+import org.apache.commons.lang3.exception.ExceptionUtils;
+import org.apache.metron.stellar.common.shell.DefaultStellarAutoCompleter;
+import org.apache.metron.stellar.common.shell.DefaultStellarShellExecutor;
+import org.apache.metron.stellar.common.shell.StellarAutoCompleter;
+import org.apache.metron.stellar.common.shell.StellarShellExecutor;
+import org.apache.metron.stellar.common.shell.StellarResult;
+import org.apache.metron.stellar.common.utils.ConversionUtils;
+import org.apache.zeppelin.interpreter.Interpreter;
+import org.apache.zeppelin.interpreter.InterpreterContext;
+import org.apache.zeppelin.interpreter.InterpreterResult;
+import org.apache.zeppelin.interpreter.thrift.InterpreterCompletion;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.lang.invoke.MethodHandles;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Optional;
+import java.util.Properties;
+
+import static org.apache.zeppelin.interpreter.InterpreterResult.Code.ERROR;
+import static 
org.apache.zeppelin.interpreter.InterpreterResult.Code.SUCCESS;
+import static org.apache.zeppelin.interpreter.InterpreterResult.Type.TEXT;
+
+/**
+ * A Zeppelin Interpreter for Stellar.
+ */
+public class StellarInterpreter extends Interpreter {
--- End diff --

This is what allows us to run Stellar in a Notebook.


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159497380
  
--- Diff: 
metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/shell/StellarExecutionNotifier.java
 ---
@@ -0,0 +1,44 @@
+/*
+ *
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+package org.apache.metron.stellar.common.shell;
+
+/**
+ * Notifies listeners when events occur during the execution of Stellar 
expressions.
+ */
+public interface StellarExecutionNotifier {
+
+  /**
+   * Add a listener that will be notified when a magic command is defined.
+   * @param listener The listener to notify.
+   */
+  void addSpecialListener(StellarExecutionListeners.SpecialDefinedListener 
listener);
--- End diff --

Our `DefaultStellarShellExecutor` is a `StellarExecutionNotifier` as it is 
able to notify event listeners when variables, functions or specials are 
defined.


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159498954
  
--- Diff: 
metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/shell/specials/QuitCommand.java
 ---
@@ -0,0 +1,51 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+package org.apache.metron.stellar.common.shell.specials;
+
+import org.apache.metron.stellar.common.shell.StellarShellExecutor;
+import org.apache.metron.stellar.common.shell.StellarResult;
+
+import java.util.function.Function;
+
+import static 
org.apache.metron.stellar.common.shell.StellarResult.terminate;
+
+/**
+ * A special command that allows the user to 'quit' their REPL session.
+ *
+ *quit
+ */
+public class QuitCommand implements SpecialCommand {
--- End diff --

This is what allows a user to execute `quit` within the REPL.  Again, not 
part of core Stellar.
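The split between REPL-only "specials" and the core language can be sketched with a tiny dispatch loop. This is a self-contained illustration, not Metron's actual `SpecialCommand` API: the class and method names here are invented for the sketch, and real specials return a `StellarResult` rather than a `String`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: REPL-only "special" commands tried before the core language.
public class SpecialCommandSketch {

    // Each special matches an input prefix and handles the raw input itself.
    static final Map<String, Function<String, String>> SPECIALS = new LinkedHashMap<>();
    static {
        SPECIALS.put("quit", in -> "TERMINATE");                   // like QuitCommand
        SPECIALS.put("?", in -> "DOC for " + in.substring(1));     // like DocCommand
    }

    // Try specials first; anything else falls through to the expression engine.
    static String execute(String input) {
        for (Map.Entry<String, Function<String, String>> e : SPECIALS.entrySet()) {
            if (input.startsWith(e.getKey())) {
                return e.getValue().apply(input);
            }
        }
        return "EVAL " + input; // would be handed to the Stellar engine proper
    }

    public static void main(String[] args) {
        System.out.println(execute("?TO_STRING"));
        System.out.println(execute("quit"));
        System.out.println(execute("1 + 1"));
    }
}
```

Because the specials live in front of the engine, `quit` and `?TO_STRING` work in any REPL front-end without becoming part of core Stellar.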


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159498085
  
--- Diff: 
metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/shell/StellarResult.java
 ---
@@ -0,0 +1,185 @@
+/*
+ *
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+package org.apache.metron.stellar.common.shell;
+
+import java.util.Optional;
+
+/**
+ * The result of executing a Stellar expression within a 
StellarShellExecutor.
+ */
+public class StellarResult {
--- End diff --

Instead of the Stellar executor just returning an Object like before (which 
doesn't tell me much about whether the operation was successful or not) I 
needed a more descriptive result like this.  

This is how the CLI and Zeppelin REPLs determine whether the Stellar 
expression was executed successfully or not.  They need to perform different 
actions based on success, failure or even if there is a termination request.
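The shape of such a descriptive result, and how a front-end branches on it, can be sketched as follows. The names below are illustrative stand-ins, not the real `StellarResult` API; the branching mirrors what `StellarInterpreter.interpret(...)` does with `isSuccess()`/`isError()` in the diff.

```java
import java.util.Optional;

// Sketch: a result that says HOW execution ended, not just what it produced.
public class ResultSketch {

    enum Status { SUCCESS, ERROR, TERMINATE }

    final Status status;
    final Optional<Object> value;

    ResultSketch(Status status, Object value) {
        this.status = status;
        this.value = Optional.ofNullable(value);
    }

    static ResultSketch success(Object v) { return new ResultSketch(Status.SUCCESS, v); }
    static ResultSketch error(String msg) { return new ResultSketch(Status.ERROR, msg); }
    static ResultSketch terminate()       { return new ResultSketch(Status.TERMINATE, null); }

    // A REPL front-end (CLI or Zeppelin) branches on the status.
    static String render(ResultSketch r) {
        switch (r.status) {
            case SUCCESS: return String.valueOf(r.value.orElse("")); // blank when no value
            case ERROR:   return "ERROR: " + r.value.orElse("unknown");
            default:      return "bye";                              // termination request
        }
    }

    public static void main(String[] args) {
        System.out.println(render(ResultSketch.success(2)));
        System.out.println(render(ResultSketch.error("bad expression")));
        System.out.println(render(ResultSketch.terminate()));
    }
}
```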


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159495486
  
--- Diff: 
metron-platform/metron-management/src/main/java/org/apache/metron/management/ShellFunctions.java
 ---
@@ -90,19 +95,19 @@ public Object apply(List args) {
 @Override
 public Object apply(List args, Context context) throws 
ParseException {
 
-  Map variables = (Map) 
context.getCapability(StellarExecutor.SHELL_VARIABLES).get();
+  Map variables = getVariables(context);
   String[] headers = {"VARIABLE", "VALUE", "EXPRESSION"};
   String[][] data = new String[variables.size()][3];
   int wordWrap = -1;
   if(args.size() > 0) {
 wordWrap = ConversionUtils.convert(args.get(0), Integer.class);
   }
   int i = 0;
-  for(Map.Entry kv : 
variables.entrySet()) {
-StellarExecutor.VariableResult result = kv.getValue();
+  for(Map.Entry kv : variables.entrySet()) {
+VariableResult result = kv.getValue();
 data[i++] = new String[] { toWrappedString(kv.getKey().toString(), 
wordWrap)
  , toWrappedString(result.getResult(), 
wordWrap)
- , toWrappedString(result.getExpression(), 
wordWrap)
+ , 
toWrappedString(result.getExpression().get(), wordWrap)
--- End diff --

The VariableResult.expression field is now optional.  We do not always 
know the expression that resulted in a value.  The ShellFunctions just needed 
to be updated to treat this as an Optional.


---


[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r15947
  
--- Diff: 
metron-stellar/stellar-zeppelin/src/main/java/org/apache/metron/stellar/zeppelin/StellarInterpreter.java
 ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.metron.stellar.zeppelin;
+
+import org.apache.commons.lang3.exception.ExceptionUtils;
+import org.apache.metron.stellar.common.shell.DefaultStellarAutoCompleter;
+import org.apache.metron.stellar.common.shell.DefaultStellarShellExecutor;
+import org.apache.metron.stellar.common.shell.StellarAutoCompleter;
+import org.apache.metron.stellar.common.shell.StellarShellExecutor;
+import org.apache.metron.stellar.common.shell.StellarResult;
+import org.apache.metron.stellar.common.utils.ConversionUtils;
+import org.apache.zeppelin.interpreter.Interpreter;
+import org.apache.zeppelin.interpreter.InterpreterContext;
+import org.apache.zeppelin.interpreter.InterpreterResult;
+import org.apache.zeppelin.interpreter.thrift.InterpreterCompletion;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.lang.invoke.MethodHandles;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Optional;
+import java.util.Properties;
+
+import static org.apache.zeppelin.interpreter.InterpreterResult.Code.ERROR;
+import static 
org.apache.zeppelin.interpreter.InterpreterResult.Code.SUCCESS;
+import static org.apache.zeppelin.interpreter.InterpreterResult.Type.TEXT;
+
+/**
+ * A Zeppelin Interpreter for Stellar.
+ */
+public class StellarInterpreter extends Interpreter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
+  /**
+   * Executes the Stellar expressions.
+   *
+   * Zeppelin will handle isolation and how the same executor is or is not 
used across
+   * multiple notebooks.  This is configurable by the user.
+   *
+   * See 
https://zeppelin.apache.org/docs/latest/manual/interpreters.html#interpreter-binding-mode.
+   */
+  private StellarShellExecutor executor;
+
+  /**
+   * Handles auto-completion for Stellar expressions.
+   */
+  private StellarAutoCompleter autoCompleter;
+
+  public StellarInterpreter(Properties properties) {
+super(properties);
+this.autoCompleter = new DefaultStellarAutoCompleter();
+  }
+
+  public void open() {
+try {
+  executor = createExecutor();
+
+} catch (Exception e) {
+  LOG.error("Unable to create a StellarShellExecutor", e);
+}
+  }
+
+  public void close() {
+// nothing to do
+  }
+
+  public StellarShellExecutor createExecutor() throws Exception {
+
+Properties props = getProperty();
+StellarShellExecutor executor = new DefaultStellarShellExecutor(props, 
Optional.empty());
+
+// register the auto-completer to be notified
+executor.addSpecialListener((magic) -> 
autoCompleter.addCandidateFunction(magic.getCommand()));
+executor.addFunctionListener((fn) -> 
autoCompleter.addCandidateFunction(fn.getName()));
+executor.addVariableListener((name, val) -> 
autoCompleter.addCandidateVariable(name));
+
+executor.init();
+return executor;
+  }
+
+  public InterpreterResult interpret(String input, InterpreterContext 
context) {
+InterpreterResult result;
+
+try {
+  // execute the input
+  StellarResult stellarResult = executor.execute(input);
+
+  if(stellarResult.isSuccess()) {
+// on success - if no result, use a blank value
+Object value = stellarResult.getValue().orElse("");
+String text = ConversionUtils.convert(value, String.class);
+result = new InterpreterResult(SUCCESS, TEXT, text);
+
+  } else if(stellarResult.isError()) {
+   

[GitHub] metron pull request #884: METRON-1382 Run Stellar in a Zeppelin Notebook

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/884#discussion_r159497124
  
--- Diff: 
metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/shell/StellarExecutionListeners.java
 ---
@@ -0,0 +1,51 @@
+/*
+ *
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+package org.apache.metron.stellar.common.shell;
+
+import org.apache.metron.stellar.common.shell.specials.SpecialCommand;
+import org.apache.metron.stellar.dsl.StellarFunctionInfo;
+
+/**
+ * A listener will be notified about events that occur during the
+ * execution of Stellar expressions.
+ */
+public class StellarExecutionListeners {
+
+  /**
+   * A listener that is notified when a function is defined.
+   */
+  public interface FunctionDefinedListener {
+void whenFunctionDefined(StellarFunctionInfo functionInfo);
+  }
--- End diff --

I used an event listener pattern so that external entities could get 
notified when things occur during the execution of Stellar.  For example, the 
auto-completer needs to know when a variable is defined.
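The wiring described above can be sketched in a few lines. This is a self-contained illustration of the pattern, with invented names; the real interfaces are in `StellarExecutionListeners`/`StellarExecutionNotifier`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: the executor notifies listeners as expressions execute; the
// auto-completer is just one such listener.
public class ListenerSketch {

    interface VariableDefinedListener { void whenVariableDefined(String name); }

    static class Executor {
        private final List<VariableDefinedListener> listeners = new ArrayList<>();
        void addVariableListener(VariableDefinedListener l) { listeners.add(l); }
        void assign(String name, Object value) {
            // ... evaluate and store the value, then notify everyone ...
            for (VariableDefinedListener l : listeners) l.whenVariableDefined(name);
        }
    }

    // Collects candidate names so the REPL can tab-complete them later.
    static class AutoCompleter implements VariableDefinedListener {
        final TreeSet<String> candidates = new TreeSet<>();
        public void whenVariableDefined(String name) { candidates.add(name); }
    }

    public static void main(String[] args) {
        Executor executor = new Executor();
        AutoCompleter completer = new AutoCompleter();
        executor.addVariableListener(completer);
        executor.assign("msg", "{...}");
        executor.assign("ip", "10.0.0.1");
        System.out.println(completer.candidates); // both variables are now candidates
    }
}
```

The payoff is decoupling: the executor knows nothing about auto-completion, so the CLI and the Zeppelin interpreter can each hang their own listeners off the same executor.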


---


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Simon Elliston Ball
There is some really cool stuff happening here, if only I’d been allowed to see 
the lists over Christmas... :)

A few thoughts...

I like Otto’s generalisation of the problem to include specific local stellar 
objects in a cache loaded from a store (HDFS seems a natural, but not only 
place, maybe even a web service / local microservicey object provider!?) That 
said, I suspect that’s a good platform optimisation approach. Should we look at 
this as a separate piece of work given it extends beyond the scope of the 
summarisation concept and ultimately use it as a back-end to feed the 
summarising engine proposed here for the enrichment loader?

On the more specific use case, one thing I would comment on is the 
configuration approach. The iteration loop (state_{init|update|merge}) should be 
consistent with the way we handle things like the profiler config, since it’s 
the same approach to data handling. 

The other thing that seems to have crept in here is the interface to something 
like Spark, which again, I am really very very keen on seeing happen. That 
said, I’m not sure how that would happen in this context, unless you’re talking 
about pushing to something like Livy for example (eminently sensible for things 
like cross-instance caching and faster RPC-ish access to an existing Spark 
context), which seems to be what Casey is driving at with the Spark piece. 

To address the question of text manipulation in Stellar / metron enrichment 
ingest etc, we already have this outside of the context of the issues here. I 
would argue that yes, we don’t want too many paths for this, and that maybe our 
parser approach might be heavily related to text-based ingest. I would say the 
scope worth dealing with here though is not really text manipulation, but 
summarisation, which is not well served by existing CLI tools like awk / sed 
and friends.

Simon

> On 3 Jan 2018, at 15:48, Nick Allen  wrote:
> 
>> Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
> this will impact performance
> 
> What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> seems really high, unless I am not understanding something.
> 
> 
> 
> 
> 
> 
> On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella  wrote:
> 
>> Thanks for the feedback, Nick.
>> 
>> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>> 
>> I would argue that we are not reinventing the wheel for text manipulation
>> as the extractor config exists already and we are doing a similar thing in
>> the flatfile loader (in fact, the code is reused and merely extended).
>> Transformation operations are already supported in our codebase in the
>> extractor config, this PR has just added some hooks for stateful
>> operations.
>> 
>> Furthermore, we will need a configuration object to pass to the REST call
>> if we are ever to create a UI around importing data into hbase or creating
>> these summary objects.
>> 
>> Regarding your example:
>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>> 
>> I'm very sympathetic to this type of extension, but it has some issues:
>> 
>>   1. This implies a single-threaded addition to the bloom filter.
>>  1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
>>  think this will impact performance
>>  2. There's not a way to specify how to merge across threads if we do
>>  make a multithread command line option
>>   2. This restricts these kinds of operations to roles with heavy unix CLI
>>   knowledge, which isn't often the types of people who would be doing this
>>   type of operation
>>   3. What if we need two variables passed to stellar?
>>   4. This approach will be harder to move to Hadoop.  Eventually we will
>>   want to support data on HDFS being processed by Hadoop (similar to
>> flatfile
>>   loader), so instead of -m LOCAL being passed for the flatfile summarizer
>>   you'd pass -m SPARK and the processing would happen on the cluster
>>  1. This is particularly relevant in this case as it's a
>>  embarrassingly parallel problem in general
>> 
>> In summary, while this a CLI approach is attractive, I prefer the extractor
>> config solution because it is the solution with the smallest iteration
>> that:
>> 
>>   1. Reuses existing metron extraction infrastructure
>>   2. Provides the most solid base for the extensions that will be sorely
>>   needed soon (and will keep it in parity with the flatfile loader)
>>   3. Provides the most solid base for a future UI extension in the
>>   management UI to support both summarization and loading
>> 
>> 
>> 
>> 
>> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:
>> 
>>> First off, I really do like the typosquatting use case and a lot of what
>>> you have described.
>>> 
 We need a way to generate the summary sketches from flat data for this
>> to
 work.
 

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Nick Allen
Oh, gotcha.  That makes sense.  Thanks for clarifying.

On Wed, Jan 3, 2018 at 12:15 PM, Casey Stella  wrote:

> It's actually many more than 1M.  There are 1M domains, each domain could
> have upwards of 300 - 1000 possible typosquatted domains.
>
> You will notice from
> https://github.com/cestella/incubator-metron/tree/
> typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter
> that we are not adding the domain to the bloom filter, we're adding each
> domain generated from DOMAIN_TYPOSQUAT to the bloom filter.  In fact, we
> would very specifically NOT want the base domain as that would not be an
> indication of typosquatting (going to google.com would be legit, going to
> goggle.com would not).
>
>
>
> On Wed, Jan 3, 2018 at 10:48 AM, Nick Allen  wrote:
>
> > > Even with 5 threads, it takes an hour for the full Alexa 1m, so I
> think
> > this will impact performance
> >
> > What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> > seems really high, unless I am not understanding something.
> >
> >
> >
> >
> >
> >
> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella 
> wrote:
> >
> > > Thanks for the feedback, Nick.
> > >
> > > Regarding "IMHO, I'd rather not reinvent the wheel for text
> > manipulation."
> > >
> > > I would argue that we are not reinventing the wheel for text
> manipulation
> > > as the extractor config exists already and we are doing a similar thing
> > in
> > > the flatfile loader (in fact, the code is reused and merely extended).
> > > Transformation operations are already supported in our codebase in the
> > > extractor config, this PR has just added some hooks for stateful
> > > operations.
> > >
> > > Furthermore, we will need a configuration object to pass to the REST
> call
> > > if we are ever to create a UI around importing data into hbase or
> > creating
> > > these summary objects.
> > >
> > > Regarding your example:
> > > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > >
> > > I'm very sympathetic to this type of extension, but it has some issues:
> > >
> > >1. This implies a single-threaded addition to the bloom filter.
> > >   1. Even with 5 threads, it takes an hour for the full alexa 1m,
> so
> > I
> > >   think this will impact performance
> > >   2. There's not a way to specify how to merge across threads if we
> > do
> > >   make a multithread command line option
> > >2. This restricts these kinds of operations to roles with heavy unix
> > CLI
> > >knowledge, which isn't often the types of people who would be doing
> > this
> > >type of operation
> > >3. What if we need two variables passed to stellar?
> > >4. This approach will be harder to move to Hadoop.  Eventually we
> will
> > >want to support data on HDFS being processed by Hadoop (similar to
> > > flatfile
> > >loader), so instead of -m LOCAL being passed for the flatfile
> > summarizer
> > >you'd pass -m SPARK and the processing would happen on the cluster
> > >   1. This is particularly relevant in this case as it's a
> > >   embarrassingly parallel problem in general
> > >
> > > In summary, while this a CLI approach is attractive, I prefer the
> > extractor
> > > config solution because it is the solution with the smallest iteration
> > > that:
> > >
> > >1. Reuses existing metron extraction infrastructure
> > >2. Provides the most solid base for the extensions that will be
> sorely
> > >needed soon (and will keep it in parity with the flatfile loader)
> > >3. Provides the most solid base for a future UI extension in the
> > >management UI to support both summarization and loading
> > >
> > >
> > >
> > >
> > > On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen 
> wrote:
> > >
> > > > First off, I really do like the typosquatting use case and a lot of
> > what
> > > > you have described.
> > > >
> > > > > We need a way to generate the summary sketches from flat data for
> > this
> > > to
> > > > > work.
> > > > > ​..​
> > > > >
> > > >
> > > > I took this quote directly from your use case.  Above is the point
> that
> > > I'd
> > > > like to discuss and what your proposed solutions center on.  This is
> > > what I
> > > > think you are trying to do, at least with PR #879
> > > > ...
> > > >
> > > > (Q) Can we repurpose Stellar functions so that they can operate on
> text
> > > > stored in a file system?
> > > >
> > > >
> > > > Whether we use the (1) Configuration or the (2) Function-based
> approach
> > > > that you described, fundamentally we are introducing new ways to
> > perform
> > > > text manipulation inside of Stellar.
> > > >
> > > > IMHO, I'd rather not reinvent the wheel for text manipulation.  It
> > would
> > > be
> > > > painful to implement and maintain a bunch of Stellar functions for
> 

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Casey Stella
It's actually many more than 1M.  There are 1M domains, each domain could
have upwards of 300 - 1000 possible typosquatted domains.

You will notice from
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter
that we are not adding the domain to the bloom filter, we're adding each
domain generated from DOMAIN_TYPOSQUAT to the bloom filter.  In fact, we
would very specifically NOT want the base domain as that would not be an
indication of typosquatting (going to google.com would be legit, going to
goggle.com would not).
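The multi-threaded build and the "merge across threads" point from earlier in the thread come down to one property: two partial Bloom filters with the same size and hash count merge with a bitwise OR. The sketch below is pure-Java and purely illustrative (Metron has its own BloomFilter type; the hashing here is a generic double-hashing construction, not Metron's).

```java
import java.util.BitSet;

// Sketch: partial Bloom filters built on separate threads (or Spark
// partitions) merge losslessly with a bitwise OR of their bit sets.
public class MergeableBloomSketch {

    final BitSet bits;
    final int m;   // number of bits
    final int k;   // number of hash functions

    MergeableBloomSketch(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // Double hashing: h_i(x) = h1(x) + i * h2(x), a standard construction.
    void add(String item) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) | 1;   // force odd
        for (int i = 0; i < k; i++) {
            bits.set(Math.floorMod(h1 + i * h2, m));
        }
    }

    boolean mightContain(String item) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) | 1;
        for (int i = 0; i < k; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, m))) return false;
        }
        return true;
    }

    // The "state merge" step: only valid for filters with identical m and k.
    void merge(MergeableBloomSketch other) { bits.or(other.bits); }

    public static void main(String[] args) {
        MergeableBloomSketch part1 = new MergeableBloomSketch(1 << 20, 5);
        MergeableBloomSketch part2 = new MergeableBloomSketch(1 << 20, 5);
        part1.add("goggle.com");    // e.g. generated typosquats from shard 1
        part2.add("gooogle.com");   // ... and from shard 2
        part1.merge(part2);
        System.out.println(part1.mightContain("goggle.com"));   // true (no false negatives)
        System.out.println(part1.mightContain("gooogle.com"));  // true
    }
}
```

Because the merge is associative and commutative, each thread can fill its own filter from a shard of the generated typosquats and the results can be OR'd together in any order at the end, which is exactly what makes the problem embarrassingly parallel.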



On Wed, Jan 3, 2018 at 10:48 AM, Nick Allen  wrote:

> > Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
> this will impact performance
>
> What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> seems really high, unless I am not understanding something.
>
>
>
>
>
>
> On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella  wrote:
>
> > Thanks for the feedback, Nick.
> >
> > Regarding "IMHO, I'd rather not reinvent the wheel for text
> manipulation."
> >
> > I would argue that we are not reinventing the wheel for text manipulation
> > as the extractor config exists already and we are doing a similar thing
> in
> > the flatfile loader (in fact, the code is reused and merely extended).
> > Transformation operations are already supported in our codebase in the
> > extractor config, this PR has just added some hooks for stateful
> > operations.
> >
> > Furthermore, we will need a configuration object to pass to the REST call
> > if we are ever to create a UI around importing data into hbase or
> creating
> > these summary objects.
> >
> > Regarding your example:
> > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> >
> > I'm very sympathetic to this type of extension, but it has some issues:
> >
> >1. This implies a single-threaded addition to the bloom filter.
> >   1. Even with 5 threads, it takes an hour for the full alexa 1m, so
> I
> >   think this will impact performance
> >   2. There's not a way to specify how to merge across threads if we
> do
> >   make a multithread command line option
> >2. This restricts these kinds of operations to roles with heavy unix
> CLI
> >knowledge, which isn't often the types of people who would be doing
> this
> >type of operation
> >3. What if we need two variables passed to stellar?
> >4. This approach will be harder to move to Hadoop.  Eventually we will
> >want to support data on HDFS being processed by Hadoop (similar to
> > flatfile
> >loader), so instead of -m LOCAL being passed for the flatfile
> summarizer
> >you'd pass -m SPARK and the processing would happen on the cluster
> >   1. This is particularly relevant in this case as it's a
> >   embarrassingly parallel problem in general
> >
> > In summary, while this a CLI approach is attractive, I prefer the
> extractor
> > config solution because it is the solution with the smallest iteration
> > that:
> >
> >1. Reuses existing metron extraction infrastructure
> >2. Provides the most solid base for the extensions that will be sorely
> >needed soon (and will keep it in parity with the flatfile loader)
> >3. Provides the most solid base for a future UI extension in the
> >management UI to support both summarization and loading
> >
> >
> >
> >
> > On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:
> >
> > > First off, I really do like the typosquatting use case and a lot of
> what
> > > you have described.
> > >
> > > > We need a way to generate the summary sketches from flat data for
> this
> > to
> > > > work.
> > > > ​..​
> > > >
> > >
> > > I took this quote directly from your use case.  Above is the point that
> > I'd
> > > like to discuss and what your proposed solutions center on.  This is
> > what I
> > > think you are trying to do, at least with PR #879
> > > ...
> > >
> > > (Q) Can we repurpose Stellar functions so that they can operate on text
> > > stored in a file system?
> > >
> > >
> > > Whether we use the (1) Configuration or the (2) Function-based approach
> > > that you described, fundamentally we are introducing new ways to
> perform
> > > text manipulation inside of Stellar.
> > >
> > > IMHO, I'd rather not reinvent the wheel for text manipulation.  It
> would
> > be
> > > painful to implement and maintain a bunch of Stellar functions for text
> > > manipulation.  People already have a large number of tools available to
> > do
> > > this and everyone has their favorites.  People are resistant to
> learning
> > > something new when they already are familiar with another way to do the
> > > same thing.
> > >
> > > So then the question is, how else can we do this?  My suggestion is
> that
> > > rather than introducing text 

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Nick Allen
> Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
this will impact performance

What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
seems really high, unless I am not understanding something.







Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Casey Stella
Thanks for the feedback, Nick.

Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."

I would argue that we are not reinventing the wheel for text manipulation
as the extractor config exists already and we are doing a similar thing in
the flatfile loader (in fact, the code is reused and merely extended).
Transformation operations are already supported in our codebase in the
extractor config, this PR has just added some hooks for stateful operations.

Furthermore, we will need a configuration object to pass to the REST call
if we are ever to create a UI around importing data into hbase or creating
these summary objects.

Regarding your example:
$ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'

I'm very sympathetic to this type of extension, but it has some issues:

   1. This implies a single-threaded addition to the bloom filter.
  1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
  think this will impact performance
  2. There's not a way to specify how to merge across threads if we do
  make a multithread command line option
   2. This restricts these kinds of operations to roles with heavy unix CLI
   knowledge, which isn't often the types of people who would be doing this
   type of operation
   3. What if we need two variables passed to stellar?
   4. This approach will be harder to move to Hadoop.  Eventually we will
   want to support data on HDFS being processed by Hadoop (similar to flatfile
   loader), so instead of -m LOCAL being passed for the flatfile summarizer
   you'd pass -m SPARK and the processing would happen on the cluster
   1. This is particularly relevant in this case as it's an
   embarrassingly parallel problem in general
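The merge concern in point 1 can be made concrete. Below is a minimal, stdlib-only Python sketch; the `Bloom` class and `summarize` helper are illustrative stand-ins (not Metron's actual bloom filter or the Stellar `BLOOM_*` functions) showing why per-worker filters need an explicit merge step, which a flat Unix pipeline has no way to express:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

class Bloom:
    """Tiny Bloom filter over a single integer bitmask (illustrative only)."""
    def __init__(self, m=1 << 16, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # k independent hash positions derived from sha256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # The state merge: bitwise OR combines per-worker filters losslessly.
        self.bits |= other.bits
        return self

def summarize(domains, workers=4):
    # state_init + state_update happen per worker; state_merge folds results.
    def build(chunk):
        b = Bloom()
        for d in chunk:
            b.add(d)
        return b

    chunks = [domains[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(workers) as pool:
        partials = list(pool.map(build, chunks))
    merged = Bloom()
    for p in partials:
        merged.merge(p)
    return merged

bloom = summarize(["example.com", "apache.org", "metron.io"], workers=2)
```

Because the merge is a lossless bitwise OR, the summarization parallelizes cleanly; a pipeline of independent `stellar -i` processes has no equivalent of the final `merge` fold.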

In summary, while this CLI approach is attractive, I prefer the extractor
config solution because it is the solution with the smallest iteration that:

   1. Reuses existing metron extraction infrastructure
   2. Provides the most solid base for the extensions that will be sorely
   needed soon (and will keep it in parity with the flatfile loader)
   3. Provides the most solid base for a future UI extension in the
   management UI to support both summarization and loading




On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:

> First off, I really do like the typosquatting use case and a lot of what
> you have described.
>
> > We need a way to generate the summary sketches from flat data for this to
> > work.
> > ​..​
> >
>
> I took this quote directly from your use case.  Above is the point that I'd
> like to discuss and what your proposed solutions center on.  This is what I
> think you are trying to do, at least with PR #879
> ...
>
> (Q) Can we repurpose Stellar functions so that they can operate on text
> stored in a file system?
>
>
> Whether we use the (1) Configuration or the (2) Function-based approach
> that you described, fundamentally we are introducing new ways to perform
> text manipulation inside of Stellar.
>
> IMHO, I'd rather not reinvent the wheel for text manipulation.  It would be
> painful to implement and maintain a bunch of Stellar functions for text
> manipulation.  People already have a large number of tools available to do
> this and everyone has their favorites.  People are resistant to learning
> something new when they already are familiar with another way to do the
> same thing.
>
> So then the question is, how else can we do this?  My suggestion is that
> rather than introducing text manipulation tools inside of Stellar, we allow
> people to use the text manipulation tools they already know, but with the
> Stellar functions that we already have.  And the obvious way to tie those
> two things together is the Unix pipeline.
>
> A quick, albeit horribly incomplete, example to flesh this out a bit more
> based on the example you have in PR #879
> .  This would allow me to
> integrate Stellar with whatever external tools that I want.
>
> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>
>
>
>
>
>
>
>
> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella  wrote:
>
> > I'll start this discussion off with my idea around a 2nd step that is
> more
> > adaptable.  I propose the following set of stellar functions backed by
> > Spark in the metron-management project:
> >
> >- CSV_PARSE(location, separator?, columns?) : Constructs a Spark
> >Dataframe for reading the flatfile
> >- SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> > dataframe
> >- SUMMARIZE(state_init, state_update, state_merge): Summarize the
> >dataframe using the lambda functions:
> >   - state_init - executed once per worker to initialize the state
> >   - state_update - executed once per row
> >   - 

[GitHub] metron pull request #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/metron/pull/869


---


[GitHub] metron issue #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread justinleet
Github user justinleet commented on the issue:

https://github.com/apache/metron/pull/869
  
I'm still +1 after the latest changes.  Thanks @nickwallen!


---


[GitHub] metron issue #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread anandsubbu
Github user anandsubbu commented on the issue:

https://github.com/apache/metron/pull/869
  
+1 (non-binding). Looks good @nickwallen!


---


[GitHub] metron pull request #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/869#discussion_r159434045
  
--- Diff: metron-deployment/packaging/ambari/metron-mpack/README.md ---
@@ -0,0 +1,127 @@
+
+
+This provides a Management Pack (MPack) extension for [Apache 
Ambari](https://ambari.apache.org/) that simplifies the provisioning, 
management and monitoring of Metron on clusters of any size.  
+
+This allows you to easily install Metron using a simple, guided process.  
This also allows you to monitor cluster health and even secure your cluster 
with kerberos.
+
+### Prerequisites
+
+* Ambari 2.4.2+
+
+* Installable Metron packages (either RPMs or DEBs) located in a 
repository on each host at `/localrepo`.
+
+* A [Node.js](https://nodejs.org/en/download/package-manager/) repository 
installed on the host running the Management and Alarm UI.
+
+### Quick Start
+
+1. Build the Metron MPack. Execute the following command from the 
project's root directory.
+```
+mvn clean package -Pmpack -DskipTests
+```
+
+1. This results in the Mpack being produced at the following location.
+```
+
metron-deployment/packaging/ambari/metron-mpack/target/metron_mpack-x.y.z.0.tar.gz
+```
+
+1. Copy the tarball to the host where Ambari Server is installed.
+
+1. Ensure that Ambari Server is stopped.
+
+1. Install the MPack.
+```
+ambari-server install-mpack --mpack=metron_mpack-x.y.z.0.tar.gz 
--verbose
+```
+
+1. Install the Metron packages (RPMs or DEBs) in a local repository on 
each host where a Metron component is installed.  By default, the repository is 
expected to exist at `/localrepo`.
+
+On hosts where only a Metron client is installed, the local repository 
must exist, but it does not need to contain Metron packages.  For example to 
create an empty repository for an RPM-based system, run the following commands.
+
+```
+yum install createrepo
+mkdir /localrepo
+cd /localrepo
+createrepo
+```
+
+1. Metron will now be available as an installable service within Ambari.  
+
+### Installation Notes
+
+The MPack will make all Metron services available in Ambari in the same 
manner as any other services in a stack.  These can be installed using Ambari's 
user interface using "Add Services" or during an initial cluster install.
+
+#### Co-Location
+
+1. The Parsers, Enrichment, Indexing, and Profiler masters should be 
colocated on a host with a Kafka Broker.  This is necessary so that the correct 
Kafka topics can be created.
+
+1. The Enrichment and Profiler masters should be colocated on a host with 
an HBase client.  This is necessary so that the Enrichment, Threat Intel, and 
Profile tables can be created.
+
+This colocation is currently not enforced by Ambari and should be managed 
by either a Service or Stack advisor as an enhancement.
--- End diff --

I removed the collocation requirements from the docs as these are enforced 
by the Mpack itself.  Let me know if anyone disagrees with this.


---


[GitHub] metron pull request #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/869#discussion_r159433756
  
--- Diff: metron-deployment/packaging/ambari/metron-mpack/README.md ---
+This colocation is currently not enforced by Ambari and should be managed 
by either a Service or Stack advisor as an enhancement.
--- End diff --

I checked the service_advisor.py script and all of these documented 
collocation requirements are enforced, except one.

I created https://issues.apache.org/jira/browse/METRON-1387 to track one 
missing collocation requirement.


---


[GitHub] metron pull request #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/869#discussion_r159431190
  
--- Diff: metron-deployment/packaging/ambari/metron-mpack/README.md ---
+This colocation is currently not enforced by Ambari and should be managed 
by either a Service or Stack advisor as an enhancement.
--- End diff --

Question 2: Do we even need to document the collocation requirements since 
they are enforced in the code?


---


[GitHub] metron pull request #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/869#discussion_r159431002
  
--- Diff: metron-deployment/packaging/ambari/metron-mpack/README.md ---
+This colocation is currently not enforced by Ambari and should be managed 
by either a Service or Stack advisor as an enhancement.
--- End diff --

Good find @anandsubbu.  So there are a few places that say basically "we should force collocation with the service advisor", but indeed that is already done.  I will remove that from the docs.


---


[GitHub] metron pull request #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread anandsubbu
Github user anandsubbu commented on a diff in the pull request:

https://github.com/apache/metron/pull/869#discussion_r159430163
  
--- Diff: metron-deployment/packaging/ambari/metron-mpack/README.md ---
+This colocation is currently not enforced by Ambari and should be managed 
by either a Service or Stack advisor as an enhancement.
--- End diff --

In the "Assign Masters" page, if the Kafka broker, Parsers host co-location 
or Alerts/Mgt UI co-location requirements are not met, then Ambari indicated 
this using an error balloon like:

![error](https://user-images.githubusercontent.com/20395490/34523079-ddc7106e-f0bb-11e7-9763-472a482d20a2.png)

If one of the above is not met and the user clicks "Next", then Ambari presents a validation issues dialog:

![validation-issues](https://user-images.githubusercontent.com/20395490/34523091-f0c13762-f0bb-11e7-902b-84d43722c3d8.png)

Now, the user can always hit "Continue Anyway" in the above dialog and move 
on with the wizard. This way, Ambari does not enforce the co-location.

The other client related requirements are indicated as a warning icon 
(since the client installation selection happens in the next step of the 
wizard).

(Thank you for reading through my long winding comment)


---


[GitHub] metron pull request #869: METRON-1362 Improve Metron Deployment README

2018-01-03 Thread anandsubbu
Github user anandsubbu commented on a diff in the pull request:

https://github.com/apache/metron/pull/869#discussion_r159426522
  
--- Diff: metron-deployment/packaging/ambari/metron-mpack/README.md ---
+1. The Enrichment and Profiler masters should be colocated on a host with 
an HBase client.  This is necessary so that the Enrichment, Threat Intel, and 
Profile tables can be created.
+
--- End diff --

Just to give some more context on the co-location requirements...

Point no. 3 above is due to the way the [service_advisor](https://github.com/apache/metron/blob/master/metron-deployment/packaging/ambari/metron-mpack/src/main/resources/common-services/METRON/CURRENT/service_advisor.py#L78) is designed: it checks that all of the Parsers, Enrichment, Indexing and Profiler masters are on the same host.
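As a rough illustration of the kind of check described above, here is a Python sketch of a colocation validation; the function and component names are invented for illustration and do not match the actual `service_advisor.py` API:

```python
def validate_colocation(component_hosts):
    """component_hosts: component name -> set of hosts running it.
    Returns a list of human-readable validation errors (empty if OK).
    Component names here are hypothetical, not Ambari's real identifiers."""
    errors = []
    masters = ["PARSER_MASTER", "ENRICHMENT_MASTER",
               "INDEXING_MASTER", "PROFILER_MASTER"]
    host_sets = [component_hosts.get(m, set()) for m in masters]
    # All four masters must share at least one common host.
    common = set.intersection(*host_sets) if all(host_sets) else set()
    if not common:
        errors.append("Parsers, Enrichment, Indexing and Profiler masters "
                      "must be installed on the same host")
    # That shared host must also run a Kafka broker so topics can be created.
    elif not common & component_hosts.get("KAFKA_BROKER", set()):
        errors.append("The Metron masters must be colocated with a Kafka broker")
    return errors
```

An advisor structured this way can only warn; as noted above, Ambari lets the user "Continue Anyway", so the check is advisory rather than enforced.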


---