I can definitely give it a shot. A kickstart would be appreciated.

Jon

On Tue, Jul 12, 2016, 17:17 James Sirota <[email protected]> wrote:

> John,
>
> Just filed METRON-318. Is this something you would like to work on?
> Would you like help from us to get started?
>
> Thanks,
> James
>
> 12.07.2016, 11:53, "[email protected]" <[email protected]>:
> > Hi All,
> >
> > Has there been any additional discussion or development regarding this? I
> > did take a brief look around the jira and didn't see anything regarding
> > this, but I may have missed it. Thanks,
> >
> > Jon
> >
> > On Fri, Apr 15, 2016 at 2:01 PM Nick Allen <[email protected]> wrote:
> >
> >> I definitely agree that you need this level of understanding of your
> >> cluster. It definitely could work the way that you describe.
> >>
> >> I was thinking of it slightly differently though. The metrics for this
> >> purpose (understanding performance of existing cluster) should come from
> >> the actual sensors themselves. For example, I need to instrument the
> >> packet capture process so that it kicks out time-series-ish metrics that
> >> you can monitor in a dashboard over time.
> >>
> >> On Fri, Apr 15, 2016 at 1:40 PM, [email protected] <[email protected]>
> >> wrote:
> >>
> >> > However, it would be handy to have something like this perpetually
> >> > running so you know when to scale up/out/down/in a cluster.
> >> >
> >> > On Fri, Apr 15, 2016, 13:35 Nick Allen <[email protected]> wrote:
> >> >
> >> > > I think it is slightly different. I don't even want to install minimal
> >> > > Kafka infrastructure (Look ma, no Kafka!)
> >> > >
> >> > > The exact implementation would differ based on the data inputs that
> >> > > you are trying to measure, but for example...
> >> > >
> >> > > - To understand raw packet rates I would have a specialized sensor
> >> > >   that counts packets and size on the wire. It doesn't do anything
> >> > >   more than that.
> >> > > - To understand Netflow rates, it would watch for Netflow packets
> >> > >   and count those.
> >> > > - To understand sizing around application logs, a sensor would watch
> >> > >   for Syslog packets and count those.
> >> > >
> >> > > The implementation would be more similar to raw packet capture with
> >> > > some DPI. No Hadoop-y components required.
> >> > >
> >> > > On Fri, Apr 15, 2016 at 1:10 PM, James Sirota <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > So this is exactly what I am proposing. Calculate the metrics on the
> >> > > > fly without landing any data in the cluster. The problem is that
> >> > > > enterprise data volumes are so large you can’t just point them at a
> >> > > > Java or a C++ program or sensor. You either need an existing minimal
> >> > > > Kafka infrastructure to take that load or sample the data.
> >> > > >
> >> > > > Thanks,
> >> > > > James
> >> > > >
> >> > > > On 4/15/16, 9:54 AM, "Nick Allen" <[email protected]> wrote:
> >> > > >
> >> > > > >Or we have the assessment tool not actually land any data. The
> >> > > > >assessment tool becomes a 'sensor' in its own right. You just point
> >> > > > >the input data sets at the assessment tool, it builds metrics on the
> >> > > > >input (for example: count the number of packets per second) and then
> >> > > > >we use those metrics to estimate cluster size.
> >> > > > >
> >> > > > >On Wed, Apr 13, 2016 at 5:45 PM, James Sirota <[email protected]>
> >> > > > >wrote:
> >> > > > >
> >> > > > >> That’s an excellent point. So I think there are three ways forward.
> >> > > > >>
> >> > > > >> One is we can assume that there has to be at least a minimal
> >> > > > >> infrastructure in place (at least a subset of Kafka and Storm
> >> > > > >> resources) to run a full-scale assessment.
> >> > > > >> If you point something that blasts millions of messages per second
> >> > > > >> at something like ActiveMQ you are going to blow up. So the
> >> > > > >> infrastructure to at least receive these kinds of message volumes
> >> > > > >> has to exist as a pre-requisite. There is no way to get around that.
> >> > > > >>
> >> > > > >> The second approach I see is sampling. Sampling is a lot less
> >> > > > >> precise and you can miss peaks that fall outside of your sampling
> >> > > > >> windows. But the obvious benefit is that you don’t need a cluster
> >> > > > >> to process these streams. You can probably perform most of your
> >> > > > >> calculations with a multithreaded Java program. Sampling poses a
> >> > > > >> few design challenges. First, where do you sample? Do you sample on
> >> > > > >> the sensor? (The implication here is that we have to program some
> >> > > > >> sort of sampling capability into our sensors.) Do you sample on
> >> > > > >> transport? (Maybe a Flume interceptor or NiFi processor.) There is
> >> > > > >> also a question of what the sampling rate should be. Not knowing
> >> > > > >> the statistical properties of a stream ahead of time, it’s hard to
> >> > > > >> make that call.
> >> > > > >>
> >> > > > >> The third option, I think, is an MR job. We can blast the data into
> >> > > > >> HDFS and then go over it with MR to derive the metrics we are
> >> > > > >> looking for. Then we don’t have to sample or set up expensive
> >> > > > >> infrastructure to receive a deluge of data. But then we run into
> >> > > > >> the chicken and the egg problem that in order to size your HDFS you
> >> > > > >> need to have data in HDFS.
> >> > > > >> Ideally you need to capture at least one full week’s worth of logs
> >> > > > >> because patterns throughout the day as well as every day of the
> >> > > > >> week have different statistical properties. So you need off peak,
> >> > > > >> on peak, weekdays and weekends to derive these stats in batch.
> >> > > > >>
> >> > > > >> Any other design ideas?
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >> James
> >> > > > >>
> >> > > > >> On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote:
> >> > > > >>
> >> > > > >> >If the tool starts at Kafka, the user would have to already have
> >> > > > >> >committed to the investment in the infrastructure and time to set
> >> > > > >> >up the sensors that feed Kafka and Kafka itself. Maybe it would
> >> > > > >> >need to be further upstream?
> >> > > > >> >On Apr 13, 2016 1:05 PM, "James Sirota" <[email protected]>
> >> > > > >> >wrote:
> >> > > > >> >
> >> > > > >> >> Hi George,
> >> > > > >> >>
> >> > > > >> >> This article defines micro-tuning of the existing cluster. What
> >> > > > >> >> I am proposing is a level up from that. When you start with
> >> > > > >> >> Metron how do you even know how many nodes you need? And of
> >> > > > >> >> these nodes how many do you allocate to Storm, indexing,
> >> > > > >> >> storage? How much storage do you need? Tuning would be the next
> >> > > > >> >> step in the process, but this tool would answer more fundamental
> >> > > > >> >> questions about what a Metron deployment should look like given
> >> > > > >> >> the number of telemetries and retention policies of the
> >> > > > >> >> enterprise.
> >> > > > >> >>
> >> > > > >> >> The best way to get this data (in my opinion) is to have some
> >> > > > >> >> tool that we can plug into Metron’s point of ingest (Kafka
> >> > > > >> >> topics) and run that for about a week or a month to be able to
> >> > > > >> >> figure that out and spit out these relevant metrics. Based on
> >> > > > >> >> these metrics we can figure out the fundamental things about
> >> > > > >> >> what Metron should look like. Tuning would be the next step.
> >> > > > >> >>
> >> > > > >> >> Thanks,
> >> > > > >> >> James
> >> > > > >> >>
> >> > > > >> >> On 4/13/16, 9:52 AM, "George Vetticaden" <[email protected]>
> >> > > > >> >> wrote:
> >> > > > >> >>
> >> > > > >> >> >I have used the following Kafka and Storm Best Practices guide
> >> > > > >> >> >at numerous customer implementations.
> >> > > > >> >> >
> >> > > > >> >> >https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html
> >> > > > >> >> >
> >> > > > >> >> >We need to have something similar and prescriptive for Metron
> >> > > > >> >> >based on:
> >> > > > >> >> >1. What data sources are we enabling
> >> > > > >> >> >2. What enrichment services are we enabling
> >> > > > >> >> >3. What threat intel services are we enabling
> >> > > > >> >> >4. What are we indexing into Solr/Elastic and for how long
> >> > > > >> >> >5. What are we persisting into HDFS
> >> > > > >> >> >
> >> > > > >> >> >Ideally, the Metron assessment tool combined with an
> >> > > > >> >> >introspection of the user’s Ansible configuration should drive
> >> > > > >> >> >what Ambari blueprint type and configuration should be used
> >> > > > >> >> >when the cluster is spun up and the Storm topology is deployed.
> >> > > > >> >> >
> >> > > > >> >> >--
> >> > > > >> >> >George Vetticaden
> >> > > > >> >> >Principal, COE
> >> > > > >> >> >[email protected]
> >> > > > >> >> >(630) 909-9138
> >> > > > >> >> >
> >> > > > >> >> >On 4/13/16, 11:40 AM, "George Vetticaden" <[email protected]>
> >> > > > >> >> >wrote:
> >> > > > >> >> >
> >> > > > >> >> >>+1 to James's suggestion.
> >> > > > >> >> >>We also need to consider not just the data volume and storage
> >> > > > >> >> >>requirements for proper cluster sizing but also processing
> >> > > > >> >> >>requirements. Given that in the new architecture we have moved
> >> > > > >> >> >>to a single enrichment topology that will support all data
> >> > > > >> >> >>sources, proper sizing of the enrichment topology will be even
> >> > > > >> >> >>more crucial to maintain SLAs and HA requirements.
> >> > > > >> >> >>The following key questions will apply to each parser topology
> >> > > > >> >> >>and to the single enrichment topology:
> >> > > > >> >> >>
> >> > > > >> >> >>1. Number of workers?
> >> > > > >> >> >>2. Number of workers per machine?
> >> > > > >> >> >>3. Size of each worker (in memory)?
> >> > > > >> >> >>4. Supervisor memory settings
> >> > > > >> >> >>
> >> > > > >> >> >>The assessment tool should also be used to size topologies
> >> > > > >> >> >>correctly.
> >> > > > >> >> >>
> >> > > > >> >> >>Tuning Kafka, HBase and Solr/Elastic should also be governed
> >> > > > >> >> >>by the Metron assessment tool.
> >> > > > >> >> >>
> >> > > > >> >> >>--
> >> > > > >> >> >>George Vetticaden
> >> > > > >> >> >>
> >> > > > >> >> >>On 4/13/16, 11:28 AM, "James Sirota" <[email protected]>
> >> > > > >> >> >>wrote:
> >> > > > >> >> >>
> >> > > > >> >> >>>Prior to adoption of Metron each adopting entity needs to
> >> > > > >> >> >>>guesstimate its data volume and data storage requirements so
> >> > > > >> >> >>>they can size their cluster properly. I propose the creation
> >> > > > >> >> >>>of an assessment tool that can plug in to a Kafka topic for a
> >> > > > >> >> >>>given telemetry and over time produce statistics for ingest
> >> > > > >> >> >>>volumes and storage requirements. The idea is that prior to
> >> > > > >> >> >>>adoption of Metron someone can set up all the feeds and Kafka
> >> > > > >> >> >>>topics, but instead of deploying Metron right away they would
> >> > > > >> >> >>>deploy this tool. This tool would then produce statistics for
> >> > > > >> >> >>>data ingest/storage requirements, and all relevant
> >> > > > >> >> >>>information needed for cluster sizing.
> >> > > > >> >> >>>
> >> > > > >> >> >>>Some of the metrics that can be recorded are:
> >> > > > >> >> >>>
> >> > > > >> >> >>> * Number of system events per second (average, max, mean,
> >> > > > >> >> >>>   standard dev)
> >> > > > >> >> >>> * Message size (average, max, mean, standard dev)
> >> > > > >> >> >>> * Average number of peaks
> >> > > > >> >> >>> * Duration of peaks (average, max, mean, standard dev)
> >> > > > >> >> >>>
> >> > > > >> >> >>>If the parser for a telemetry exists the tool can produce
> >> > > > >> >> >>>additional statistics:
> >> > > > >> >> >>>
> >> > > > >> >> >>> * Number of keys/fields parsed (average, max, mean,
> >> > > > >> >> >>>   standard dev)
> >> > > > >> >> >>> * Length of field parsed (average, max, mean, standard dev)
> >> > > > >> >> >>> * Length of key parsed (average, max, mean, standard dev)
> >> > > > >> >> >>>
> >> > > > >> >> >>>The tool can run for a week or a month and produce these
> >> > > > >> >> >>>kinds of statistics. Then once the statistics are available
> >> > > > >> >> >>>we can come up with guidance documentation for a recommended
> >> > > > >> >> >>>cluster setup. Otherwise it’s hard to properly size a cluster
> >> > > > >> >> >>>and set up streaming parallelism not knowing these metrics.
> >> > > > >> >> >>>
> >> > > > >> >> >>>Thoughts/ideas?
> >> > > > >> >> >>>
> >> > > > >> >> >>>Thanks,
> >> > > > >> >> >>>James
> >> > > > >
> >> > > > >--
> >> > > > >Nick Allen <[email protected]>
> >> > >
> >> > > --
> >> > > Nick Allen <[email protected]>
> >> >
> >> > --
> >> >
> >> > Jon
> >>
> >> --
> >> Nick Allen <[email protected]>
> >
> > --
> >
> > Jon
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>

--

Jon
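
For reference, a minimal sketch of the per-second statistics listed in the original proposal (events per second, message size, number and duration of peaks). It is not Metron code and reads from no Kafka topic or sensor: it assumes the input is already available as (timestamp in seconds, message size in bytes) pairs. The function names are hypothetical, and since the thread does not define a "peak", the threshold of mean + k * standard deviation is only an assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def per_second_counts(events):
    """Count events in each one-second bucket, including empty seconds.

    `events` is an iterable of (timestamp_seconds, size_bytes) pairs.
    """
    buckets = Counter(int(ts) for ts, _ in events)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # Counter returns 0 for seconds in which nothing arrived.
    return [buckets[s] for s in range(first, last + 1)]


def summarize(values):
    """Return mean, max and population standard deviation of a sequence."""
    values = list(values)
    return {"mean": mean(values), "max": max(values), "stdev": pstdev(values)}


def find_peaks(counts, k=2.0):
    """Return (start_index, duration_seconds) for each run of consecutive
    seconds whose count exceeds mean + k * stdev.

    The threshold is an assumption made for illustration only.
    """
    if not counts:
        return []
    threshold = mean(counts) + k * pstdev(counts)
    peaks, start = [], None
    for i, c in enumerate(counts):
        if c > threshold and start is None:
            start = i  # a peak begins
        elif c <= threshold and start is not None:
            peaks.append((start, i - start))  # the peak ended at i
            start = None
    if start is not None:
        peaks.append((start, len(counts) - start))
    return peaks


if __name__ == "__main__":
    events = [(0.1, 100), (0.5, 120), (1.2, 80), (3.0, 200), (3.9, 200)]
    counts = per_second_counts(events)
    print("events/sec:", summarize(counts))
    print("message size:", summarize(size for _, size in events))
    print("peaks:", find_peaks(counts, k=1.0))
```

The same `summarize` helper could be applied to peak durations or, where a parser exists, to the number and length of parsed fields.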
