Per the Apache Way it would be desirable to put together an architecture proposal and have the community take a look at it before implementing. I would propose a simple Storm topology that attaches to a Kafka topic and records statistics such as the number of messages and the total throughput over n-sized time bins. What specific requirements do you have in mind?

Thanks,
James

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org
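A rough sketch of what that topology could look like, illustrative only: it assumes the storm-kafka-client spout, and the broker address, topic, and class names are placeholders. A Kafka spout feeds a bolt that counts messages and bytes, and a tick tuple closes each time bin.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;

import java.util.Collections;
import java.util.Map;

public class AssessmentTopology {

    // Counts messages and (approximate) bytes; a tick tuple closes each bin.
    public static class ThroughputBolt extends BaseBasicBolt {
        private final int binSeconds;
        private transient long count;
        private transient long bytes;

        public ThroughputBolt(int binSeconds) {
            this.binSeconds = binSeconds;
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            if (TupleUtils.isTick(tuple)) {
                // Close the bin: log (or emit) the stats, then reset the counters.
                System.out.printf("bin: %d msgs, %d bytes, %.1f msg/s%n",
                        count, bytes, (double) count / binSeconds);
                count = 0;
                bytes = 0;
            } else {
                count++;
                bytes += tuple.getStringByField("value").length();
            }
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            // Ask Storm to deliver a tick tuple to this bolt every binSeconds.
            return Collections.singletonMap(
                    Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, binSeconds);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Nothing downstream; this bolt only accumulates statistics.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka", new KafkaSpout<>(
                KafkaSpoutConfig.builder("broker:6667", "sensor-topic").build()));
        builder.setBolt("throughput", new ThroughputBolt(60))
                .shuffleGrouping("kafka");
        StormSubmitter.submitTopology("assessment", new Config(),
                builder.createTopology());
    }
}

Running that for a while and graphing the per-bin output would give the message-rate and throughput numbers discussed in the thread below.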
12.07.2016, 14:41, "[email protected]" <[email protected]>:

I can definitely give it a shot. A kickstart would be appreciated.

Jon

On Tue, Jul 12, 2016, 17:17 James Sirota <[email protected]> wrote:

Jon,

Just filed METRON-318. Is this something you would like to work on? Would you like help from us to get started?

Thanks,
James

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

12.07.2016, 11:53, "[email protected]" <[email protected]>:

Hi All,

Has there been any additional discussion or development regarding this? I did take a brief look around the JIRA and didn't see anything about it, but I may have missed it. Thanks,

Jon

On Fri, Apr 15, 2016 at 2:01 PM Nick Allen <[email protected]> wrote:

I definitely agree that you need this level of understanding of your cluster, and it could work the way that you describe.

I was thinking of it slightly differently, though. The metrics for this purpose (understanding the performance of an existing cluster) should come from the actual sensors themselves. For example, I need to instrument the packet capture process so that it kicks out time-series-ish metrics that you can monitor in a dashboard over time.

--
Nick Allen <[email protected]>

On Fri, Apr 15, 2016 at 1:40 PM, [email protected] <[email protected]> wrote:

However, it would be handy to have something like this perpetually running so you know when to scale a cluster up/out/down/in.

On Fri, Apr 15, 2016, 13:35 Nick Allen <[email protected]> wrote:

I think it is slightly different. I don't even want to install minimal Kafka infrastructure (look ma, no Kafka!).

The exact implementation would differ based on the data inputs that you are trying to measure, but for example:

- To understand raw packet rates, I would have a specialized sensor that counts packets and sizes on the wire. It does nothing more than that.
- To understand Netflow rates, it would watch for Netflow packets and count those.
- To understand sizing around application logs, a sensor would watch for Syslog packets and count those.

The implementation would be more similar to raw packet capture with some DPI. No Hadoop-y components required.

--
Nick Allen <[email protected]>

On Fri, Apr 15, 2016 at 1:10 PM, James Sirota <[email protected]> wrote:

So this is exactly what I am proposing: calculate the metrics on the fly without landing any data in the cluster. The problem is that enterprise data volumes are so large that you can't just point them at a Java or C++ program or sensor. You either need an existing minimal Kafka infrastructure to take that load, or you sample the data.

Thanks,
James
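A minimal sketch of the kind of standalone sensor Nick describes above, for the Syslog case, using nothing but the JDK; the port and the one-second reporting interval are arbitrary choices:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class SyslogRateSensor {
    public static void main(String[] args) throws Exception {
        final LongAdder packets = new LongAdder();
        final LongAdder bytes = new LongAdder();

        // Report the rate once per second, then reset the counters.
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            System.out.printf("%d pkts/s, %d bytes/s%n",
                    packets.sumThenReset(), bytes.sumThenReset());
        }, 1, 1, TimeUnit.SECONDS);

        try (DatagramSocket socket = new DatagramSocket(514)) {
            byte[] buf = new byte[64 * 1024];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            while (true) {
                socket.receive(packet);   // blocks until a datagram arrives
                packets.increment();
                bytes.add(packet.getLength());
            }
        }
    }
}

The Netflow and raw-packet variants would differ only in what they listen for and count.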
On 4/15/16, 9:54 AM, "Nick Allen" <[email protected]> wrote:

Or we have the assessment tool not actually land any data. The assessment tool becomes a 'sensor' in its own right. You just point the input data sets at the assessment tool, it builds metrics on the input (for example: count the number of packets per second), and then we use those metrics to estimate cluster size.

--
Nick Allen <[email protected]>

On Wed, Apr 13, 2016 at 5:45 PM, James Sirota <[email protected]> wrote:

That's an excellent point. So I think there are three ways forward.

One is that we can assume there has to be at least a minimal infrastructure in place (at least a subset of Kafka and Storm resources) to run a full-scale assessment. If you point something that blasts millions of messages per second at something like ActiveMQ, you are going to blow up. The infrastructure to receive these kinds of message volumes has to exist as a prerequisite. There is no way to get around that.

The second approach I see is sampling. Sampling is a lot less precise, and you can miss peaks that fall outside of your sampling windows. But the obvious benefit is that you don't need a cluster to process these streams; you can probably perform most of your calculations with a multithreaded Java program. Sampling poses a few design challenges. First, where do you sample? Do you sample on the sensor? (The implication here is that we have to program some sort of sampling capability into our sensors.) Do you sample on transport? (Maybe a Flume interceptor or a NiFi processor.) There is also the question of what the sampling rate should be. Not knowing the statistical properties of a stream ahead of time, it's hard to make that call.

The third option is an MR job. We can blast the data into HDFS and then go over it with MapReduce to derive the metrics we are looking for. Then we don't have to sample or set up expensive infrastructure to receive a deluge of data. But then we run into the chicken-and-egg problem that in order to size your HDFS you need to have data in HDFS. Ideally you need to capture at least one full week's worth of logs, because traffic patterns throughout the day, as well as across days of the week, have different statistical properties. So you need off-peak, on-peak, weekdays, and weekends to derive these stats in batch.

Any other design ideas?

Thanks,
James
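For the sampling option, a sketch of one building block such a multithreaded Java program could use: reservoir sampling keeps a fixed-size uniform random sample of, say, message sizes, so percentiles can be estimated later without storing the stream. The class name and capacity choice are illustrative:

import java.util.Random;

public class SizeReservoir {
    private final long[] reservoir;
    private long seen;
    private final Random rng = new Random();

    public SizeReservoir(int capacity) {
        this.reservoir = new long[capacity];
    }

    public synchronized void offer(long messageSizeBytes) {
        seen++;
        if (seen <= reservoir.length) {
            // Fill the reservoir until it reaches capacity.
            reservoir[(int) seen - 1] = messageSizeBytes;
        } else {
            // Replace a random slot with probability capacity/seen, which keeps
            // every message seen so far equally likely to be in the sample.
            long slot = (long) (rng.nextDouble() * seen);
            if (slot < reservoir.length) {
                reservoir[(int) slot] = messageSizeBytes;
            }
        }
    }
}

Because every message seen so far is equally likely to be in the sample, the effective sampling rate adapts to the stream volume without knowing its statistical properties up front, though peaks that fall between observations can still be missed.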
On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote:

If the tool starts at Kafka, the user would have to have already committed to the investment in the infrastructure and the time to set up the sensors that feed Kafka, and in Kafka itself. Maybe it would need to be further upstream?

On Apr 13, 2016 1:05 PM, "James Sirota" <[email protected]> wrote:

Hi George,

This article covers micro-tuning of an existing cluster. What I am proposing is a level up from that. When you start with Metron, how do you even know how many nodes you need? And of these nodes, how many do you allocate to Storm, indexing, and storage? How much storage do you need? Tuning would be the next step in the process, but this tool would answer more fundamental questions about what a Metron deployment should look like given the number of telemetries and the retention policies of the enterprise.

The best way to get this data (in my opinion) is to have some tool that we can plug into Metron's point of ingest (the Kafka topics) and run for about a week or a month, so that it can figure this out and spit out the relevant metrics. Based on these metrics we can figure out the fundamental things about what Metron should look like. Tuning would be the next step.

Thanks,
James
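If the tool plugs in at the Kafka topics as James describes, it does not necessarily need a topology at all; a sketch with a plain consumer, assuming a reasonably recent kafka-clients API, with the broker address, group id, topic, and one-minute bin as placeholders:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class TopicMeter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:6667");
        props.put("group.id", "metron-assessment");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long count = 0, bytes = 0, binStart = System.currentTimeMillis();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-topic"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    count++;
                    bytes += r.serializedValueSize();
                }
                long now = System.currentTimeMillis();
                if (now - binStart >= 60_000) {   // close a one-minute bin
                    System.out.printf("%d msgs, %d bytes in last bin%n",
                            count, bytes);
                    count = 0;
                    bytes = 0;
                    binStart = now;
                }
            }
        }
    }
}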
On 4/13/16, 9:52 AM, "George Vetticaden" <[email protected]> wrote:

I have used the following Kafka and Storm best practices guide at numerous customer implementations:

https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html

We need to have something similar and prescriptive for Metron, based on:
1. What data sources are we enabling
2. What enrichment services are we enabling
3. What threat intel services are we enabling
4. What are we indexing into Solr/Elastic, and for how long
5. What are we persisting into HDFS

Ideally, the Metron assessment tool, combined with an introspection of the user's Ansible configuration, should drive which Ambari blueprint type and configuration are used when the cluster is spun up and the Storm topology is deployed.

--
George Vetticaden
Principal, COE
[email protected]
(630) 909-9138

On 4/13/16, 11:40 AM, "George Vetticaden" <[email protected]> wrote:

+1 to James' suggestion.

For proper cluster sizing we need to consider not just the data volume and storage requirements but the processing requirements as well. Given that in the new architecture we have moved to a single enrichment topology that supports all data sources, proper sizing of the enrichment topology will be even more crucial to maintaining SLAs and HA requirements. The following key questions will apply to each parser topology and to the single enrichment topology:

1. Number of workers?
2. Number of workers per machine?
3. Size of each worker (in memory)?
4. Supervisor memory settings?

The assessment tool should be used to size topologies correctly as well.

Tuning Kafka, HBase, and Solr/Elastic should also be governed by the Metron assessment tool.

--
George Vetticaden
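For reference, a sketch of where the answers to those four questions would land in Storm configuration; the numbers are placeholders showing where each answer goes, not recommendations:

import org.apache.storm.Config;

public class EnrichmentSizing {
    public static Config topologyConfig() {
        Config conf = new Config();
        conf.setNumWorkers(4);                       // 1. number of workers
        // 2. Workers per machine is a cluster-side setting: the number of
        //    entries in supervisor.slots.ports on each supervisor node.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,   // 3. per-worker heap
                "-Xmx2048m");
        // 4. Supervisor memory is likewise set cluster-side, e.g. via
        //    supervisor.childopts in storm.yaml, not per topology.
        return conf;
    }
}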
On 4/13/16, 11:28 AM, "James Sirota" <[email protected]> wrote:

Prior to adoption of Metron, each adopting entity needs to guesstimate its data volume and data storage requirements so it can size its cluster properly. I propose the creation of an assessment tool that can plug into a Kafka topic for a given telemetry and, over time, produce statistics on ingest volumes and storage requirements. The idea is that prior to adopting Metron someone can set up all the feeds and Kafka topics, but instead of deploying Metron right away they would deploy this tool. The tool would then produce statistics on data ingest/storage requirements and all the relevant information needed for cluster sizing.

Some of the metrics that can be recorded are:

* Number of system events per second (average, max, mean, standard dev)
* Message size (average, max, mean, standard dev)
* Average number of peaks
* Duration of peaks (average, max, mean, standard dev)

If the parser for a telemetry exists, the tool can produce additional statistics:

* Number of keys/fields parsed (average, max, mean, standard dev)
* Length of field parsed (average, max, mean, standard dev)
* Length of key parsed (average, max, mean, standard dev)

The tool can run for a week or a month and produce these kinds of statistics. Once the statistics are available we can come up with guidance documentation for a recommended cluster setup. Otherwise it's hard to properly size a cluster and set up streaming parallelism without knowing these metrics.

Thoughts/ideas?

Thanks,
James
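A sketch of the per-metric accumulator those lists imply: Welford's streaming algorithm yields the average and standard deviation in one pass with constant memory, alongside the max, for quantities like events per second or message size:

public class RunningStats {
    private long n;
    private double mean;
    private double m2;      // sum of squared deviations from the mean
    private double max = Double.NEGATIVE_INFINITY;

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // uses the updated mean, per Welford
        if (x > max) {
            max = x;
        }
    }

    public double mean() { return mean; }
    public double max() { return max; }
    public double stddev() { return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0; }
}

One instance per metric, fed once per observation (for example, once per one-second bin for the events-per-second metric), would cover most of the lists above; peak counting and peak duration would need a threshold test layered on top.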
