That's roughly what I was thinking as well. Time slices:

- 1 minute (smallest unit; should catch the peaks and valleys of batch
  jobs, such as crons, that send to Kafka)
- 1 hour (a normal index size)
- 4 hours (half of a typical SOC shift)
- 24 hours (1 day)
- 168 hours (1 week)
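To make it concrete, here's a very rough, untested sketch of the 1-minute
tally plus the rollup into the larger slices (plain Java; every name here
is a placeholder I made up, and a real version would hang off a Kafka
consumer calling observe() once per record):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical sketch; not anything that exists in Metron today.
    public class TopicStatsSketch {

      // minute-since-epoch -> sensor -> {message count, total bytes}
      private final TreeMap<Long, Map<String, long[]>> bins = new TreeMap<>();

      // Called once per Kafka record.
      public void observe(long timestampMs, String sensor, int sizeBytes) {
        long minute = timestampMs / 60_000L;
        long[] bin = bins.computeIfAbsent(minute, m -> new HashMap<>())
                         .computeIfAbsent(sensor, s -> new long[2]);
        bin[0] += 1;          // messages seen in this sensor-minute
        bin[1] += sizeBytes;  // bytes seen in this sensor-minute
      }

      // Summary stats over the per-minute message counts for one sensor,
      // covering the most recent sliceMinutes (60, 240, 1440, 10080, ...).
      public void report(String sensor, int sliceMinutes) {
        if (bins.isEmpty()) {
          return;
        }
        List<Long> counts = new ArrayList<>();
        long newest = bins.lastKey();
        for (long m = newest - sliceMinutes + 1; m <= newest; m++) {
          Map<String, long[]> bySensor = bins.get(m);
          long[] bin = (bySensor == null) ? null : bySensor.get(sensor);
          counts.add(bin == null ? 0L : bin[0]);  // empty minutes count as 0
        }
        Collections.sort(counts);
        int n = counts.size();
        double mean = counts.stream().mapToLong(Long::longValue).average().orElse(0);
        double var = counts.stream()
            .mapToDouble(c -> (c - mean) * (c - mean)).sum() / n;  // population variance
        System.out.printf(
            "%s msgs/min over %dm: min=%d q1=%d med=%d q3=%d max=%d mean=%.1f sd=%.1f%n",
            sensor, sliceMinutes, counts.get(0), counts.get(n / 4), counts.get(n / 2),
            counts.get(3 * n / 4), counts.get(n - 1), mean, Math.sqrt(var));
      }
    }

Message sizes would roll up the same way from bin[1], and the bigger
slices are just different sliceMinutes values.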
Stats for 1 min:

- Number of messages, grouped by sensor.
- Size of messages, grouped by sensor.

Stats for each time slice > 1 min, with 1 min accuracy:

- Number of messages (min, 1st quartile, mean, median, 3rd quartile, max,
  std dev), grouped by sensor.
- Size of messages (min, 1st quartile, mean, median, 3rd quartile, max,
  std dev), grouped by sensor.

That should give us the typical Volume/Velocity/Variety information that
people like to see. Is there anything I'm missing that could be helpful? I
don't have a huge amount of experience with scoping Hadoop clusters in
general.

Some outstanding items that I think should be addressed as future
enhancements:

1. There needs to be a way to gauge the overhead of enrichments. I'm not
   sure of the best way to do this given the current state of the project.
2. This may not be easy enough to set up in a large environment due to
   scale. The fix would be to introduce stats based on sampling.

Thoughts?

Jon

On Tue, Jul 12, 2016, 17:53 James Sirota <[email protected]> wrote:

> Per the Apache Way, it would be desirable to put together an architecture
> proposal for the community to take a look at before implementing. I would
> propose a simple Storm topology that attaches to a Kafka topic and
> records statistics such as the number of messages and total throughput
> for n-sized time bins. What specific requirements do you have in mind?
>
> Thanks,
> James
>
> 12.07.2016, 14:41, "[email protected]" <[email protected]>:
> > I can definitely give it a shot. A kickstart would be appreciated.
> >
> > Jon
> >
> > On Tue, Jul 12, 2016, 17:17 James Sirota <[email protected]> wrote:
> >
> >> Jon,
> >>
> >> I just filed METRON-318. Is this something you would like to work on?
> >> Would you like help from us to get started?
> >>
> >> Thanks,
> >> James
> >>
> >> 12.07.2016, 11:53, "[email protected]" <[email protected]>:
> >> > Hi All,
> >> >
> >> > Has there been any additional discussion or development regarding
> >> > this? I took a brief look around the JIRA and didn't see anything,
> >> > but I may have missed it. Thanks,
> >> >
> >> > Jon
> >> >
> >> > On Fri, Apr 15, 2016 at 2:01 PM Nick Allen <[email protected]>
> >> > wrote:
> >> >
> >> >> I definitely agree that you need this level of understanding of
> >> >> your cluster. It definitely could work the way that you describe.
> >> >>
> >> >> I was thinking of it slightly differently, though. The metrics for
> >> >> this purpose (understanding the performance of an existing cluster)
> >> >> should come from the actual sensors themselves. For example, I need
> >> >> to instrument the packet capture process so that it kicks out
> >> >> time-series-ish metrics that you can monitor in a dashboard over
> >> >> time.
> >> >>
> >> >> On Fri, Apr 15, 2016 at 1:40 PM, [email protected]
> >> >> <[email protected]> wrote:
> >> >>
> >> >> > However, it would be handy to have something like this
> >> >> > perpetually running so you know when to scale a cluster
> >> >> > up/out/down/in.
> >> >> >
> >> >> > On Fri, Apr 15, 2016, 13:35 Nick Allen <[email protected]> wrote:
> >> >> >
> >> >> > > I think it is slightly different. I don't even want to install
> >> >> > > minimal Kafka infrastructure (Look ma, no Kafka!)
> >> >> > >
> >> >> > > The exact implementation would differ based on the data inputs
> >> >> > > that you are trying to measure, but for example...
> >> >> > >
> >> >> > > - To understand raw packet rates, I would have a specialized
> >> >> > >   sensor that counts packets and size on the wire. It doesn't
> >> >> > >   do anything more than that.
> >> >> > > - To understand Netflow rates, it would watch for Netflow
> >> >> > >   packets and count those.
> >> >> > > - To understand sizing around application logs, a sensor would
> >> >> > >   watch for Syslog packets and count those.
> >> >> > >
> >> >> > > The implementation would be more similar to raw packet capture
> >> >> > > with some DPI. No Hadoop-y components required.
> >> >> > >
> >> >> > > On Fri, Apr 15, 2016 at 1:10 PM, James Sirota
> >> >> > > <[email protected]> wrote:
> >> >> > >
> >> >> > > > So this is exactly what I am proposing. Calculate the metrics
> >> >> > > > on the fly without landing any data in the cluster. The
> >> >> > > > problem is that enterprise data volumes are so large you
> >> >> > > > can't just point them at a Java or a C++ program or sensor.
> >> >> > > > You either need an existing minimal Kafka infrastructure to
> >> >> > > > take that load or sample the data.
> >> >> > > >
> >> >> > > > Thanks,
> >> >> > > > James
> >> >> > > >
> >> >> > > > On 4/15/16, 9:54 AM, "Nick Allen" <[email protected]> wrote:
> >> >> > > >
> >> >> > > > >Or we have the assessment tool not actually land any data.
> >> >> > > > >The assessment tool becomes a 'sensor' in its own right. You
> >> >> > > > >just point the input data sets at the assessment tool, it
> >> >> > > > >builds metrics on the input (for example: count the number
> >> >> > > > >of packets per second) and then we use those metrics to
> >> >> > > > >estimate cluster size.
> >> >> > > > >
> >> >> > > > >On Wed, Apr 13, 2016 at 5:45 PM, James Sirota
> >> >> > > > ><[email protected]> wrote:
> >> >> > > > >
> >> >> > > > >> That's an excellent point. So I think there are three ways
> >> >> > > > >> forward.
> >> >> > > > >>
> >> >> > > > >> One is we can assume that there has to be at least a
> >> >> > > > >> minimal infrastructure in place (at least a subset of
> >> >> > > > >> Kafka and Storm resources) to run a full-scale assessment.
> >> >> > > > >> If you point something that blasts millions of messages
> >> >> > > > >> per second at something like ActiveMQ you are going to
> >> >> > > > >> blow up. So the infrastructure to at least receive these
> >> >> > > > >> kinds of message volumes has to exist as a prerequisite.
> >> >> > > > >> There is no way to get around that.
> >> >> > > > >>
> >> >> > > > >> The second approach I see is sampling. Sampling is a lot
> >> >> > > > >> less precise, and you can miss peaks that fall outside of
> >> >> > > > >> your sampling windows. But the obvious benefit is that you
> >> >> > > > >> don't need a cluster to process these streams. You can
> >> >> > > > >> probably perform most of your calculations with a
> >> >> > > > >> multithreaded Java program. Sampling poses a few design
> >> >> > > > >> challenges. First, where do you sample? Do you sample on
> >> >> > > > >> the sensor?
> >> >> > > > >> (The implication here is that we have to program some
> >> >> > > > >> sort of sampling capability into our sensors.) Do you
> >> >> > > > >> sample on transport (maybe a Flume interceptor or NiFi
> >> >> > > > >> processor)? There is also the question of what the
> >> >> > > > >> sampling rate should be. Not knowing the statistical
> >> >> > > > >> properties of a stream ahead of time, it's hard to make
> >> >> > > > >> that call.
> >> >> > > > >>
> >> >> > > > >> The third option, I think, is an MR job. We can blast the
> >> >> > > > >> data into HDFS and then go over it with MR to derive the
> >> >> > > > >> metrics we are looking for. Then we don't have to sample
> >> >> > > > >> or set up expensive infrastructure to receive a deluge of
> >> >> > > > >> data. But then we run into the chicken-and-egg problem
> >> >> > > > >> that in order to size your HDFS you need to have data in
> >> >> > > > >> HDFS. Ideally you need to capture at least one full week's
> >> >> > > > >> worth of logs, because patterns throughout the day, as
> >> >> > > > >> well as on each day of the week, have different
> >> >> > > > >> statistical properties. So you need off-peak, on-peak,
> >> >> > > > >> weekdays, and weekends to derive these stats in batch.
> >> >> > > > >>
> >> >> > > > >> Any other design ideas?
> >> >> > > > >>
> >> >> > > > >> Thanks,
> >> >> > > > >> James
> >> >> > > > >>
> >> >> > > > >> On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote:
> >> >> > > > >>
> >> >> > > > >> >If the tool starts at Kafka, the user would have to
> >> >> > > > >> >already have committed to the investment in the
> >> >> > > > >> >infrastructure and the time to set up the sensors that
> >> >> > > > >> >feed Kafka, and Kafka itself. Maybe it would need to be
> >> >> > > > >> >further upstream?
> >> >> > > > >> >On Apr 13, 2016 1:05 PM, "James Sirota"
> >> >> > > > >> ><[email protected]> wrote:
> >> >> > > > >> >
> >> >> > > > >> >> Hi George,
> >> >> > > > >> >>
> >> >> > > > >> >> This article describes micro-tuning of an existing
> >> >> > > > >> >> cluster. What I am proposing is a level up from that.
> >> >> > > > >> >> When you start with Metron, how do you even know how
> >> >> > > > >> >> many nodes you need? And of these nodes, how many do
> >> >> > > > >> >> you allocate to Storm, indexing, storage? How much
> >> >> > > > >> >> storage do you need? Tuning would be the next step in
> >> >> > > > >> >> the process, but this tool would answer more
> >> >> > > > >> >> fundamental questions about what a Metron deployment
> >> >> > > > >> >> should look like given the number of telemetries and
> >> >> > > > >> >> retention policies of the enterprise.
> >> >> > > > >> >>
> >> >> > > > >> >> The best way to get this data (in my opinion) is to
> >> >> > > > >> >> have some tool that we can plug into Metron's point of
> >> >> > > > >> >> ingest (Kafka topics) and run for about a week or a
> >> >> > > > >> >> month to figure that out and spit out these relevant
> >> >> > > > >> >> metrics. Based on these metrics we can figure out the
> >> >> > > > >> >> fundamental things about what Metron should look like.
> >> >> > > > >> >> Tuning would be the next step.
> >> >> > > > >> >>
> >> >> > > > >> >> Thanks,
> >> >> > > > >> >> James
> >> >> > > > >> >>
> >> >> > > > >> >> On 4/13/16, 9:52 AM, "George Vetticaden"
> >> >> > > > >> >> <[email protected]> wrote:
> >> >> > > > >> >>
> >> >> > > > >> >> >I have used the following Kafka and Storm best
> >> >> > > > >> >> >practices guide at numerous customer implementations:
> >> >> > > > >> >> >
> >> >> > > > >> >> >https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html
> >> >> > > > >> >> >
> >> >> > > > >> >> >We need to have something similar and prescriptive for
> >> >> > > > >> >> >Metron based on:
> >> >> > > > >> >> >1. What data sources we are enabling
> >> >> > > > >> >> >2. What enrichment services we are enabling
> >> >> > > > >> >> >3. What threat intel services we are enabling
> >> >> > > > >> >> >4. What we are indexing into Solr/Elastic, and for how
> >> >> > > > >> >> >   long
> >> >> > > > >> >> >5. What we are persisting into HDFS
> >> >> > > > >> >> >
> >> >> > > > >> >> >Ideally, the Metron assessment tool combined with an
> >> >> > > > >> >> >introspection of the user's Ansible configuration
> >> >> > > > >> >> >should drive what Ambari blueprint type and
> >> >> > > > >> >> >configuration should be used when the cluster is spun
> >> >> > > > >> >> >up and the Storm topology is deployed.
> >> >> > > > >> >> >
> >> >> > > > >> >> >--
> >> >> > > > >> >> >George Vetticaden
> >> >> > > > >> >> >Principal, COE
> >> >> > > > >> >> >[email protected]
> >> >> > > > >> >> >(630) 909-9138
> >> >> > > > >> >> >
> >> >> > > > >> >> >On 4/13/16, 11:40 AM, "George Vetticaden"
> >> >> > > > >> >> ><[email protected]> wrote:
> >> >> > > > >> >> >
> >> >> > > > >> >> >>+1 to James' suggestion.
> >> >> > > > >> >> >>We also need to consider not just the data volume and
> >> >> > > > >> >> >>storage requirements for proper cluster sizing, but
> >> >> > > > >> >> >>processing requirements as well. Given that in the
> >> >> > > > >> >> >>new architecture we have moved to a single enrichment
> >> >> > > > >> >> >>topology that will support all data sources, proper
> >> >> > > > >> >> >>sizing of the enrichment topology will be even more
> >> >> > > > >> >> >>crucial to maintain SLAs and HA requirements.
> >> >> > > > >> >> >>
> >> >> > > > >> >> >>The following key questions will apply to each
> >> >> > > > >> >> >>parser topology and to the single enrichment
> >> >> > > > >> >> >>topology:
> >> >> > > > >> >> >>
> >> >> > > > >> >> >>1. Number of workers?
> >> >> > > > >> >> >>2. Number of workers per machine?
> >> >> > > > >> >> >>3. Size of each worker (in memory)?
> >> >> > > > >> >> >>4. Supervisor memory settings?
> >> >> > > > >> >> >>
> >> >> > > > >> >> >>The assessment tool should also be used to size the
> >> >> > > > >> >> >>topologies correctly.
> >> >> > > > >> >> >>
> >> >> > > > >> >> >>Tuning Kafka, HBase, and Solr/Elastic should also be
> >> >> > > > >> >> >>governed by the Metron assessment tool.
> >> >> > > > >> >> >>
> >> >> > > > >> >> >>--
> >> >> > > > >> >> >>George Vetticaden
> >> >> > > > >> >> >>
> >> >> > > > >> >> >>On 4/13/16, 11:28 AM, "James Sirota"
> >> >> > > > >> >> >><[email protected]> wrote:
> >> >> > > > >> >> >>
> >> >> > > > >> >> >>>Prior to adoption of Metron, each adopting entity
> >> >> > > > >> >> >>>needs to guesstimate its data volume and data
> >> >> > > > >> >> >>>storage requirements so they can size their cluster
> >> >> > > > >> >> >>>properly. I propose the creation of an assessment
> >> >> > > > >> >> >>>tool that can plug in to a Kafka topic for a given
> >> >> > > > >> >> >>>telemetry and over time produce statistics for
> >> >> > > > >> >> >>>ingest volumes and storage requirements. The idea is
> >> >> > > > >> >> >>>that prior to adoption of Metron someone can set up
> >> >> > > > >> >> >>>all the feeds and Kafka topics, but instead of
> >> >> > > > >> >> >>>deploying Metron right away they would deploy this
> >> >> > > > >> >> >>>tool. This tool would then produce statistics for
> >> >> > > > >> >> >>>data ingest/storage requirements, and all relevant
> >> >> > > > >> >> >>>information needed for cluster sizing.
> >> >> > > > >> >> >>>
> >> >> > > > >> >> >>>Some of the metrics that can be recorded are:
> >> >> > > > >> >> >>>
> >> >> > > > >> >> >>> * Number of system events per second (average, max,
> >> >> > > > >> >> >>>   mean, standard dev)
> >> >> > > > >> >> >>> * Message size (average, max, mean, standard dev)
> >> >> > > > >> >> >>> * Average number of peaks
> >> >> > > > >> >> >>> * Duration of peaks (average, max, mean, standard
> >> >> > > > >> >> >>>   dev)
> >> >> > > > >> >> >>>
> >> >> > > > >> >> >>>If the parser for a telemetry exists, the tool can
> >> >> > > > >> >> >>>produce additional statistics:
> >> >> > > > >> >> >>>
> >> >> > > > >> >> >>> * Number of keys/fields parsed (average, max, mean,
> >> >> > > > >> >> >>>   standard dev)
> >> >> > > > >> >> >>> * Length of field parsed (average, max, mean,
> >> >> > > > >> >> >>>   standard dev)
> >> >> > > > >> >> >>> * Length of key parsed (average, max, mean,
> >> >> > > > >> >> >>>   standard dev)
> >> >> > > > >> >> >>>
> >> >> > > > >> >> >>>The tool can run for a week or a month and produce
> >> >> > > > >> >> >>>these kinds of statistics.
Then once the statistics are available we > >> can > >> >> > come > >> >> > > up > >> >> > > > >> with > >> >> > > > >> >> a > >> >> > > > >> >> >>>guidance documentation of recommended cluster setup. > >> >> > Otherwise > >> >> > > > it¹s > >> >> > > > >> >> hard > >> >> > > > >> >> >>>to properly size a cluster and setup streaming > >> parallelism > >> >> not > >> >> > > > >> knowing > >> >> > > > >> >> >>>these metrics. > >> >> > > > >> >> >>> > >> >> > > > >> >> >>> > >> >> > > > >> >> >>>Thoughts/ideas? > >> >> > > > >> >> >>> > >> >> > > > >> >> >>>Thanks, > >> >> > > > >> >> >>>James > >> >> > > > >> >> >> > >> >> > > > >> >> >> > >> >> > > > >> >> > > >> >> > > > >> >> > >> >> > > > >> > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > >-- > >> >> > > > >Nick Allen <[email protected]> > >> >> > > > > >> >> > > > >> >> > > > >> >> > > > >> >> > > -- > >> >> > > Nick Allen <[email protected]> > >> >> > > > >> >> > -- > >> >> > > >> >> > Jon > >> >> > > >> >> > >> >> -- > >> >> Nick Allen <[email protected]> > >> > -- > >> > > >> > Jon > >> > >> ------------------- > >> Thank you, > >> > >> James Sirota > >> PPMC- Apache Metron (Incubating) > >> jsirota AT apache DOT org > > -- > > > > Jon > > ------------------- > Thank you, > > James Sirota > PPMC- Apache Metron (Incubating) > jsirota AT apache DOT org > -- Jon
