However, it would be handy to have something like this perpetually running so you know when to scale up/out/down/in a cluster.
On Fri, Apr 15, 2016, 13:35 Nick Allen <[email protected]> wrote: > I think it is slightly different. I don't even want to install minimal > Kafka infrastructure (Look ma, no Kafka!) > > The exact implementation would differ based on the data inputs that you are > trying to measure, but for example... > > - To understand raw packet rates I would have a specialized sensor that > counts packets and size on the wire. It doesn't do anything more than > that. > - To understand Netflow rates, it would watch for Netflow packets and > count those. > - To understand sizing around application logs, a sensor would watch for > Syslog packets and count those. > > The implementation would be more similar to raw packet capture with some > DPI. No Hadoop-y components required. > > > > On Fri, Apr 15, 2016 at 1:10 PM, James Sirota <[email protected]> > wrote: > > > So this is exactly what I am proposing. Calculate the metrics on the fly > > without landing any data in the cluster. The problem is that that > > enterprise data volumes are so large you can’t just point them at a Java > or > > a C++ program or sensor. You either need an existing minimal Kafka > > infrastructure to take that load or sample the data. > > > > Thanks, > > James > > > > > > > > > > On 4/15/16, 9:54 AM, "Nick Allen" <[email protected]> wrote: > > > > >Or we have the assessment tool not actually land any data. The > assessment > > >tool becomes a 'sensor' in its own right. You just point the input data > > >sets at the assessment tool, it builds metrics on the input (for > example: > > >count the number of packets per second) and then we use those metrics to > > >estimate cluster size. > > > > > >On Wed, Apr 13, 2016 at 5:45 PM, James Sirota <[email protected]> > > >wrote: > > > > > >> That’s an excellent point. So I think there are three ways forward. > > >> > > >> One is we can assume that there has to be at least a minimal > > >> infrastructure in place (at least a subset of Kafka and Storm > > resources) to > > >> run a full-scale assessment. If you point something that blasts > > millions > > >> of messages per second at something like ActiveMQ you are going to > blow > > >> up. So the infrastructure to at least receive these kinds of message > > >> volumes has to exist as a pre-requisite. There is no way to get around > > that. > > >> > > >> The second approach I see is sampling. Sampling is a lot less precise > > and > > >> you can miss peaks that fall outside of your sampling windows. But > the > > >> obvious benefit is that you don’t need a cluster to process these > > streams. > > >> You can probably perform most of your calculations with a > multithreaded > > >> java program. Sampling poses a few design challenges. First, where > do > > you > > >> sample? Do you sample on the sensor? (the implication here is that we > > have > > >> to program some sort of sampling capability in our sensors) . Do you > > sample > > >> on transport? (maybe a Flume interceptor or NiFi processor). There is > > also > > >> a question of what the sampling rate should be. Not knowing > statistical > > >> properties of a stream ahead of time it’s hard to make that call. > > >> > > >> The third option I think is MR job. We can blast the data into HDFS > and > > >> then go over it with MR to derive the metrics we are looking for. > Then > > we > > >> don’t have to sample or setup expensive infrastructure to receive a > > deluge > > >> of data. But then we run into the chicken and the egg problem that in > > >> order to size your HDFS you need to have data in HDFS. Ideally you > > need to > > >> capture at least one full weeks worth of logs because patterns > > throughout > > >> the day as well as every day of the week have different statistical > > >> properties. So you need off peak, on peak, weekdays and weekends to > > derive > > >> these stats in batch. > > >> > > >> Any other design ideas? > > >> > > >> Thanks, > > >> James > > >> > > >> > > >> > > >> > > >> > > >> On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote: > > >> > > >> >If the tool starts at Kafka, the user would have to already have > > committed > > >> >to the investment in the infrastructure and time to setup the sensors > > that > > >> >feed Kafka and Kafka itself. Maybe it would need to be further > > upstream? > > >> >On Apr 13, 2016 1:05 PM, "James Sirota" <[email protected]> > > wrote: > > >> > > > >> >> Hi Goerge, > > >> >> > > >> >> This article defines micro-tuning of the existing cluster. What I > am > > >> >> proposing is a level up from that. When you start with Metron how > do > > >> you > > >> >> even know how many nodes you need? And of these nodes how many do > > you > > >> >> allocate to Storm, indexing, storage? How much storage do you > need? > > >> >> Tuning would be the next step in the process, but this tool would > > answer > > >> >> more fundamental questions about what a Metron deployment should > look > > >> like > > >> >> given the number of telemetries and retention policies of the > > >> enterprise. > > >> >> > > >> >> The best way to get this data (in my opinion) is to have some tool > > that > > >> we > > >> >> can plug into Metron’s point of ingest (kafka topics) and run that > > for > > >> >> about a week or a month to be able to figure that out and spit out > > these > > >> >> relevant metrics. Based on these metrics we can figure out the > > >> fundamental > > >> >> things about what metron should look like. Tuning would be the > next > > >> step. > > >> >> > > >> >> Thanks, > > >> >> James > > >> >> > > >> >> > > >> >> > > >> >> > > >> >> On 4/13/16, 9:52 AM, "George Vetticaden" < > > [email protected]> > > >> >> wrote: > > >> >> > > >> >> >I have used the following Kafka and Storm Best Practices guide at > > >> numerous > > >> >> >customer implementations. > > >> >> > > > >> >> > > >> > > > https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-b > > >> >> >est-practices-guide.html > > >> >> > > > >> >> > > > >> >> >We need to have something similar and prescriptive for Metron > based > > on: > > >> >> >1. What data sources are we enabling > > >> >> >2. What enrichment services are we enabling > > >> >> >3. What threat intel services are we enabling > > >> >> >4. What are we indexing into Solr/Elastic and how long > > >> >> >5. What are we persisting into HDFS.. > > >> >> > > > >> >> >Ideally, the The metron assessment tool combined with an > > introspection > > >> of > > >> >> >the user’s ansible configuration should drive what ambari > blueprint > > >> type > > >> >> >and configuration should be used when the cluster is spun up and > the > > >> storm > > >> >> >topology is deployed. > > >> >> > > > >> >> > > > >> >> >-- > > >> >> >George VetticadenPrincipal, COE > > >> >> >[email protected] > > >> >> >(630) 909-9138 > > >> >> > > > >> >> > > > >> >> > > > >> >> > > > >> >> > > > >> >> >On 4/13/16, 11:40 AM, "George Vetticaden" < > > [email protected] > > >> > > > >> >> >wrote: > > >> >> > > > >> >> >>+ 1 to James suggestion. > > >> >> >>We also need to consider not just the data volume and storage > > >> >> requirements > > >> >> >>for proper cluster sizing but also processing requirements as > well. > > >> Given > > >> >> >>that in the new architecture, we have moved to single enrichment > > >> topology > > >> >> >>that will support all data sources, proper sizing of the > enrichment > > >> >> >>topology will be even more crucial to maintain SLAs and HA > > >> requirements. > > >> >> >>The following key questions will apply to each parser topology > and > > >> single > > >> >> >>enrichment topology > > >> >> >> > > >> >> >>1. Number of workers? > > >> >> >>2. Number of workers per machine? > > >> >> >>3. Size of each workers (in memory)? > > >> >> >>4. Supervisor memory settings > > >> >> >> > > >> >> >>The assessment tool should also be used to size topologies > > correctly > > >> as > > >> >> >>well. > > >> >> >> > > >> >> >>Tuning Kafka, Hbase and Solr/Elastic should also be governed by > the > > >> >> Metron > > >> >> >>assessment tool. > > >> >> >> > > >> >> >> > > >> >> >>-- > > >> >> >>George Vetticaden > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >>On 4/13/16, 11:28 AM, "James Sirota" <[email protected]> > > wrote: > > >> >> >> > > >> >> >>>Prior to adoption of Metron each adopting entity needs to > > guesstimate > > >> >> >>>it¹s data volume and data storage requirements so they can size > > their > > >> >> >>>cluster properly. I propose a creation of an assessment tool > that > > >> can > > >> >> >>>plug in to a Kafka topic for a given telemetry and over time > > produce > > >> >> >>>statistics for ingest volumes and storage requirement. The idea > > is > > >> that > > >> >> >>>prior to adoption of Metron someone can set up all the feeds and > > >> kafka > > >> >> >>>topics, but instead of deploying Metron right away they would > > deploy > > >> >> this > > >> >> >>>tool. This tool would then produce statistics for data > > >> ingest/storage > > >> >> >>>requirement, and all relevant information needed for cluster > > sizing. > > >> >> >>> > > >> >> >>>Some of the metrics that can be recorded are: > > >> >> >>> > > >> >> >>> * Number of system events per second (average, max, mean, > > >> standard > > >> >> >>>dev) > > >> >> >>> * Message size (average, max, mean, standard dev) > > >> >> >>> * Average number of peaks > > >> >> >>> * Duration of peaks (average, max, mean, standard dev) > > >> >> >>> > > >> >> >>>If the parser for a telemetry exist the tool can produce > > additional > > >> >> >>>statistics > > >> >> >>> > > >> >> >>> * Number of keys/fields parsed (average, max, mean, standard > > dev) > > >> >> >>> * Length of field parsed (average, max, mean, standard dev) > > >> >> >>> * Length of key parsed (average, max, mean, standard dev) > > >> >> >>> > > >> >> >>>The tool can run for a week or a month and produce these kinds > of > > >> >> >>>statistics. Then once the statistics are available we can come > up > > >> with > > >> >> a > > >> >> >>>guidance documentation of recommended cluster setup. Otherwise > > it¹s > > >> >> hard > > >> >> >>>to properly size a cluster and setup streaming parallelism not > > >> knowing > > >> >> >>>these metrics. > > >> >> >>> > > >> >> >>> > > >> >> >>>Thoughts/ideas? > > >> >> >>> > > >> >> >>>Thanks, > > >> >> >>>James > > >> >> >> > > >> >> >> > > >> >> > > > >> >> > > >> > > > > > > > > > > > >-- > > >Nick Allen <[email protected]> > > > > > > -- > Nick Allen <[email protected]> > -- Jon
