So this is exactly what I am proposing: calculate the metrics on the fly without landing any data in the cluster. The problem is that enterprise data volumes are so large that you can't just point them at a Java or C++ program acting as a sensor. You either need an existing minimal Kafka infrastructure to take that load, or you need to sample the data.
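
To make this concrete, here is a rough sketch of the kind of consumer I have in mind, assuming that minimal Kafka front end is in place. It keeps only running aggregates (Welford's online algorithm) in memory and discards each message after measuring it, so nothing lands in the cluster. The broker address and topic name are placeholders, and the API is the plain kafka-clients consumer:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamAssessor {

    // Welford's online algorithm: running mean/variance/max of message size,
    // computed without retaining any messages.
    private long count = 0;
    private long maxSize = 0;
    private double mean = 0.0;
    private double m2 = 0.0;

    void observe(long size) {
        count++;
        double delta = size - mean;
        mean += delta / count;
        m2 += delta * (size - mean);
        maxSize = Math.max(maxSize, size);
    }

    double stddev() {
        return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "metron-assessment");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        StreamAssessor stats = new StreamAssessor();
        long windowStart = System.currentTimeMillis();
        long eventsInWindow = 0;

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-telemetry")); // placeholder topic
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    stats.observe(record.value().length); // measure, then drop
                    eventsInWindow++;
                }
                long now = System.currentTimeMillis();
                if (now - windowStart >= 1000) { // report roughly once a second
                    System.out.printf("events/sec=%d size(mean=%.1f max=%d stddev=%.1f)%n",
                        eventsInWindow, stats.mean, stats.maxSize, stats.stddev());
                    eventsInWindow = 0;
                    windowStart = now;
                }
            }
        }
    }
}

The same per-second counts could also feed the peak statistics discussed further down the thread.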
Thanks,
James

On 4/15/16, 9:54 AM, "Nick Allen" <[email protected]> wrote:

>Or we have the assessment tool not actually land any data. The assessment
>tool becomes a 'sensor' in its own right. You just point the input data
>sets at the assessment tool, it builds metrics on the input (for example:
>count the number of packets per second), and then we use those metrics to
>estimate cluster size.
>
>On Wed, Apr 13, 2016 at 5:45 PM, James Sirota <[email protected]> wrote:
>
>> That's an excellent point. So I think there are three ways forward.
>>
>> One is we can assume that there has to be at least a minimal
>> infrastructure in place (at least a subset of Kafka and Storm resources)
>> to run a full-scale assessment. If you point something that blasts
>> millions of messages per second at something like ActiveMQ, you are
>> going to blow up. So the infrastructure to receive these kinds of
>> message volumes has to exist as a prerequisite. There is no way to get
>> around that.
>>
>> The second approach I see is sampling. Sampling is a lot less precise,
>> and you can miss peaks that fall outside of your sampling windows. But
>> the obvious benefit is that you don't need a cluster to process these
>> streams. You can probably perform most of your calculations with a
>> multithreaded Java program. Sampling poses a few design challenges.
>> First, where do you sample? Do you sample on the sensor (the implication
>> being that we would have to program some sort of sampling capability
>> into our sensors)? Do you sample on transport (maybe a Flume interceptor
>> or a NiFi processor)? There is also the question of what the sampling
>> rate should be. Not knowing the statistical properties of a stream ahead
>> of time, it's hard to make that call.
>>
>> The third option, I think, is an MR job. We can blast the data into HDFS
>> and then go over it with MR to derive the metrics we are looking for.
>> Then we don't have to sample or set up expensive infrastructure to
>> receive a deluge of data. But then we run into the chicken-and-egg
>> problem that in order to size your HDFS you need to have data in HDFS.
>> Ideally you need to capture at least one full week's worth of logs,
>> because patterns throughout the day, as well as on each day of the week,
>> have different statistical properties. So you need off-peak, on-peak,
>> weekday, and weekend data to derive these stats in batch.
>>
>> Any other design ideas?
>>
>> Thanks,
>> James
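
One rate-free way around the sampling-rate question in the message above is a uniform reservoir sample, which keeps a fixed-size random sample of the stream without knowing its statistical properties or length up front. A minimal sketch (Vitter's Algorithm R; the capacity is arbitrary, and the class could sit inside a sensor, a Flume interceptor, or a NiFi processor):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Uniform reservoir sampler (Vitter's Algorithm R). After n items have been
// offered, every item seen so far has an equal capacity/n chance of being in
// the reservoir, so no sampling rate has to be chosen in advance.
public class ReservoirSampler<T> {

    private final List<T> reservoir;
    private final int capacity;
    private final Random rng = new Random();
    private long seen = 0;

    public ReservoirSampler(int capacity) {
        this.capacity = capacity;
        this.reservoir = new ArrayList<>(capacity);
    }

    public void offer(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);           // fill the reservoir first
        } else {
            long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
            if (j < capacity) {
                reservoir.set((int) j, item); // replace with decreasing probability
            }
        }
    }

    public List<T> sample() {
        return new ArrayList<>(reservoir); // defensive copy of the current sample
    }
}

Per the caveat above, a uniform sample still smooths over short peaks; it only removes the need to pick a sampling rate ahead of time.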
>> On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote:
>>
>> >If the tool starts at Kafka, the user would have to already have
>> >committed to the investment in the infrastructure and the time to set
>> >up the sensors that feed Kafka, as well as Kafka itself. Maybe it would
>> >need to be further upstream?
>> >On Apr 13, 2016 1:05 PM, "James Sirota" <[email protected]> wrote:
>> >
>> >> Hi George,
>> >>
>> >> This article describes micro-tuning of an existing cluster. What I am
>> >> proposing is a level up from that. When you start with Metron, how do
>> >> you even know how many nodes you need? And of those nodes, how many
>> >> do you allocate to Storm, indexing, and storage? How much storage do
>> >> you need? Tuning would be the next step in the process, but this tool
>> >> would answer more fundamental questions about what a Metron
>> >> deployment should look like given the number of telemetries and the
>> >> retention policies of the enterprise.
>> >>
>> >> The best way to get this data (in my opinion) is to have some tool
>> >> that we can plug into Metron's point of ingest (Kafka topics) and run
>> >> for about a week or a month to figure that out and spit out the
>> >> relevant metrics. Based on these metrics we can figure out the
>> >> fundamental things about what Metron should look like. Tuning would
>> >> be the next step.
>> >>
>> >> Thanks,
>> >> James
>> >>
>> >> On 4/13/16, 9:52 AM, "George Vetticaden" <[email protected]> wrote:
>> >>
>> >> >I have used the following Kafka and Storm best practices guide at
>> >> >numerous customer implementations.
>> >> >
>> >> >https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html
>> >> >
>> >> >We need to have something similar and prescriptive for Metron based
>> >> >on:
>> >> >1. What data sources are we enabling
>> >> >2. What enrichment services are we enabling
>> >> >3. What threat intel services are we enabling
>> >> >4. What are we indexing into Solr/Elastic, and for how long
>> >> >5. What are we persisting into HDFS
>> >> >
>> >> >Ideally, the Metron assessment tool, combined with an introspection
>> >> >of the user's Ansible configuration, should drive which Ambari
>> >> >blueprint type and configuration are used when the cluster is spun
>> >> >up and the Storm topology is deployed.
>> >> >
>> >> >--
>> >> >George Vetticaden
>> >> >Principal, COE
>> >> >[email protected]
>> >> >(630) 909-9138
>> >> >
>> >> >On 4/13/16, 11:40 AM, "George Vetticaden" <[email protected]> wrote:
>> >> >
>> >> >>+1 to James' suggestion.
>> >> >>We also need to consider not just the data volume and storage
>> >> >>requirements for proper cluster sizing, but processing requirements
>> >> >>as well. Given that in the new architecture we have moved to a
>> >> >>single enrichment topology that will support all data sources,
>> >> >>proper sizing of the enrichment topology will be even more crucial
>> >> >>to maintain SLA and HA requirements. The following key questions
>> >> >>will apply to each parser topology and to the single enrichment
>> >> >>topology:
>> >> >>
>> >> >>1. Number of workers?
>> >> >>2. Number of workers per machine?
>> >> >>3. Size of each worker (in memory)?
>> >> >>4. Supervisor memory settings?
>> >> >>
>> >> >>The assessment tool should be used to size topologies correctly as
>> >> >>well.
>> >> >>
>> >> >>Tuning Kafka, HBase, and Solr/Elastic should also be governed by
>> >> >>the Metron assessment tool.
>> >> >>
>> >> >>--
>> >> >>George Vetticaden
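
George's four sizing questions map directly onto a handful of Storm Config knobs, which is exactly the output the assessment tool could emit. A minimal sketch (the backtype.storm package is the Storm 0.10 naming, and the parameters are placeholders for whatever the tool measures):

import backtype.storm.Config;

public class TopologySizing {

    // Build a topology Config from assessed metrics. Every number passed in
    // here is a placeholder for a value the assessment tool would recommend.
    public static Config fromAssessment(int numWorkers, int workerHeapMb, int maxSpoutPending) {
        Config conf = new Config();
        conf.setNumWorkers(numWorkers);            // Q1: number of workers
        conf.setNumAckers(numWorkers);             // one acker per worker is a common start
        conf.setMaxSpoutPending(maxSpoutPending);  // throttle the Kafka spout to protect enrichment
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, // Q3: per-worker heap size
            "-Xmx" + workerHeapMb + "m");
        return conf;
    }
}

Workers per machine (Q2) and supervisor memory (Q4) are cluster-side settings in storm.yaml rather than topology Config, but the tool could recommend those numbers in the same report.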
>> >> >>On 4/13/16, 11:28 AM, "James Sirota" <[email protected]> wrote:
>> >> >>
>> >> >>>Prior to adoption of Metron, each adopting entity needs to
>> >> >>>guesstimate its data volume and data storage requirements so it
>> >> >>>can size its cluster properly. I propose the creation of an
>> >> >>>assessment tool that can plug into a Kafka topic for a given
>> >> >>>telemetry and over time produce statistics on ingest volumes and
>> >> >>>storage requirements. The idea is that prior to adoption of Metron
>> >> >>>someone can set up all the feeds and Kafka topics, but instead of
>> >> >>>deploying Metron right away they would deploy this tool. This tool
>> >> >>>would then produce statistics on data ingest/storage requirements
>> >> >>>and all the relevant information needed for cluster sizing.
>> >> >>>
>> >> >>>Some of the metrics that can be recorded are:
>> >> >>>
>> >> >>> * Number of system events per second (average, max, mean,
>> >> >>>standard dev)
>> >> >>> * Message size (average, max, mean, standard dev)
>> >> >>> * Average number of peaks
>> >> >>> * Duration of peaks (average, max, mean, standard dev)
>> >> >>>
>> >> >>>If the parser for a telemetry exists, the tool can produce
>> >> >>>additional statistics:
>> >> >>>
>> >> >>> * Number of keys/fields parsed (average, max, mean, standard dev)
>> >> >>> * Length of field parsed (average, max, mean, standard dev)
>> >> >>> * Length of key parsed (average, max, mean, standard dev)
>> >> >>>
>> >> >>>The tool can run for a week or a month and produce these kinds of
>> >> >>>statistics. Once the statistics are available, we can come up with
>> >> >>>guidance documentation on a recommended cluster setup. Otherwise
>> >> >>>it's hard to properly size a cluster and set up streaming
>> >> >>>parallelism without knowing these metrics.
>> >> >>>
>> >> >>>Thoughts/ideas?
>> >> >>>
>> >> >>>Thanks,
>> >> >>>James
>
>
>--
>Nick Allen <[email protected]>
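
Of the proposed metrics, the peak statistics are the least obvious to compute. A minimal sketch of one possible definition, treating a "peak" as any run of seconds whose event rate exceeds the running mean by two standard deviations (an arbitrary threshold, purely for illustration):

import java.util.ArrayList;
import java.util.List;

// Tracks the number and duration of peaks in a stream of per-second event
// counts. "Peak" is defined here as rate > mean + 2*stddev, which is an
// assumption of this sketch rather than anything settled in the thread.
public class PeakTracker {

    private final List<Long> peakDurations = new ArrayList<>();
    private long currentPeakStart = -1; // -1 means "not currently in a peak"

    // Feed one per-second count along with the running mean/stddev of the rate.
    public void observe(long second, double eventsPerSec, double mean, double stddev) {
        boolean inPeak = eventsPerSec > mean + 2 * stddev;
        if (inPeak && currentPeakStart < 0) {
            currentPeakStart = second;                    // a peak begins
        } else if (!inPeak && currentPeakStart >= 0) {
            peakDurations.add(second - currentPeakStart); // a peak ends
            currentPeakStart = -1;
        }
    }

    public int peakCount() {
        return peakDurations.size();
    }

    public double averagePeakDurationSeconds() {
        return peakDurations.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }
}

The per-second rates (and their running mean/stddev) could come straight from the consumer sketch earlier in the thread, so a single pass over the stream yields both sets of metrics.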
