That’s an excellent point.  So I think there are three ways forward.  

One is to assume that there has to be at least a minimal infrastructure in
place (at least a subset of Kafka and Storm resources) to run a full-scale
assessment.  If you point something that blasts millions of messages per
second at something like ActiveMQ you are going to blow it up.  So the
infrastructure to receive these kinds of message volumes has to exist as a
prerequisite.  There is no way to get around that.
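
Just to put rough, purely illustrative numbers on it: a feed of 1,000,000
messages/sec at ~500 bytes per message works out to roughly 500 MB/sec of
sustained ingest, which is not something a single broker or a JMS-style
queue is going to absorb, so a small multi-partition Kafka cluster is the
realistic floor for a full-scale assessment.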

The second approach I see is sampling.  Sampling is a lot less precise and
you can miss peaks that fall outside of your sampling windows.  But the
obvious benefit is that you don’t need a cluster to process these streams.
You can probably perform most of your calculations with a multithreaded
Java program.  Sampling poses a few design challenges.  First, where do you
sample?  Do you sample on the sensor?  (The implication here is that we
would have to build some sort of sampling capability into our sensors.)  Do
you sample on transport?  (Maybe a Flume interceptor or NiFi processor.)
There is also a question of what the sampling rate should be.  Without
knowing the statistical properties of a stream ahead of time, it’s hard to
make that call.
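
To make the transport-side option a bit more concrete, here is a minimal
sketch of what I have in mind: a single process that reads a topic with the
plain kafka-clients consumer and keeps running size stats on every Nth
message.  The broker address, topic name and the 1-in-100 rate are just
placeholders, and it assumes a kafka-clients 2.x style API.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicSampler {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker:9092");        // placeholder
    props.put("group.id", "metron-assessment-sampler");   // placeholder
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    final int sampleEvery = 100;   // 1-in-N sampling rate, a pure guess
    long seen = 0, sampled = 0, sumBytes = 0, maxBytes = 0;

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("sensor-topic"));  // placeholder
      while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<byte[], byte[]> r : records) {
          seen++;
          if (seen % sampleEvery != 0) continue;   // systematic 1-in-N sample
          sampled++;
          long size = r.value() == null ? 0 : r.value().length;
          sumBytes += size;
          maxBytes = Math.max(maxBytes, size);
        }
        if (sampled > 0) {
          System.out.printf("seen=%d sampled=%d avgBytes=%.1f maxBytes=%d%n",
              seen, sampled, (double) sumBytes / sampled, maxBytes);
        }
      }
    }
  }
}

Something along these lines could later grow per-minute buckets to catch
peaks, or be folded into a NiFi processor if we go the transport route.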

The third option, I think, is an MR job.  We can blast the data into HDFS
and then go over it with MR to derive the metrics we are looking for.  Then
we don’t have to sample or set up expensive infrastructure to receive a
deluge of data.  But then we run into the chicken-and-egg problem that in
order to size your HDFS you need to have data in HDFS.  Ideally you need to
capture at least one full week’s worth of logs, because hours of the day as
well as days of the week have different statistical properties.  So you
need off-peak, on-peak, weekday and weekend data to derive these stats in
batch.
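
For the batch option, below is a rough sketch of the kind of MR job I mean.
It assumes, purely for illustration, that the captured messages land in
HDFS one per line with a leading epoch-millis timestamp and a tab
separator, and it buckets message counts and approximate sizes by hour of
the week, which is roughly the shape of output a sizing exercise would
need.  Class names and the input layout are hypothetical.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IngestVolumeByHour {

  // Emits (hour-of-week bucket, message size) for every raw log line.
  public static class SizeMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String line = value.toString();
      int tab = line.indexOf('\t');
      if (tab <= 0) return;
      long ts;
      try {
        ts = Long.parseLong(line.substring(0, tab));   // leading epoch millis
      } catch (NumberFormatException e) {
        return;                                        // skip malformed lines
      }
      long hourOfWeek = (ts / 3_600_000L) % 168;       // 168 hours in a week
      // line.length() is characters, which is ~bytes for ASCII logs
      ctx.write(new Text("hour-" + hourOfWeek), new LongWritable(line.length()));
    }
  }

  // Sums sizes and counts messages per hour bucket.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> sizes, Context ctx)
        throws IOException, InterruptedException {
      long count = 0, bytes = 0;
      for (LongWritable s : sizes) { count++; bytes += s.get(); }
      ctx.write(key, new Text("msgs=" + count + " approxBytes=" + bytes));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "ingest-volume-by-hour");
    job.setJarByClass(IngestVolumeByHour.class);
    job.setMapperClass(SizeMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}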

Any other design ideas?

Thanks,
James 





On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote:

>If the tool starts at Kafka, the user would have to already have committed
>to the investment in the infrastructure and time to set up the sensors that
>feed Kafka and Kafka itself.  Maybe it would need to be further upstream?
>On Apr 13, 2016 1:05 PM, "James Sirota" <[email protected]> wrote:
>
>> Hi George,
>>
>> This article describes micro-tuning of an existing cluster.  What I am
>> proposing is a level up from that.  When you start with Metron how do you
>> even know how many nodes you need?  And of these nodes how many do you
>> allocate to Storm, indexing, storage?  How much storage do you need?
>> Tuning would be the next step in the process, but this tool would answer
>> more fundamental questions about what a Metron deployment should look like
>> given the number of telemetries and retention policies of the enterprise.
>>
>> The best way to get this data (in my opinion) is to have some tool that we
>> can plug into Metron’s point of ingest (Kafka topics) and run for about a
>> week or a month to figure that out and spit out the relevant metrics.
>> Based on these metrics we can figure out the fundamental things about what
>> Metron should look like.  Tuning would be the next step.
>>
>> Thanks,
>> James
>>
>>
>>
>>
>> On 4/13/16, 9:52 AM, "George Vetticaden" <[email protected]>
>> wrote:
>>
>> >I have used the following Kafka and Storm Best Practices guide at numerous
>> >customer implementations.
>> >
>> >https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html
>> >
>> >
>> >We need to have something similar and prescriptive for Metron based on:
>> >1. What data sources are we enabling
>> >2. What enrichment services are we enabling
>> >3. What threat intel services are we enabling
>> >4. What are we indexing into Solr/Elastic and for how long
>> >5. What are we persisting into HDFS
>> >
>> >Ideally, the Metron assessment tool combined with an introspection of the
>> >user’s Ansible configuration should drive what Ambari blueprint type and
>> >configuration should be used when the cluster is spun up and the Storm
>> >topology is deployed.
>> >
>> >
>> >--
>> >George Vetticaden
>> >Principal, COE
>> >[email protected]
>> >(630) 909-9138
>> >
>> >
>> >
>> >
>> >
>> >On 4/13/16, 11:40 AM, "George Vetticaden" <[email protected]>
>> >wrote:
>> >
>> >>+ 1 to James suggestion.
>> >>We also need to consider not just the data volume and storage
>> >>requirements for proper cluster sizing but also the processing
>> >>requirements.  Given that in the new architecture we have moved to a
>> >>single enrichment topology that will support all data sources, proper
>> >>sizing of the enrichment topology will be even more crucial to maintain
>> >>SLAs and HA requirements.
>> >>The following key questions will apply to each parser topology and the
>> >>single enrichment topology:
>> >>
>> >>1. Number of workers?
>> >>2. Number of workers per machine?
>> >>3. Size of each worker (in memory)?
>> >>4. Supervisor memory settings?
>> >>
>> >>The assessment tool should also be used to size topologies correctly.
>> >>
>> >>Tuning Kafka, HBase and Solr/Elastic should also be governed by the
>> >>Metron assessment tool.
>> >>
>> >>
>> >>--
>> >>George Vetticaden
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>On 4/13/16, 11:28 AM, "James Sirota" <[email protected]> wrote:
>> >>
>> >>>Prior to adoption of Metron each adopting entity needs to guesstimate
>> >>>its data volume and data storage requirements so they can size their
>> >>>cluster properly.  I propose the creation of an assessment tool that can
>> >>>plug in to a Kafka topic for a given telemetry and over time produce
>> >>>statistics for ingest volumes and storage requirements.  The idea is
>> >>>that prior to adoption of Metron someone can set up all the feeds and
>> >>>Kafka topics, but instead of deploying Metron right away they would
>> >>>deploy this tool.  This tool would then produce statistics for data
>> >>>ingest/storage requirements, and all relevant information needed for
>> >>>cluster sizing.
>> >>>
>> >>>Some of the metrics that can be recorded are:
>> >>>
>> >>>  *   Number of system events per second (average, max, mean, standard
>> >>>dev)
>> >>>  *   Message size  (average, max, mean, standard dev)
>> >>>  *   Average number of peaks
>> >>>  *   Duration of peaks  (average, max, mean, standard dev)
>> >>>
>> >>>If the parser for a telemetry exists the tool can produce additional
>> >>>statistics:
>> >>>
>> >>>  *   Number of keys/fields parsed (average, max, mean, standard dev)
>> >>>  *   Length of field parsed (average, max, mean, standard dev)
>> >>>  *   Length of key parsed (average, max, mean, standard dev)
>> >>>
>> >>>The tool can run for a week or a month and produce these kinds of
>> >>>statistics.  Then once the statistics are available we can come up with
>> >>>guidance documentation for a recommended cluster setup.  Otherwise it’s
>> >>>hard to properly size a cluster and set up streaming parallelism without
>> >>>knowing these metrics.
>> >>>
>> >>>
>> >>>Thoughts/ideas?
>> >>>
>> >>>Thanks,
>> >>>James
>> >>
>> >>
>> >
>>
