So this is exactly what I am proposing.  Calculate the metrics on the fly
without landing any data in the cluster.  The problem is that enterprise
data volumes are so large that you can’t just point them at a Java or C++
program or sensor.  You either need an existing minimal Kafka infrastructure
to take that load, or you have to sample the data.
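
For illustration, here is a minimal sketch of what that on-the-fly collector
could look like once a minimal Kafka layer exists.  The broker address and
topic name are hypothetical; it uses the standard kafka-clients consumer API
and never writes anything out, it just counts and discards:

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class IngestMeter {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker
          props.put("group.id", "metron-assessment");
          props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
          props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
          KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
          consumer.subscribe(Collections.singletonList("bro"));  // hypothetical topic
          long events = 0, bytes = 0, windowStart = System.currentTimeMillis();
          while (true) {
              ConsumerRecords<byte[], byte[]> records = consumer.poll(500);
              for (ConsumerRecord<byte[], byte[]> r : records) {
                  events++;
                  bytes += (r.value() == null) ? 0 : r.value().length;
              }
              long now = System.currentTimeMillis();
              if (now - windowStart >= 1000) {   // emit one data point per second
                  System.out.printf("events/sec=%d bytes/sec=%d%n", events, bytes);
                  events = 0; bytes = 0; windowStart = now;
              }
          }
      }
  }

The only infrastructure this needs is the Kafka layer itself.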

Thanks,
James 




On 4/15/16, 9:54 AM, "Nick Allen" <[email protected]> wrote:

>Or we have the assessment tool not actually land any data.  The assessment
>tool becomes a 'sensor' in its own right.  You just point the input data
>sets at the assessment tool; it builds metrics on the input (for example,
>counting the number of packets per second), and then we use those metrics
>to estimate cluster size.
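>
>A rough sketch of that idea, assuming newline-delimited telemetry is piped
>straight into the tool (the class name and log path are made up):
>
>  import java.io.BufferedReader;
>  import java.io.InputStreamReader;
>
>  public class SensorMeter {
>      public static void main(String[] args) throws Exception {
>          BufferedReader in =
>                  new BufferedReader(new InputStreamReader(System.in));
>          long events = 0, windowStart = System.currentTimeMillis();
>          while (in.readLine() != null) {          // one event per line
>              events++;
>              long now = System.currentTimeMillis();
>              if (now - windowStart >= 1000) {     // report once per second
>                  System.err.printf("events/sec=%d%n", events);
>                  events = 0;
>                  windowStart = now;
>              }
>          }
>      }
>  }
>
>Usage would be something like "tail -f /var/log/sensor.log | java
>SensorMeter"; no data is landed anywhere.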
>
>On Wed, Apr 13, 2016 at 5:45 PM, James Sirota <[email protected]>
>wrote:
>
>> That’s an excellent point.  So I think there are three ways forward.
>>
>> One is we can assume that there has to be at least a minimal
>> infrastructure in place (at least a subset of Kafka and Storm resources) to
>> run a full-scale assessment.  If you point something that blasts millions
>> of messages per second at something like ActiveMQ, it is going to blow
>> up.  So the infrastructure to at least receive these kinds of message
>> volumes has to exist as a prerequisite.  There is no way to get around that.
>>
>> The second approach I see is sampling.  Sampling is a lot less precise and
>> you can miss peaks that fall outside of your sampling windows.  But the
>> obvious benefit is that you don’t need a cluster to process these streams.
>> You can probably perform most of your calculations with a multithreaded
>> Java program.  Sampling poses a few design challenges.  First, where do you
>> sample?  Do you sample on the sensor?  (The implication is that we would
>> have to program some sort of sampling capability into our sensors.)  Do you
>> sample in transport?  (Maybe a Flume interceptor or a NiFi processor.)
>> There is also the question of what the sampling rate should be.  Without
>> knowing the statistical properties of a stream ahead of time, it’s hard to
>> make that call.
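>>
>> One way to sidestep the sampling-rate question, at least for per-message
>> statistics, is reservoir sampling: it keeps a fixed-size uniform sample
>> without knowing the stream’s properties up front.  A minimal sketch (the
>> reservoir size of 10,000 is an arbitrary placeholder):
>>
>>   import java.util.Random;
>>
>>   public class Reservoir {
>>       private final byte[][] sample = new byte[10000][];
>>       private long seen = 0;
>>       private final Random rng = new Random();
>>
>>       // Algorithm R: after n messages, each one has been retained with
>>       // probability sample.length / n.
>>       public void offer(byte[] message) {
>>           seen++;
>>           if (seen <= sample.length) {
>>               sample[(int) (seen - 1)] = message;        // still filling
>>           } else {
>>               long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
>>               if (j < sample.length) {
>>                   sample[(int) j] = message;             // evict a random slot
>>               }
>>           }
>>       }
>>   }
>>
>> Note this gives an unbiased view of message sizes and field counts but
>> deliberately throws away timing, so it does not solve the missed-peaks
>> problem above.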
>>
>> The third option, I think, is an MR job.  We can blast the data into HDFS
>> and then go over it with MR to derive the metrics we are looking for.  Then
>> we don’t have to sample or set up expensive infrastructure to receive a
>> deluge of data.  But then we run into a chicken-and-egg problem: in order
>> to size your HDFS you need to have data in HDFS.  Ideally you need to
>> capture at least one full week’s worth of logs, because different times of
>> day and different days of the week have different statistical properties.
>> So you need off-peak, on-peak, weekdays, and weekends to derive these stats
>> in batch.
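>>
>> For what it’s worth, the MR job itself is simple once the data is in HDFS.
>> A sketch of a single-pass message-size job (class names are made up;
>> assumes one log message per line):
>>
>>   import java.io.IOException;
>>   import org.apache.hadoop.io.LongWritable;
>>   import org.apache.hadoop.io.Text;
>>   import org.apache.hadoop.mapreduce.Mapper;
>>   import org.apache.hadoop.mapreduce.Reducer;
>>
>>   public class MessageSizeStats {
>>       // Map: emit (constant key, message length) for every log line.
>>       public static class SizeMapper
>>               extends Mapper<LongWritable, Text, Text, LongWritable> {
>>           private static final Text KEY = new Text("msg_len");
>>           protected void map(LongWritable offset, Text line, Context ctx)
>>                   throws IOException, InterruptedException {
>>               ctx.write(KEY, new LongWritable(line.getLength()));
>>           }
>>       }
>>       // Reduce: fold all lengths into count / average / max.
>>       public static class SizeReducer
>>               extends Reducer<Text, LongWritable, Text, Text> {
>>           protected void reduce(Text key, Iterable<LongWritable> lengths,
>>                   Context ctx) throws IOException, InterruptedException {
>>               long count = 0, sum = 0, max = 0;
>>               for (LongWritable len : lengths) {
>>                   count++;
>>                   sum += len.get();
>>                   max = Math.max(max, len.get());
>>               }
>>               ctx.write(key, new Text("count=" + count
>>                       + " avg=" + (count == 0 ? 0 : sum / count)
>>                       + " max=" + max));
>>           }
>>       }
>>   }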
>>
>> Any other design ideas?
>>
>> Thanks,
>> James
>>
>>
>>
>>
>>
>> On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote:
>>
>> >If the tool starts at Kafka, the user would already have had to commit to
>> >the investment in infrastructure and the time to set up both the sensors
>> >that feed Kafka and Kafka itself.  Maybe it would need to be further
>> >upstream?
>> >On Apr 13, 2016 1:05 PM, "James Sirota" <[email protected]> wrote:
>> >
>> >> Hi George,
>> >>
>> >> This article describes micro-tuning of an existing cluster.  What I am
>> >> proposing is a level up from that.  When you start with Metron, how do
>> >> you even know how many nodes you need?  And of those nodes, how many do
>> >> you allocate to Storm, indexing, and storage?  How much storage do you
>> >> need?  Tuning would be the next step in the process, but this tool would
>> >> answer more fundamental questions about what a Metron deployment should
>> >> look like given the number of telemetries and the retention policies of
>> >> the enterprise.
>> >>
>> >> The best way to get this data (in my opinion) is to have some tool that
>> >> we can plug into Metron’s point of ingest (Kafka topics) and run for
>> >> about a week or a month, and have it spit out the relevant metrics.
>> >> Based on these metrics we can figure out the fundamental things about
>> >> what Metron should look like.  Tuning would be the next step.
>> >>
>> >> Thanks,
>> >> James
>> >>
>> >>
>> >>
>> >>
>> >> On 4/13/16, 9:52 AM, "George Vetticaden" <[email protected]>
>> >> wrote:
>> >>
>> >> >I have used the following Kafka and Storm Best Practices guide at
>> >> >numerous customer implementations:
>> >> >
>> >> >https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html
>> >> >
>> >> >
>> >> >We need to have something similar and prescriptive for Metron, based on:
>> >> >1. What data sources are we enabling?
>> >> >2. What enrichment services are we enabling?
>> >> >3. What threat intel services are we enabling?
>> >> >4. What are we indexing into Solr/Elastic, and for how long?
>> >> >5. What are we persisting into HDFS?
>> >> >
>> >> >Ideally, the Metron assessment tool, combined with introspection of
>> >> >the user’s Ansible configuration, should drive which Ambari blueprint
>> >> >type and configuration are used when the cluster is spun up and the
>> >> >Storm topologies are deployed.
>> >> >
>> >> >
>> >> >--
>> >> >George Vetticaden
>> >> >Principal, COE
>> >> >[email protected]
>> >> >(630) 909-9138
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >On 4/13/16, 11:40 AM, "George Vetticaden" <[email protected]>
>> >> >wrote:
>> >> >
>> >> >>+1 to James’ suggestion.
>> >> >>We also need to consider not just the data volume and storage
>> >> >>requirements for proper cluster sizing, but also the processing
>> >> >>requirements.  Given that in the new architecture we have moved to a
>> >> >>single enrichment topology that will support all data sources, proper
>> >> >>sizing of the enrichment topology will be even more crucial to
>> >> >>maintaining SLAs and HA requirements.
>> >> >>
>> >> >>The following key questions will apply to each parser topology and to
>> >> >>the single enrichment topology (see the sketch after the list):
>> >> >>
>> >> >>1. Number of workers?
>> >> >>2. Number of workers per machine?
>> >> >>3. Size of each worker (in memory)?
>> >> >>4. Supervisor memory settings?
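>> >> >>
>> >> >>As a sketch of where those answers land (Storm 0.10-era API; the
>> >> >>numbers are placeholders, not recommendations):
>> >> >>
>> >> >>  import backtype.storm.Config;
>> >> >>
>> >> >>  public class TopologySizing {
>> >> >>      public static Config sized() {
>> >> >>          Config conf = new Config();
>> >> >>          conf.setNumWorkers(4);                  // 1. workers per topology
>> >> >>          conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
>> >> >>                  "-Xmx2g");                      // 3. heap per worker
>> >> >>          // 2 (workers per machine) and 4 (supervisor memory) are
>> >> >>          // cluster-side: supervisor.slots.ports and supervisor.childopts
>> >> >>          // in storm.yaml.
>> >> >>          return conf;
>> >> >>      }
>> >> >>  }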
>> >> >>
>> >> >>The assessment tool should also be used to size topologies correctly.
>> >> >>
>> >> >>Tuning Kafka, HBase, and Solr/Elastic should also be governed by the
>> >> >>Metron assessment tool.
>> >> >>
>> >> >>
>> >> >>--
>> >> >>George Vetticaden
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>On 4/13/16, 11:28 AM, "James Sirota" <[email protected]> wrote:
>> >> >>
>> >> >>>Prior to adopting Metron, each adopting entity needs to guesstimate
>> >> >>>its data volume and data storage requirements so they can size their
>> >> >>>cluster properly.  I propose the creation of an assessment tool that
>> >> >>>can plug in to a Kafka topic for a given telemetry and, over time,
>> >> >>>produce statistics on ingest volumes and storage requirements.  The
>> >> >>>idea is that prior to adopting Metron someone can set up all the feeds
>> >> >>>and Kafka topics, but instead of deploying Metron right away they
>> >> >>>would deploy this tool.  The tool would then produce statistics on
>> >> >>>data ingest/storage requirements and all other information relevant to
>> >> >>>cluster sizing.
>> >> >>>
>> >> >>>Some of the metrics that can be recorded (see the sketch below) are:
>> >> >>>
>> >> >>>  *   Number of system events per second (average, max, mean,
>> >> >>>standard dev)
>> >> >>>  *   Message size (average, max, mean, standard dev)
>> >> >>>  *   Average number of peaks
>> >> >>>  *   Duration of peaks (average, max, mean, standard dev)
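>> >> >>>
>> >> >>>All of the average/max/standard-dev numbers above can be kept in a
>> >> >>>single pass with O(1) memory per metric, so the tool never has to
>> >> >>>store the stream.  A sketch using Welford’s online algorithm:
>> >> >>>
>> >> >>>  public class StreamingStats {
>> >> >>>      private long n = 0;
>> >> >>>      private double mean = 0, m2 = 0, max = Double.NEGATIVE_INFINITY;
>> >> >>>
>> >> >>>      // Welford’s update: numerically stable running mean/variance.
>> >> >>>      public void add(double x) {
>> >> >>>          n++;
>> >> >>>          double delta = x - mean;
>> >> >>>          mean += delta / n;
>> >> >>>          m2 += delta * (x - mean);
>> >> >>>          if (x > max) max = x;
>> >> >>>      }
>> >> >>>
>> >> >>>      public double mean()   { return mean; }
>> >> >>>      public double max()    { return max; }
>> >> >>>      public double stdDev() {
>> >> >>>          return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0;
>> >> >>>      }
>> >> >>>  }
>> >> >>>
>> >> >>>One instance per metric (events per second, message size, peak
>> >> >>>duration, and so on) would be enough.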
>> >> >>>
>> >> >>>If the parser for a telemetry exists, the tool can produce additional
>> >> >>>statistics:
>> >> >>>
>> >> >>>  *   Number of keys/fields parsed (average, max, mean, standard dev)
>> >> >>>  *   Length of field parsed (average, max, mean, standard dev)
>> >> >>>  *   Length of key parsed (average, max, mean, standard dev)
>> >> >>>
>> >> >>>The tool can run for a week or a month and produce these kinds of
>> >> >>>statistics.  Once the statistics are available, we can come up with
>> >> >>>guidance documentation for a recommended cluster setup.  Otherwise
>> >> >>>it’s hard to properly size a cluster and set up streaming parallelism
>> >> >>>without knowing these metrics.
>> >> >>>
>> >> >>>
>> >> >>>Thoughts/ideas?
>> >> >>>
>> >> >>>Thanks,
>> >> >>>James
>> >> >>
>> >> >>
>> >> >
>> >>
>>
>
>
>
>-- 
>Nick Allen <[email protected]>
