However, it would be handy to have something like this perpetually running
so you know when to scale up/out/down/in a cluster.

On Fri, Apr 15, 2016, 13:35 Nick Allen <[email protected]> wrote:

> I think it is slightly different.  I don't even want to install a minimal
> Kafka infrastructure (Look ma, no Kafka!)
>
> The exact implementation would differ based on the data inputs that you are
> trying to measure, but for example...
>
>    - To understand raw packet rates I would have a specialized sensor that
>    counts packets and size on the wire.  It doesn't do anything more than
> that.
>    - To understand Netflow rates, it would watch for Netflow packets and
>    count those.
>    - To understand sizing around application logs, a sensor would watch for
>    Syslog packets and count those.
>
> The implementation would be more similar to raw packet capture with some
> DPI.  No Hadoop-y components required.
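
For illustration, here is a minimal sketch of one such counting sensor in plain
Java (no Kafka, no Hadoop).  It assumes syslog arriving over UDP; the port and
the once-per-second reporting interval are placeholders, not part of Nick's
proposal:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;

    /**
     * Minimal "sensor" sketch: listens for syslog datagrams on a UDP port and
     * reports message count and byte volume once per second.
     */
    public class SyslogRateSensor {
        public static void main(String[] args) throws Exception {
            int port = args.length > 0 ? Integer.parseInt(args[0]) : 5140;  // 514 needs root
            DatagramSocket socket = new DatagramSocket(port);
            socket.setSoTimeout(1000);  // wake up at least once a second
            byte[] buf = new byte[65535];
            long count = 0, bytes = 0, windowStart = System.currentTimeMillis();
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                try {
                    socket.receive(packet);
                    count++;
                    bytes += packet.getLength();
                } catch (java.net.SocketTimeoutException ignored) {
                    // quiet interval; fall through and report zeros
                }
                long now = System.currentTimeMillis();
                if (now - windowStart >= 1000) {
                    System.out.printf("%d msgs/sec, %d bytes/sec%n", count, bytes);
                    count = 0; bytes = 0; windowStart = now;
                }
            }
        }
    }

A raw-packet or Netflow counter would look much the same, with a capture
library (e.g. libpcap) in place of the UDP socket.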
>
>
>
> On Fri, Apr 15, 2016 at 1:10 PM, James Sirota <[email protected]>
> wrote:
>
> > So this is exactly what I am proposing.  Calculate the metrics on the fly
> > without landing any data in the cluster.  The problem is that
> > enterprise data volumes are so large you can’t just point them at a Java
> or
> > a C++ program or sensor.  You either need an existing minimal Kafka
> > infrastructure to take that load or sample the data.
> >
> > Thanks,
> > James
> >
> >
> >
> >
> > On 4/15/16, 9:54 AM, "Nick Allen" <[email protected]> wrote:
> >
> > >Or we have the assessment tool not actually land any data.  The
> assessment
> > >tool becomes a 'sensor' in its own right.  You just point the input data
> > >sets at the assessment tool, it builds metrics on the input (for
> example:
> > >count the number of packets per second) and then we use those metrics to
> > >estimate cluster size.
> > >
> > >On Wed, Apr 13, 2016 at 5:45 PM, James Sirota <[email protected]>
> > >wrote:
> > >
> > >> That’s an excellent point.  So I think there are three ways forward.
> > >>
> > >> One is we can assume that there has to be at least a minimal
> > >> infrastructure in place (at least a subset of Kafka and Storm
> > resources) to
> > >> run a full-scale assessment.  If you point something that blasts
> > millions
> > >> of messages per second at something like ActiveMQ you are going to
> blow
> > >> up.  So the infrastructure to at least receive these kinds of message
> > >> volumes has to exist as a prerequisite.  There is no way to get around
> > that.
> > >>
> > >> The second approach I see is sampling.  Sampling is a lot less precise
> > and
> > >> you can miss peaks that fall outside of your sampling windows.  But
> the
> > >> obvious benefit is that you don’t need a cluster to process these
> > streams.
> > >> You can probably perform most of your calculations with a
> multithreaded
> > >> Java program.  Sampling poses a few design challenges.  First, where
> do
> > you
> > >> sample?  Do you sample on the sensor? (the implication here is that we
> > have
> > >> to program some sort of sampling capability in our sensors).  Do you
> > sample
> > >> on transport? (maybe a Flume interceptor or NiFi processor).  There is
> > also
> > >> a question of what the sampling rate should be.  Not knowing
> statistical
> > >> properties of a stream ahead of time it’s hard to make that call.
> > >>
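
As a concrete illustration of the sampling option: a small reservoir sampler
(Algorithm R) keeps a uniform sample of whatever stream is piped into it, so
size statistics can be estimated without landing the full feed.  This is a
sketch only; stdin as the input and the 10,000-message reservoir are arbitrary
choices, and it does not answer the sampling-rate question:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /** Uniform reservoir sample of messages read from stdin, one per line. */
    public class MessageSampler {
        public static void main(String[] args) throws Exception {
            final int reservoirSize = 10000;
            List<String> reservoir = new ArrayList<>(reservoirSize);
            Random random = new Random();
            long seen = 0;
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                seen++;
                if (reservoir.size() < reservoirSize) {
                    reservoir.add(line);
                } else {
                    // keep each new message with probability reservoirSize / seen
                    long j = (long) (random.nextDouble() * seen);
                    if (j < reservoirSize) {
                        reservoir.set((int) j, line);
                    }
                }
            }
            long totalBytes = 0, maxBytes = 0;
            for (String msg : reservoir) {
                totalBytes += msg.length();
                maxBytes = Math.max(maxBytes, msg.length());
            }
            double avg = reservoir.isEmpty() ? 0.0 : (double) totalBytes / reservoir.size();
            System.out.printf("seen=%d sampled=%d avgSize=%.1f maxSize=%d%n",
                    seen, reservoir.size(), avg, maxBytes);
        }
    }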
> > >> The third option, I think, is an MR job.  We can blast the data into HDFS
> and
> > >> then go over it with MR to derive the metrics we are looking for.
> Then
> > we
> > >> don’t have to sample or set up expensive infrastructure to receive a
> > deluge
> > >> of data.  But then we run into the chicken and the egg problem that in
> > >> order to size your HDFS you need to have data in HDFS.  Ideally you
> > need to
> > >> capture at least one full week’s worth of logs because patterns
> > throughout
> > >> the day as well as every day of the week have different statistical
> > >> properties.  So you need off peak, on peak, weekdays and weekends to
> > derive
> > >> these stats in batch.
> > >>
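
As a rough sketch of the MR option (assuming raw messages have already been
landed in HDFS, one message per line; paths and class names are illustrative
only), a job that sums message counts and byte volume could look like:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Sums message count and byte volume over text files in HDFS. */
    public class IngestVolumeJob {

        public static class VolumeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final Text MESSAGES = new Text("messages");
            private static final Text BYTES = new Text("bytes");
            private final LongWritable one = new LongWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(MESSAGES, one);
                context.write(BYTES, new LongWritable(line.getLength()));
            }
        }

        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "ingest-volume");
            job.setJarByClass(IngestVolumeJob.class);
            job.setMapperClass(VolumeMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Keying on the hour of the message timestamp instead of a constant would give
the on-peak/off-peak breakdown described above, but only once the data is
already sitting in HDFS.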
> > >> Any other design ideas?
> > >>
> > >> Thanks,
> > >> James
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 4/13/16, 1:59 PM, "Nick Allen" <[email protected]> wrote:
> > >>
> > >> >If the tool starts at Kafka, the user would have to already have
> > committed
> > >> >to the investment in the infrastructure and time to set up the sensors
> > that
> > >> >feed Kafka and Kafka itself.  Maybe it would need to be further
> > upstream?
> > >> >On Apr 13, 2016 1:05 PM, "James Sirota" <[email protected]>
> > wrote:
> > >> >
> > >> >> Hi George,
> > >> >>
> > >> >> This article describes micro-tuning of an existing cluster.  What I
> am
> > >> >> proposing is a level up from that.  When you start with Metron how
> do
> > >> you
> > >> >> even know how many nodes you need?  And of these nodes how many do
> > you
> > >> >> allocate to Storm, indexing, storage?  How much storage do you
> need?
> > >> >> Tuning would be the next step in the process, but this tool would
> > answer
> > >> >> more fundamental questions about what a Metron deployment should
> look
> > >> like
> > >> >> given the number of telemetries and retention policies of the
> > >> enterprise.
> > >> >>
> > >> >> The best way to get this data (in my opinion) is to have some tool
> > that
> > >> we
> > >> >> can plug into Metron’s point of ingest (Kafka topics) and run that
> > for
> > >> >> about a week or a month to be able to figure that out and spit out
> > these
> > >> >> relevant metrics.  Based on these metrics we can figure out the
> > >> fundamental
> > >> >> things about what Metron should look like.  Tuning would be the
> next
> > >> step.
> > >> >>
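
A sketch of such a tool, assuming the standard Kafka consumer API; the broker
address, topic name, and group id below are placeholders:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    /** Consumes one ingest topic and prints message and byte rates per second. */
    public class TopicRateProbe {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:6667");
            props.put("group.id", "metron-assessment");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("yaf"));  // example topic
                long count = 0, bytes = 0, windowStart = System.currentTimeMillis();
                while (true) {
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(100);
                    for (ConsumerRecord<byte[], byte[]> record : records) {
                        count++;
                        bytes += record.value() == null ? 0 : record.value().length;
                    }
                    long now = System.currentTimeMillis();
                    if (now - windowStart >= 1000) {
                        System.out.printf("%d msgs/sec, %d bytes/sec%n", count, bytes);
                        count = 0; bytes = 0; windowStart = now;
                    }
                }
            }
        }
    }

In practice it would write the per-interval numbers somewhere durable so a week
or a month of them can be summarized afterwards.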
> > >> >> Thanks,
> > >> >> James
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> On 4/13/16, 9:52 AM, "George Vetticaden" <
> > [email protected]>
> > >> >> wrote:
> > >> >>
> > >> >> >I have used the following Kafka and Storm Best Practices guide at
> > >> numerous
> > >> >> >customer implementations.
> > >> >> >
> > >> >>
> > >>
> >
> https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-b
> > >> >> >est-practices-guide.html
> > >> >> >
> > >> >> >
> > >> >> >We need to have something similar and prescriptive for Metron
> based
> > on:
> > >> >> >1. What data sources are we enabling
> > >> >> >2. What enrichment services are we enabling
> > >> >> >3. What threat intel services are we enabling
> > >> >> >4. What are we indexing into Solr/Elastic and how long
> > >> >> >5. What are we persisting into HDFS
> > >> >> >
> > >> >> >Ideally, the Metron assessment tool combined with an introspection of
> > >> >> >the user’s Ansible configuration should drive what Ambari blueprint type
> > >> >> >and configuration should be used when the cluster is spun up and the
> > >> >> >Storm topology is deployed.
> > >> >> >
> > >> >> >
> > >> >> >--
> > >> >> >George Vetticaden
> > >> >> >Principal, COE
> > >> >> >[email protected]
> > >> >> >(630) 909-9138
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >On 4/13/16, 11:40 AM, "George Vetticaden" <
> > [email protected]
> > >> >
> > >> >> >wrote:
> > >> >> >
> > >> >> >>+ 1 to James suggestion.
> > >> >> >>We also need to consider not just the data volume and storage
> > >> >> requirements
> > >> >> >>for proper cluster sizing but also processing requirements as
> well.
> > >> Given
> > >> >> >>that in the new architecture, we have moved to a single enrichment
> > >> topology
> > >> >> >>that will support all data sources, proper sizing of the
> enrichment
> > >> >> >>topology will be even more crucial to maintaining SLAs and HA
> > >> requirements.
> > >> >> >>The following key questions will apply to each parser topology
> and
> > >> single
> > >> >> >>enrichment topology
> > >> >> >>
> > >> >> >>1. Number of workers?
> > >> >> >>2. Number of workers per machine?
> > >> >> >>3. Size of each worker (in memory)?
> > >> >> >>4. Supervisor memory settings?
> > >> >> >>
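
For reference, a sketch (not a recommendation; all numbers are placeholders,
and older Storm versions use the backtype.storm package) of how questions 1
and 3 map onto per-topology configuration, while 2 and 4 are cluster-side
settings in storm.yaml:

    import org.apache.storm.Config;

    /** Placeholder sizing for a topology; the assessment tool would supply real numbers. */
    public class TopologySizing {
        public static Config sizedConfig() {
            Config conf = new Config();
            conf.setNumWorkers(4);                                    // 1. number of workers
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2048m");  // 3. worker heap size
            // 2. workers per machine  -> supervisor.slots.ports in storm.yaml
            // 4. supervisor memory    -> supervisor.childopts in storm.yaml
            return conf;
        }

        public static void main(String[] args) {
            System.out.println(sizedConfig());
        }
    }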
> > >> >> >>The assessment tool should also be used to size topologies correctly.
> > >> >> >>
> > >> >> >>Tuning Kafka, HBase, and Solr/Elastic should also be governed by
> the
> > >> >> Metron
> > >> >> >>assessment tool.
> > >> >> >>
> > >> >> >>
> > >> >> >>--
> > >> >> >>George Vetticaden
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>On 4/13/16, 11:28 AM, "James Sirota" <[email protected]>
> > wrote:
> > >> >> >>
> > >> >> >>>Prior to adoption of Metron each adopting entity needs to
> > guesstimate
> > >> >> >>>its data volume and data storage requirements so they can size
> > their
> > >> >> >>>cluster properly.  I propose the creation of an assessment tool
> that
> > >> can
> > >> >> >>>plug in to a Kafka topic for a given telemetry and over time
> > produce
> > >> >> >>>statistics for ingest volumes and storage requirements.  The idea
> > is
> > >> that
> > >> >> >>>prior to adoption of Metron someone can set up all the feeds and
> > >> Kafka
> > >> >> >>>topics, but instead of deploying Metron right away they would
> > deploy
> > >> >> this
> > >> >> >>>tool.  This tool would then produce statistics for data
> > >> ingest/storage
> > >> >> >>>requirements, and all relevant information needed for cluster
> > sizing.
> > >> >> >>>
> > >> >> >>>Some of the metrics that can be recorded are:
> > >> >> >>>
> > >> >> >>>  *   Number of system events per second (average, max, mean,
> > >> standard
> > >> >> >>>dev)
> > >> >> >>>  *   Message size  (average, max, mean, standard dev)
> > >> >> >>>  *   Average number of peaks
> > >> >> >>>  *   Duration of peaks  (average, max, mean, standard dev)
> > >> >> >>>
> > >> >> >>>If the parser for a telemetry exists, the tool can produce
> > additional
> > >> >> >>>statistics
> > >> >> >>>
> > >> >> >>>  *   Number of keys/fields parsed (average, max, mean, standard
> > dev)
> > >> >> >>>  *   Length of field parsed (average, max, mean, standard dev)
> > >> >> >>>  *   Length of key parsed (average, max, mean, standard dev)
> > >> >> >>>
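
These running statistics can all be computed in a single pass without storing
the stream.  As a sketch (message size in bytes used as the example metric;
the same accumulator would work for field counts or lengths), Welford's online
algorithm gives mean and standard deviation incrementally:

    /** One-pass mean/max/standard deviation accumulator (Welford's algorithm). */
    public class RunningStats {
        private long n = 0;
        private double mean = 0.0;
        private double m2 = 0.0;   // running sum of squared deviations from the mean
        private double max = Double.NEGATIVE_INFINITY;

        public void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
            max = Math.max(max, x);
        }

        public double mean()   { return mean; }
        public double max()    { return max; }
        public double stdDev() { return n < 2 ? 0.0 : Math.sqrt(m2 / (n - 1)); }

        public static void main(String[] args) {
            RunningStats sizeStats = new RunningStats();
            for (String msg : new String[] {"short", "a much longer example message", "mid-size"}) {
                sizeStats.add(msg.length());
            }
            System.out.printf("mean=%.1f max=%.1f stddev=%.1f%n",
                    sizeStats.mean(), sizeStats.max(), sizeStats.stdDev());
        }
    }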
> > >> >> >>>The tool can run for a week or a month and produce these kinds
> of
> > >> >> >>>statistics.  Then once the statistics are available we can come
> up
> > >> with
> > >> >> a
> > >> >> >>>guidance document on recommended cluster setup.  Otherwise
> > it’s
> > >> >> hard
> > >> >> >>>to properly size a cluster and set up streaming parallelism without
> > >> knowing
> > >> >> >>>these metrics.
> > >> >> >>>
> > >> >> >>>
> > >> >> >>>Thoughts/ideas?
> > >> >> >>>
> > >> >> >>>Thanks,
> > >> >> >>>James
> > >> >> >>
> > >> >> >>
> > >> >> >
> > >> >>
> > >>
> > >
> > >
> > >
> > >--
> > >Nick Allen <[email protected]>
> >
>
>
>
> --
> Nick Allen <[email protected]>
>
-- 

Jon
