Re: Reply: Question on Metrics Server to Alibaba team

Bobby Evans Thu, 24 Mar 2016 11:09:51 -0700

The problem is that we want something for storm that can work out of the box, 
ideally without some other complicated external service (except zookeeper which 
we already have, and is not actually that complex to setup and run).
If we feel that we must have some external state store that is required for 
storm to run, then we need to make the decision carefully and deliberately.
 - Bobby


    On Wednesday, March 23, 2016 8:37 AM, John Fang 
<[email protected]> wrote:
 

Sorry, I misunderstood. We will add H/A for TopologyMaster, and the metrics 
meta will be stored in HDFS, so it won't rely on nimbus. That should improve 
the stability of the metrics system.

-----Original Message-----
From: Cody Innowhere [mailto:[email protected]]
Sent: March 23, 2016 19:59
To: [email protected]
Subject: Re: Question on Metrics Server to Alibaba team

If we don't rely on any external system, our metrics system is still available 
but will store metrics meta/data in rocksdb on the nimbus servers.
There will be limits, though. For example, we cannot keep metrics data for the 
whole topology lifecycle: rocksdb is only a KV store, so it may not support 
efficient scan operations, and too much data on local disk may bring in extra 
IO overhead. So we may have to keep only, say, the latest 1 hour of m1 data 
and 6 hours of m10 data (currently not implemented in JStorm, but quite easy 
to do).
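
A minimal sketch of the kind of retention limit described above, under loud 
assumptions: "m1" and "m10" are taken to mean 1-minute and 10-minute windows, 
the store is a plain in-memory deque rather than rocksdb, and every name 
(RetentionWindowSketch, Point, add) is hypothetical and not part of JStorm.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch only: not JStorm code and not backed by rocksdb.
// It keeps points for one metric window and evicts anything older than
// a fixed retention period, measured from the newest point added.
final class RetentionWindowSketch {

    /** One aggregated metric point: timestamp in milliseconds plus a value. */
    static final class Point {
        final long timestampMs;
        final double value;

        Point(long timestampMs, double value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final long retentionMs;
    private final Deque<Point> points = new ArrayDeque<>();

    RetentionWindowSketch(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Appends a point (timestamps are assumed to arrive in order) and evicts expired ones. */
    void add(long timestampMs, double value) {
        points.addLast(new Point(timestampMs, value));
        long cutoff = timestampMs - retentionMs;
        while (!points.isEmpty() && points.peekFirst().timestampMs < cutoff) {
            points.removeFirst();
        }
    }

    int size() {
        return points.size();
    }

    /** Timestamp of the oldest retained point; only valid when size() > 0. */
    long oldestTimestampMs() {
        return points.peekFirst().timestampMs;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Assumed limits from the text: 1 hour of m1 data, 6 hours of m10 data.
        RetentionWindowSketch m1 = new RetentionWindowSketch(60 * minute);
        RetentionWindowSketch m10 = new RetentionWindowSketch(6 * 60 * minute);

        // Simulate one day: one m1 point per minute, one m10 point per 10 minutes.
        for (long t = 0; t <= 24 * 60; t++) {
            m1.add(t * minute, 1.0);
            if (t % 10 == 0) {
                m10.add(t * minute, 10.0);
            }
        }
        System.out.println("m1 points kept:  " + m1.size());
        System.out.println("m10 points kept: " + m10.size());
    }
}
```

With a KV store the same idea would presumably become a periodic delete of 
keys older than the cutoff, which is where the scan-efficiency concern 
mentioned above comes in.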

TopologyMaster is merely a channel for registering/computing/uploading metrics 
to nimbus, so if a TM goes down, the topology metrics will be unavailable for a 
while until it is brought up somewhere else (in a normal failover case, this 
should be very fast), while supervisor/nimbus metrics are unaffected as they're 
sent to nimbus via the thrift interface. Once the TM is back, the topology 
metrics will be available again.

Currently JStorm syncs metrics meta between multiple nimbus servers, but 
metrics data is not synced. So on a nimbus failure, we may lose some metrics 
data.


On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <[email protected]> wrote:

> John,
>
> My concern is H/A of metrics on Storm by default. (I'm not 100% sure 
> Bobby pointed out the same thing.)
>
> Since Apache Storm is used by a wide variety of users, we can't 
> assume that users know external systems (including the Hadoop 
> ecosystem, in my personal opinion) or can operate them smoothly.
> It reminds me how important it is to keep the defaults in mind.
>
> Therefore, I'm curious whether the new metrics feature of JStorm can work 
> smoothly without an external system (HBase / OTS). I'd love to see it 
> support H/A without other systems; otherwise users have to tolerate loss 
> of metrics in some scenarios.
>
> I guess these may be valid questions on H/A (as far as my understanding 
> of the design doc is right): How do metrics work when TopologyMaster is 
> down? And how do metrics work when failover of Nimbus occurs?
>
> Personally I don't mind losing metrics for short durations (I just want 
> to check availability of H/A), but a failure shouldn't mess up all metrics.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <[email protected]> wrote:
>
> > @Bobby Evans The JStorm code has gone through a lot of testing over the 
> > past few years, especially for HA and scalability. We have done a lot of 
> > optimization on metrics. The performance is better than Flink in 
> > my tests. In my personal opinion, the metrics in JStorm offer a great 
> > deal of information, and they can tell us where the bottleneck is when 
> > we run a topology. The performance bottleneck may be 
> > serialize/deserialize/netty/executor and so on. Of course, there is 
> > also other good monitoring in the world, so I hope we can choose the 
> > better monitoring before phase 2. I will start studying Atlas. If it 
> > is better, I will be pleased to redesign the metrics with Atlas.
> > -----Original Message-----
> > From: Bobby Evans [mailto:[email protected]]
> > Sent: March 22, 2016 22:36
> > To: [email protected]
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > My personal opinion is that we should not reinvent the wheel (aka 
> > distributed fault tolerant metrics) ourselves.  The local file 
> > blobstore with nimbus HA was a big enough pain to write and it is 
> > relatively simple in comparison.
> > If the JStorm code is simple and offers everything we need in terms 
> > of HA and scalability then I would be OK with it, but if it doesn't 
> > I would lean towards a different compatible open source solution.
> >
> > https://github.com/Netflix/atlas
> > looks very promising as a default option.  It is actively maintained 
> > by a group that I think has some of the best monitoring in the 
> > world.  And it is both java and apache compatible.  It has no 
> > histogram support that I could find, but that I don't see as being 
> > super critical.  The biggest drawback is there is little 
> > documentation on how to use it, to really be able to evaluate it for 
> > our needs.
> >  - Bobby
> >
> >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim 
> > <[email protected]>
> > wrote:
> >
> >
> >  Harsha,
> >
> > That's why I think the new metrics feature of JStorm looks promising.
> >
> > According to the design doc on
> > https://issues.apache.org/jira/browse/STORM-1329,
> > there's no distinction between topology stats (which Apache Storm 
> > includes in the worker heartbeat) and built-in metrics (which should be 
> > handled by a separate consumer, as you stated).
> > All metrics are passed to Nimbus and Nimbus caches them, which 
> > implies we can treat all metrics the same, and we can also provide 
> > built-in metrics (including custom metrics) to users via the REST API.
> >
> > I thought about a standalone metrics server process which handles 
> > all the metrics work (maybe TopologyMaster + Nimbus in the design doc), 
> > but if the current implementation of the metrics feature in JStorm can 
> > take care of what I'm assuming, I guess it's good enough.
> >
> > Since I don't know about TopologyMaster, I just wonder whether there are 
> > any SPOFs (including soft ones) and how metrics work when a SPOF 
> > component goes down.
> > Since Cody gave us a starting point to dig into, we can evaluate 
> > that feature before phase 2.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <[email protected]> wrote:
> >
> > > One of the goals of this work, which can probably be addressed in a 
> > > separate jira, is how the topology metrics reporter works. Today 
> > > it's a bolt that's part of the topology graph, which means it's another 
> > > node in the topology DAG that needs to be tuned for better 
> > > performance. Some of our users took performance hits by deploying a 
> > > topology metrics reporter that sends metrics to Ganglia. 
> > > Ideally this collection should be asynchronous and not be a node in 
> > > the topology DAG.
> > >
> > > Shipping a default metrics server along with a pluggable option 
> > > for users who want to use Graphite or other timeline servers should 
> > > be the goal.
> > >
> > > --Harsha
> > >
> > >
> > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > @Cody - The design looks good. Does the design allow 
> > > > aggregating metrics at the task/executor level? Basically, the 
> > > > number of distinct metrics is proportional to the number of 
> > > > distinct tasks. Did you ever run into such a use case?
> > > >
> > > >
> > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere 
> > > > <[email protected]>
> > > > wrote:
> > > >
> > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > >
> > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere 
> > > > > <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > @Jungtaek,
> > > > > > > We did some tests on codahale metrics; compared to 
> > > > > > > meters/histograms, counters are quite fast. So we mainly 
> > > > > > > focused on optimizing meters and histograms (they are 
> > > > > > > indeed very slow), including double sampling, changing the 
> > > > > > > clock from ns (System.nanoTime) to ms, etc.
> > > > > > > You can take a look at the
> > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount" 
> > > > > > > class of our sequence-split-merge example code as the 
> > > > > > > client code entry point to metrics.
> > > > > > > After that, you may dig into the TopologyMaster class, 
> > > > > > > which is still part of a topology, then into 
> > > > > > > TopologyMetricsRunnable, which is part of the nimbus 
> > > > > > > server, and finally into the MetricUploader plugin, which 
> > > > > > > is where the metrics interact with our "metrics server". 
> > > > > > > There are still some nits in the code, but I think that 
> > > > > > > should be no big problem.
> > > > > >
> > > > > > > I'd also like to point out that our "metrics server" is not 
> > > > > > > strictly a real metrics server. Since most of the duty 
> > > > > > > lies with the nimbus server and topology master, it's more 
> > > > > > > appropriate to call it metrics storage. The main reason 
> > > > > > > for this is that we don't want to build a heavy-weight 
> > > > > > > metrics server out of JStorm, and this makes it very easy 
> > > > > > > for us to maintain (we have teams that specifically 
> > > > > > > maintain HBase/OTS in Alibaba since they're so commonly 
> > > > > > > used in production).
> > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim 
> > > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > >>
> > > > > >> Cody,
> > > > > > >> I took a look at the design doc and it looks promising, 
> > > > > > >> especially that it doesn't do sampling when the metric 
> > > > > > >> type is 'counter'. As far as I've heard (I didn't try 
> > > > > > >> it), there is a huge performance hit in Apache Storm when 
> > > > > > >> we change the sample rate to 1.0.
> > > > > > >> Could you point me to the entry point of the metrics 
> > > > > > >> feature in JStorm to dig into?
> > > > > > >>
> > > > > > >> And just out of curiosity, did you consider extracting 
> > > > > > >> the metrics feature (which is done with TopologyMasters 
> > > > > > >> and Nimbuses) into a separate component?
> > > > > > >> I understood your mention of 'metrics server' as a 
> > > > > > >> separate component, but after seeing the design doc, the 
> > > > > > >> feature seems to be implemented on Nimbus.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >>
> > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere 
> > > > > > >> <[email protected]> wrote:
> > > > > >>
> > > > > > >> > JStorm provides a MetricUploader interface, which is 
> > > > > > >> > similar to IMetricsConsumer in storm, and the 
> > > > > > >> > underlying implementation is pluggable: you can use 
> > > > > > >> > HBase, or any other KV store that supports timeline 
> > > > > > >> > queries, or even a database (maybe if it's a small 
> > > > > > >> > cluster). We provide model classes in jstorm-core; 
> > > > > > >> > what kinds of metrics data need to be stored is 
> > > > > > >> > totally up to the specific implementation. Our 
> > > > > > >> > internal implementation uses OTS, which is a product 
> > > > > > >> > of aliyun (https://www.aliyun.com/product/ots/), but 
> > > > > > >> > it's easy to adapt to other implementations.
> > > > > >> >
> > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans 
> > > > > > >> > <[email protected]> wrote:
> > > > > >> >
> > > > > > >> > > Yes we originally wanted to try and use the Hadoop 
> > > > > > >> > > Timeline Server for storm metrics feedback to nimbus 
> > > > > > >> > > + UI + history like server.  But it was not stable at 
> > > > > > >> > > the time, so we stopped.  For the sake of playing 
> > > > > > >> > > nicely with the rest of the big data ecosystem I 
> > > > > > >> > > would like to see us support it as an option for 
> > > > > > >> > > metrics collection/query, but not until the timeline 
> > > > > > >> > > server v2 is ready and released.  For me the 
> > > > > > >> > > important thing is that we have a decent time series 
> > > > > > >> > > DB that comes with storm by default and is pluggable 
> > > > > > >> > > so we can replace it with something else that has 
> > > > > > >> > > similar capabilities in the future.
> > > > > > >> > >  - Bobby
> > > > > >> > >
> > > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere 
> > > > > > >> > > <[email protected]> wrote:
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > It's actually in Phase 2 of porting JStorm, but I'm 
> > > > > > >> > > absolutely ok to discuss this in advance.
> > > > > >> > >
> > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere 
> > > > > > >> > > <[email protected]> wrote:
> > > > > >> > >
> > > > > > >> > > > Yes it's already in production.
> > > > > > >> > > > The implementation basically follows the design 
> > > > > > >> > > > document in 
> > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329; you 
> > > > > > >> > > > can take a look first and feel free to ask questions.
> > > > > >> > > >
> > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim 
> > > > > > >> > > > <[email protected]> wrote:
> > > > > >> > > >
> > > > > >> > > >> Hi,
> > > > > >> > > >>
> > > > > > >> > > >> I have something to do with metrics, so I'm looking 
> > > > > > >> > > >> through the pull requests which address metrics.
> > > > > > >> > > >> At #753 <https://github.com/apache/storm/pull/753> 
> > > > > > >> > > >> I found Cody said we (maybe that means the Alibaba 
> > > > > > >> > > >> team) are currently working on a Metrics Server.
> > > > > > >> > > >> (I also found a comment which said there was some 
> > > > > > >> > > >> talk a while ago about integrating the Hadoop 
> > > > > > >> > > >> timeline server. It seems no one came up with a 
> > > > > > >> > > >> result, and I prefer to avoid a big dependency, so 
> > > > > > >> > > >> I'm in favor of the Metrics Server for now.)
> > > > > > >> > > >>
> > > > > > >> > > >> I think that would make the metrics feature of 
> > > > > > >> > > >> Storm much better, so I'd like to see how the work 
> > > > > > >> > > >> is going. Of course, only if there's no issue for 
> > > > > > >> > > >> you with working transparently. I just would like 
> > > > > > >> > > >> to prevent duplication of work, and would like to 
> > > > > > >> > > >> help if needed and possible.
> > > > > >> > > >>
> > > > > >> > > >> Thanks,
> > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >> > > >>
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Abhishek Agarwal
> > >
> >
> >
> >
> >
>
