Re: Re: Question on Metrics Server to Alibaba team

John Fang Tue, 29 Mar 2016 19:12:08 -0700

@Harsha If we do not depend on external storage such as HDFS, we can depend on 
RocksDB.
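
For reference, a minimal sketch of how local storage could stay bounded without external storage, using the retention tiers Cody proposes further down this thread (1 hour of 1-min windows, 10 hours of 10-min windows, 5 days of 2-hour windows, 30 days of 1-day windows). The names `TIERS`, `max_windows_per_metric` and `prune` are hypothetical and are not taken from the JStorm code; a plain dict stands in for the local KV store.

```python
# (window size in seconds, retention in seconds), taken from the tiers
# proposed in this thread. Everything else here is an illustrative assumption.
TIERS = [
    (60, 3600),            # 1-min windows, kept for 1 hour
    (600, 10 * 3600),      # 10-min windows, kept for 10 hours
    (7200, 5 * 86400),     # 2-hour windows, kept for 5 days
    (86400, 30 * 86400),   # 1-day windows, kept for 30 days
]


def max_windows_per_metric(tiers=TIERS):
    """Upper bound on the number of stored windows for one metric."""
    return sum(retention // window for window, retention in tiers)


def prune(windows, now, retention):
    """Drop windows whose start time is older than now - retention.

    `windows` maps window start time (seconds) to an aggregated value.
    """
    cutoff = now - retention
    return {start: value for start, value in windows.items() if start >= cutoff}
```

With these tiers the per-metric footprint is a fixed number of windows regardless of how long a topology runs, which is the property that makes a local store plausible.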


-----Original Message-----
From: Harsha [mailto:[email protected]] 
Sent: March 30, 2016 9:36
To: [email protected]
Subject: Re: Re: Question on Metrics Server to Alibaba team

Another thing to consider is to store time series data rather than the current 
approach of storing 1-min, 10-min and 3-hr windows, and definitely not to 
depend on external storage such as HDFS.
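
To illustrate the distinction, here is a minimal sketch, assuming raw (timestamp, value) points are kept per metric and any window is computed at query time instead of being pre-aggregated into fixed buckets. `TimeSeriesStore` and its methods are hypothetical names, not part of Storm or JStorm.

```python
from bisect import bisect_left
from collections import defaultdict


class TimeSeriesStore:
    """Keeps raw (timestamp, value) points per metric, sorted by timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # metric -> sorted timestamps (seconds)
        self._values = defaultdict(list)      # metric -> values aligned with timestamps

    def append(self, metric, timestamp, value):
        # Assumes points arrive in non-decreasing timestamp order per metric.
        self._timestamps[metric].append(timestamp)
        self._values[metric].append(value)

    def window_sum(self, metric, start, end):
        """Sum of values with start <= timestamp < end."""
        ts = self._timestamps[metric]
        lo = bisect_left(ts, start)
        hi = bisect_left(ts, end)
        return sum(self._values[metric][lo:hi])


store = TimeSeriesStore()
for second in range(0, 600, 10):       # one point every 10 s for 10 minutes
    store.append("emitted", second, 5)

last_minute = store.window_sum("emitted", 540, 600)
last_ten_minutes = store.window_sum("emitted", 0, 600)
```

The same stored points answer both the 1-minute and the 10-minute query, so the window sizes do not have to be fixed in advance.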

On Fri, Mar 25, 2016, at 06:43 AM, Bobby Evans wrote:
> My concern is really around how much time/effort it is to get to a 
> final solution, and to ultimately maintain/support that solution.  If 
> I was doing this from scratch I would probably pull something off of 
> the shelf that is tested and has an entire community supporting it 
> instead of writing something ourselves from scratch.  But in this case 
> we have a solution from JStorm, that we know works.  Because this is 
> the backend that we are talking about we can switch things out later 
> on if we need to.  Like I said before I am fine with using the JStorm code 
> initially.
> I mostly want to be sure of a few things.
> 1. The metrics interface we expose to end users is well thought out 
> and can be extended in the future.
> 2. The interfaces that connect this front end to the back end are 
> thought out and we could replace the back end if needed.
> 3. The solution offers some level of high availability.  
> If Nimbus, a worker, etc. crash it is OK to lose some data, but we 
> don't want to
>  - Bobby
> 
>     On Friday, March 25, 2016 6:26 AM, Cody Innowhere
>     <[email protected]> wrote:
>  
> 
>  Bobby,
> I understand your concern. Still, I think our metrics design in JStorm 
> can work without any external service: as I mentioned above, we can 
> store metrics in RocksDB on the nimbus server. A rough thought would be: we 
> store the latest 1 hour of 1-min window data, 10 hours of 10-min 
> window data, 5 days of 2-hour window data, 30 days of 1-day window data, 
> etc. If there's a need to sync metrics data between nimbus 
> servers, we can add a sync thread to handle nimbus fail-over. Since 
> it's just metrics data that doesn't really matter too much, we can use a 
> plain, simple sync model.
> 
> The external service is another option for end users. If users feel 
> it's important (or maybe their business built on top of storm is very 
> important), they can use this external service to build their own 
> monitoring system, which can be more useful than the original solution 
> shipped with storm.
> 
> On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans 
> <[email protected]>
> wrote:
> 
> > The problem is that we want something for storm that can work out of 
> > the box, ideally without some other complicated external service 
> > (except zookeeper which we already have, and is not actually that 
> > complex to setup and run).
> > If we feel that we must have some external state store that is 
> > required for storm to run, then we need to make the decision 
> > carefully and deliberately.
> >  - Bobby
> >
> >    On Wednesday, March 23, 2016 8:37 AM, John Fang <[email protected]> wrote:
> >
> >
> >  Sorry, I misunderstood it. We will provide H/A for TopologyMaster, 
> > and the metrics meta will be stored in HDFS, so the metrics meta won't rely 
> > on nimbus. This can improve the stability of the metrics system.
> >
> > -----Original Message-----
> > From: Cody Innowhere [mailto:[email protected]]
> > Sent: March 23, 2016 19:59
> > To: [email protected]
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > If we don't rely on any external system, our metrics system is still 
> > available but will store metrics meta/data in RocksDB on nimbus servers.
> > There will be limits though. For example, we cannot store metrics 
> > data for the whole topology lifecycle: because RocksDB is only a 
> > KV store, it may not support efficient scan operations, and too 
> > much data on local disk may bring extra IO overhead. So we may 
> > have to store only the latest 1 hour of m1 data, 6 hours of m10 data, and so on 
> > (currently not implemented in JStorm, but quite easy to do).
> >
> > TopologyMaster is merely a channel for 
> > registering/computing/uploading metrics to nimbus, so if a TM goes 
> > down, the topology metrics will be unavailable for a while before it 
> > gets brought up somewhere else (for a normal failover case, this 
> > should be very fast), while supervisor/nimbus metrics are unaffected 
> > as they're sent to nimbus via the thrift interface. As soon as the TM is 
> > back, the topology metrics will be available again.
> >
> > Currently JStorm does sync metrics meta, but metrics data is not 
> > synced between multiple nimbus servers. So under a nimbus failure, 
> > we may lose some metrics data.
> >
> >
> > On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <[email protected]> wrote:
> >
> > > John,
> > >
> > > My concern is H/A of metrics on Storm by default. (I'm not 100% 
> > > sure Bobby pointed out the same things.)
> > >
> > > Apache Storm is used by a wide variety of users, so we can't 
> > > assume that users have knowledge of external systems (including 
> > > the Hadoop ecosystem, in my personal opinion) and can operate them smoothly.
> > > It reminds me of the importance of keeping the default setup in mind.
> > >
> > > Therefore, I'm curious whether the new metrics feature of JStorm can work 
> > > smoothly without an external system (HBase / OTS). I'd love to see it 
> > > support H/A without other systems, or else users have to tolerate loss 
> > > of metrics in some scenarios.
> > >
> > > I guess these may be valid questions on H/A (as far as my 
> > > understanding of the design doc is right): How do metrics work when 
> > > TopologyMaster is down?
> > > And how do metrics work when a failover of Nimbus occurs?
> > >
> > > Personally I don't mind losing metrics for short durations (I just 
> > > want to check availability of H/A), but a failure shouldn't mess up 
> > > the whole metrics.
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > On Wed, Mar 23, 2016 at 3:39 PM, John Fang <[email protected]> wrote:
> > >
> > > > @Bobby Evans The JStorm code has gone through a lot of testing over 
> > > > the past few years, especially for HA and scalability. We have done 
> > > > a lot of optimization on metrics. The performance is better than Flink 
> > > > in my tests. In my personal opinion, the metrics in JStorm offer 
> > > > a great deal of information, and they can tell us where the 
> > > > bottleneck is when we run a topology. The performance bottleneck may be 
> > > > serialization/deserialization/netty/executor 
> > > > and so on. Of course, there is also other good monitoring in 
> > > > the world, so I hope we can choose the better monitoring before 
> > > > phase 2. I will start studying Atlas. If it is better, I would be 
> > > > pleased to redesign the metrics with Atlas.
> > > > -----Original Message-----
> > > > From: Bobby Evans [mailto:[email protected]]
> > > > Sent: March 22, 2016 22:36
> > > > To: [email protected]
> > > > Subject: Re: Question on Metrics Server to Alibaba team
> > > >
> > > > My personal opinion is that we should not reinvent the wheel 
> > > > (aka distributed fault tolerant metrics) ourselves.  The local 
> > > > file blobstore with nimbus HA was a big enough pain to write and 
> > > > it is relatively simple in comparison.
> > > > If the JStorm code is simple and offers everything we need in 
> > > > terms of HA and scalability then I would be OK with it, but if 
> > > > it doesn't I would lean towards a different compatible open 
> > > > source solution.
> > > >
> > > > https://github.com/Netflix/atlas looks very promising as a 
> > > > default option.  It is actively maintained by a group that I 
> > > > think has some of the best monitoring in the world.  And it is 
> > > > both java and apache compatible.  It has no histogram support 
> > > > that I could find, but that I don't see as being super critical.  
> > > > The biggest drawback is there is little documentation on how to 
> > > > use it, to really be able to evaluate it for our needs. - Bobby
> > > >
> > > >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <[email protected]> wrote:
> > > >
> > > >
> > > >  Harsha,
> > > >
> > > > That's why I think the new metrics feature of JStorm looks promising.
> > > >
> > > > According to design doc on
> > > > https://issues.apache.org/jira/browse/STORM-1329,
> > > > there's no distinction between topology stats (which Apache Storm 
> > > > includes in the worker heartbeat) and built-in metrics (which should 
> > > > be handled with a separate consumer, as you stated).
> > > > All metrics are passed to Nimbus and Nimbus caches the metrics, 
> > > > which implies we can treat all metrics the same, and we can also 
> > > > provide built-in metrics (including custom metrics) to users via 
> > > > the REST API.
> > > >
> > > > I thought about a standalone metrics server process which handles 
> > > > all the metrics work (maybe TopologyMaster + Nimbus in the design 
> > > > doc), but if the current implementation of the metrics feature in 
> > > > JStorm can take care of what I'm assuming, I guess it's good enough.
> > > >
> > > > Since I don't know about TopologyMaster, I just wonder whether 
> > > > there are any SPOFs (including soft ones) and how metrics work when 
> > > > a SPOF component goes down.
> > > > Since Cody gave us a starting point to dig into, we can 
> > > > evaluate that feature before phase 2.
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <[email protected]> wrote:
> > > >
> > > > > One of the goals of this work, which can probably be addressed in a 
> > > > > separate jira, is how the topology metrics reporter works. 
> > > > > Today it's a bolt that's part of a topology graph, which means it's 
> > > > > another node in the Topology DAG that needs to be tuned for 
> > > > > better performance. Some of our users took performance hits by 
> > > > > deploying a topology metrics reporter that can send metrics to Ganglia.
> > > > > Ideally this collection should be asynchronous and not be a 
> > > > > node in the topology DAG.
> > > > >
> > > > > Shipping a default metrics server along with a pluggable 
> > > > > option for users who want to use graphite or other timeline 
> > > > > servers should be the goal.
> > > > >
> > > > > --Harsha
> > > > >
> > > > >
> > > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > > @Cody - The design looks good. Does the design allow 
> > > > > > aggregating metrics at the task/executor level? Basically, the 
> > > > > > number of distinct metrics is proportional to the number of 
> > > > > > distinct tasks. Did you ever run into such a use case?
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere 
> > > > > > <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere 
> > > > > > > <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > @Jungtaek,
> > > > > > > > We did some tests on codahale metrics. Compared to 
> > > > > > > > meters/histograms, counters are quite fast, so we mainly 
> > > > > > > > focused on optimizing meters and histograms (they are 
> > > > > > > > indeed very slow), including double sampling, changing 
> > > > > > > > the clock from ns (System.nanoTime) to ms, etc.
> > > > > > > > You can take a look at the 
> > > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > > class of our sequence-split-merge example code, as the 
> > > > > > > > client code entry point to metrics.
> > > > > > > > After that, you may dig into the TopologyMaster class, which 
> > > > > > > > is still part of a topology, then into TopologyMetricsRunnable, 
> > > > > > > > which is part of the nimbus server, and finally into the 
> > > > > > > > MetricUploader plugin, which is where the metrics interface 
> > > > > > > > with our "metrics server". There are still some nits in the 
> > > > > > > > code, but I think that should be no big problem.
> > > > > > > >
> > > > > > > > I'd also like to point out that our "metrics server" is 
> > > > > > > > not strictly a real metrics server. Since most of the duty 
> > > > > > > > lies on the nimbus server and topology master, it's more 
> > > > > > > > appropriate to call it metrics storage. The main 
> > > > > > > > reason for this is that we don't want to build a 
> > > > > > > > heavy-weight metrics server out of JStorm, and this makes 
> > > > > > > > it very easy for us to maintain (we have teams that 
> > > > > > > > specifically maintain HBase/OTS in Alibaba since they're 
> > > > > > > > so commonly used in production).
> > > > > > > >
> > > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim 
> > > > > > > > <[email protected]>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > > >>
> > > > > > > >> Cody,
> > > > > > > > >> I took a look at the design doc and it looks promising, 
> > > > > > > > >> especially that it doesn't do sampling when the metric 
> > > > > > > > >> type is 'counter'. As far as I've heard (I didn't try 
> > > > > > > > >> it), it becomes a huge performance hit in Apache Storm 
> > > > > > > > >> when we change the sample rate to 1.0.
> > > > > > > > >> Could you point me to the entry point of the metrics 
> > > > > > > > >> feature in JStorm to dig into?
> > > > > > > >>
> > > > > > > > >> And just out of curiosity, did you consider extracting 
> > > > > > > > >> the metrics feature (which is done with TopologyMasters 
> > > > > > > > >> and Nimbuses) into a separate component?
> > > > > > > > >> I understood your mention of 'metrics server' as a 
> > > > > > > > >> separate component, but after seeing the design doc, 
> > > > > > > > >> the feature seems to be implemented on Nimbus.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > >>
> > > > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <[email protected]> wrote:
> > > > > > > >>
> > > > > > > > >> > JStorm provides a MetricUploader interface, which 
> > > > > > > > >> > is similar to IMetricsConsumer in storm, and the 
> > > > > > > > >> > underlying implementation is pluggable: you can use 
> > > > > > > > >> > HBase, or any other KV store that supports timeline 
> > > > > > > > >> > queries, or even a database (maybe if it's a small 
> > > > > > > > >> > cluster). We provide model classes in jstorm-core; 
> > > > > > > > >> > as to what kinds of metrics data need to be stored, 
> > > > > > > > >> > it's totally up to the detailed implementation. Our 
> > > > > > > > >> > internal implementation uses OTS, which is a product 
> > > > > > > > >> > of aliyun (https://www.aliyun.com/product/ots/), 
> > > > > > > > >> > but it's easy to adapt to other implementations.
> > > > > > > >> >
> > > > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <[email protected]> wrote:
> > > > > > > >> >
> > > > > > > > >> > > Yes we originally wanted to try and use the Hadoop 
> > > > > > > > >> > > Timeline Server for storm metrics feedback to 
> > > > > > > > >> > > nimbus + UI + history like server. But it was not 
> > > > > > > > >> > > stable at the time, so we stopped.  For the sake of 
> > > > > > > > >> > > playing nicely with the rest of the big data 
> > > > > > > > >> > > ecosystem I would like to see us support it as an 
> > > > > > > > >> > > option for metrics collection/query, but not until 
> > > > > > > > >> > > the timeline server v2 is ready and released.  For 
> > > > > > > > >> > > me the important thing is that we have a decent 
> > > > > > > > >> > > time series DB that comes with storm by default and 
> > > > > > > > >> > > is pluggable so we can replace it with something 
> > > > > > > > >> > > else that has similar capabilities in the future.
> > > > > > > > >> > >  - Bobby
> > > > > > > >> > >
> > > > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody 
> > > > > > > >> > >Innowhere < [email protected]> wrote:
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > > >> > >  It's actually in Phase 2 of porting JStorm, but 
> > > > > > > > >> > > I'm absolutely OK to discuss this in advance.
> > > > > > > >> > >
> > > > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <[email protected]> wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Yes it's already in production.
> > > > > > > > >> > > > The implementation basically follows the design 
> > > > > > > > >> > > > document in 
> > > > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329; 
> > > > > > > > >> > > > you can take a look first and feel free to ask 
> > > > > > > > >> > > > questions.
> > > > > > > >> > > >
> > > > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <[email protected]> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > >> Hi,
> > > > > > > >> > > >>
> > > > > > > > >> > > >> I have some work to do with metrics, so I'm 
> > > > > > > > >> > > >> looking for the pull requests which address 
> > > > > > > > >> > > >> metrics.
> > > > > > > > >> > > >> At #753 
> > > > > > > > >> > > >> <https://github.com/apache/storm/pull/753> I 
> > > > > > > > >> > > >> found Cody said we (maybe it means the Alibaba 
> > > > > > > > >> > > >> team) are currently working on Metrics Server.
> > > > > > > > >> > > >> (I also found a comment which said there was some 
> > > > > > > > >> > > >> talk a while ago around integrating Hadoop 
> > > > > > > > >> > > >> timeline server. It seems no one came up with a 
> > > > > > > > >> > > >> result, and I prefer to avoid a big dependency, 
> > > > > > > > >> > > >> so I'm in favor of Metrics Server for now.)
> > > > > > > >> > > >>
> > > > > > > > >> > > >> I think that would improve the metrics feature of 
> > > > > > > > >> > > >> Storm a lot, so I'd like to see how the work is 
> > > > > > > > >> > > >> going. Of course, that's only if there's no 
> > > > > > > > >> > > >> issue for you to work transparently. I just 
> > > > > > > > >> > > >> would like to prevent duplication of work, and 
> > > > > > > > >> > > >> would like to help if needed and possible.
> > > > > > > >> > > >>
> > > > > > > >> > > >> Thanks,
> > > > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > >> > > >>
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Abhishek Agarwal
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
> 
>
