The problem is that we want something for Storm that can work out of the box, ideally without some other complicated external service (except ZooKeeper, which we already have and which is not actually that complex to set up and run). If we feel that we must have some external state store that is required for Storm to run, then we need to make the decision carefully and deliberately.
- Bobby

On Wednesday, March 23, 2016 8:37 AM, John Fang <xiaojian....@alibaba-inc.com> wrote:

Sorry, I misunderstood. We will add H/A for TopologyMaster, and the metrics meta will be stored in HDFS, so the metrics meta won't rely on Nimbus. This should improve the stability of the metrics system.

-----Original Message-----
From: Cody Innowhere [mailto:e.neve...@gmail.com]
Sent: March 23, 2016 19:59
To: dev@storm.apache.org
Subject: Re: Question on Metrics Server to Alibaba team

If we don't rely on any external system, our metrics system is still available, but it will store metrics meta/data in RocksDB on the Nimbus servers. There will be limits, though. For example, we cannot store metrics data for the whole topology lifecycle: RocksDB is only a KV store, so it may not support efficient scan operations, and too much data on local disk may bring extra IO overhead. So we may have to keep only, say, the latest 1 hour of m1 data and 6 hours of m10 data (currently not implemented in JStorm, but quite easy to do).

TopologyMaster is merely a channel for registering/computing/uploading metrics to Nimbus, so if a TM goes down, the topology metrics will be unavailable for a while before it gets pulled up somewhere else (for a normal failover case, this should be very fast), while supervisor/Nimbus metrics are unaffected, as they're sent to Nimbus via the Thrift interface. Once the TM is back, the topology metrics will be available again.

Currently JStorm syncs metrics meta between multiple Nimbus servers, but metrics data is not synced. So under a Nimbus failure, we may lose some metrics data.

On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <kabh...@gmail.com> wrote:

> John,
>
> My concern is H/A of metrics on Storm by default. (I'm not 100% sure
> Bobby pointed out the same thing.)
>
> Apache Storm is used by a wide variety of users, so we can't
> assume that users know external systems (including the Hadoop
> ecosystem, in my personal opinion) and can operate them smoothly.
> It reminds me of the importance of keeping the default in mind.
>
> Therefore, I'm curious whether the new metrics feature of JStorm can
> work smoothly without an external system (HBase / OTS). I'd love to
> see it support H/A without other systems; otherwise users have to
> tolerate loss of metrics in some scenarios.
>
> I guess these may be valid questions on H/A (as far as my
> understanding of the design doc is right): How do metrics work when
> TopologyMaster is down? And how do metrics work when a Nimbus
> failover occurs?
>
> Personally I don't mind losing metrics for short durations (I just
> want to check availability of H/A), but a failure shouldn't mess up
> the whole metrics system.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Mar 23, 2016 at 3:39 PM, John Fang <xiaojian....@alibaba-inc.com> wrote:
>
> > @Bobby Evans The JStorm code has gone through a lot of testing over
> > the past few years, especially for HA and scalability. We have done
> > a lot of optimization on metrics. The performance is better than
> > Flink in my tests. In my personal opinion, the metrics in JStorm
> > offer a lot of information, and they can tell us where the
> > bottleneck is when we run a topology. The performance bottleneck
> > may be serialization/deserialization, Netty, the executor, and so
> > on. Of course, there is also other good monitoring in the world, so
> > I hope we can choose the better monitoring before phase 2. I will
> > start studying Atlas. If it is better, I will be pleased to
> > redesign the metrics with Atlas.
> >
> > -----Original Message-----
> > From: Bobby Evans [mailto:ev...@yahoo-inc.com.INVALID]
> > Sent: March 22, 2016 22:36
> > To: dev@storm.apache.org
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > My personal opinion is that we should not reinvent the wheel (aka
> > distributed fault tolerant metrics) ourselves. The local file
> > blobstore with Nimbus HA was a big enough pain to write and it is
> > relatively simple in comparison.
> > If the JStorm code is simple and offers everything we need in terms
> > of HA and scalability then I would be OK with it, but if it doesn't
> > I would lean towards a different compatible open source solution.
> >
> > https://github.com/Netflix/atlas
> > looks very promising as a default option. It is actively maintained
> > by a group that I think has some of the best monitoring in the
> > world. And it is both Java and Apache compatible. It has no
> > histogram support that I could find, but I don't see that as being
> > super critical. The biggest drawback is that there is little
> > documentation on how to use it, which makes it hard to really
> > evaluate it for our needs.
> > - Bobby
> >
> > On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
> >
> > Harsha,
> >
> > That's why I think the new metrics feature of JStorm looks promising.
> >
> > According to the design doc at
> > https://issues.apache.org/jira/browse/STORM-1329,
> > there's no distinction between topology stats (which Apache Storm
> > includes in the worker heartbeat) and built-in metrics (which should
> > be handled by a separate consumer, as you stated).
> > All metrics are passed to Nimbus, and Nimbus caches them, which
> > implies we can treat all metrics the same, and we can also provide
> > built-in metrics (including custom metrics) to users via the REST
> > API.
> >
> > I thought about a standalone metrics server process that handles
> > all the metrics work (maybe TopologyMaster + Nimbus in the design
> > doc), but if the current implementation of the metrics feature in
> > JStorm can take care of what I'm assuming, I guess it's good enough.
> >
> > Since I don't know much about TopologyMaster, I just wonder whether
> > there are any SPOFs (including soft ones) and how metrics work when
> > a SPOF component goes down.
> > Since Cody gave us a starting point to dig into, we can evaluate
> > that feature before phase 2.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> >
> > > One of the goals of this work, which can probably be addressed in
> > > a separate JIRA, is how the topology metrics reporter works. Today
> > > it's a bolt that is part of the topology graph, which means it's
> > > another node in the topology DAG that needs to be tuned for better
> > > performance. Some of our users took performance hits by deploying
> > > a topology metrics reporter that sends metrics to Ganglia.
> > > Ideally this collection should be asynchronous and not be a node
> > > in the topology DAG.
> > >
> > > Shipping a default metrics server along with a pluggable option
> > > for users who want to use Graphite or other timeline servers
> > > should be the goal.
> > >
> > > --Harsha
> > >
> > >
> > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > @Cody - The design looks good. Does the design allow
> > > > aggregating metrics at the task/executor level? Basically, the
> > > > number of distinct metrics is proportional to the number of
> > > > distinct tasks. Did you ever run into such a use case?
> > > >
> > > >
> > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere <e.neve...@gmail.com> wrote:
> > > >
> > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > >
> > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere <e.neve...@gmail.com> wrote:
> > > > >
> > > > > > @Jungtaek,
> > > > > > We did some tests on Codahale metrics. Compared to
> > > > > > meters/histograms, counters are quite fast, so we mainly
> > > > > > focused on optimizing meters and histograms (they are
> > > > > > indeed very slow), including double sampling, changing the
> > > > > > clock from ns (System.nanoTime) to ms, etc.
> > > > > > You can take a look at the
> > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > class of our sequence-split-merge example code as the
> > > > > > client code entry point to metrics.
> > > > > > After that, you may dig into the TopologyMaster class, which
> > > > > > is still part of a topology, then into
> > > > > > TopologyMetricsRunnable, which is a part of the Nimbus
> > > > > > server, and finally into the MetricUploader plugin, which is
> > > > > > where the metrics interact with our "metrics server". There
> > > > > > are still some nits in the code, but I think that should be
> > > > > > no big problem.
> > > > > >
> > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > strictly a real metrics server: since most of the duty lies
> > > > > > with the Nimbus server and topology master, it's more
> > > > > > appropriate to call it metrics storage. The main reason for
> > > > > > this is that we don't want to make a heavyweight metrics
> > > > > > server out of JStorm, and this makes it very easy for us to
> > > > > > maintain (we have teams that specifically maintain HBase/OTS
> > > > > > in Alibaba since they're so commonly used in production).
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
> > > > > >
> > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > >>
> > > > > >> Cody,
> > > > > >> I took a look at the design doc and it looks promising,
> > > > > >> especially that it doesn't do sampling when the metric type
> > > > > >> is 'counter'. As far as I've heard (I didn't try it),
> > > > > >> changing the sample rate to 1.0 causes a huge performance
> > > > > >> hit in Apache Storm.
> > > > > >> Could you point me to the entry point of the metrics
> > > > > >> feature in JStorm to dig into?
> > > > > >>
> > > > > >> And just out of curiosity, did you consider extracting the
> > > > > >> metrics feature (which is implemented in TopologyMasters
> > > > > >> and Nimbuses) into a separate component?
> > > > > >> I understood your mention of a 'metrics server' as a
> > > > > >> separate component, but after seeing the design doc, the
> > > > > >> feature seems to be implemented in Nimbus.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >>
> > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <e.neve...@gmail.com> wrote:
> > > > > >>
> > > > > >> > JStorm provides a MetricUploader interface, which is
> > > > > >> > similar to IMetricsConsumer in Storm, and the underlying
> > > > > >> > implementation is pluggable: you can use HBase, or any
> > > > > >> > other KV store that supports timeline queries, or even a
> > > > > >> > database (maybe if it's a small cluster). We provide
> > > > > >> > model classes in jstorm-core; as to what kinds of metrics
> > > > > >> > data need to be stored, it's totally up to the detailed
> > > > > >> > implementation. Our internal implementation uses OTS,
> > > > > >> > which is a product of Aliyun
> > > > > >> > (https://www.aliyun.com/product/ots/), but it's easy to
> > > > > >> > adapt to other implementations.
> > > > > >> >
> > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:
> > > > > >> >
> > > > > >> > > Yes we originally wanted to try and use the Hadoop
> > > > > >> > > Timeline Server for storm metrics feedback to nimbus +
> > > > > >> > > UI + history like server.
> > > > > >> > > But it was not stable at the time, so we stopped. For
> > > > > >> > > the sake of playing nicely with the rest of the big
> > > > > >> > > data ecosystem I would like to see us support it as an
> > > > > >> > > option for metrics collection/query, but not until the
> > > > > >> > > timeline server v2 is ready and released. For me the
> > > > > >> > > important thing is that we have a decent time series DB
> > > > > >> > > that comes with Storm by default and is pluggable so we
> > > > > >> > > can replace it with something else that has similar
> > > > > >> > > capabilities in the future.
> > > > > >> > > - Bobby
> > > > > >> > >
> > > > > >> > > On Friday, March 18, 2016 10:39 AM, Cody Innowhere <e.neve...@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > > It's actually in Phase 2 of porting JStorm, but I'm
> > > > > >> > > absolutely OK to discuss this in advance.
> > > > > >> > >
> > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <e.neve...@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > > > Yes it's already in production.
> > > > > >> > > > The implementation basically follows the design
> > > > > >> > > > document in
> > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329; you
> > > > > >> > > > can take a look first and feel free to ask questions.
> > > > > >> > > >
> > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
> > > > > >> > > >
> > > > > >> > > >> Hi,
> > > > > >> > > >>
> > > > > >> > > >> I have some work to do on metrics, so I'm looking
> > > > > >> > > >> through the pull requests that address metrics.
> > > > > >> > > >> At #753 <https://github.com/apache/storm/pull/753>
> > > > > >> > > >> I found that Cody said "we" (maybe meaning the
> > > > > >> > > >> Alibaba team) are currently working on a Metrics
> > > > > >> > > >> Server.
> > > > > >> > > >> (I also found a comment saying there was some talk
> > > > > >> > > >> a while ago about integrating the Hadoop timeline
> > > > > >> > > >> server. It seems no one came up with a result, and
> > > > > >> > > >> I prefer to avoid a big dependency, so I'm in favor
> > > > > >> > > >> of the Metrics Server for now.)
> > > > > >> > > >>
> > > > > >> > > >> I think that would improve the metrics feature of
> > > > > >> > > >> Storm a lot, so I'd like to see how the work is
> > > > > >> > > >> going. Of course, only if there's no issue with you
> > > > > >> > > >> working transparently. I just would like to prevent
> > > > > >> > > >> duplication of work, and would like to help if
> > > > > >> > > >> needed and possible.
> > > > > >> > > >>
> > > > > >> > > >> Thanks,
> > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > >
> > > > --
> > > > Regards,
> > > > Abhishek Agarwal
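
For illustration, the bounded local retention that Cody describes in the thread (keeping only, say, the latest 1 hour of m1 data and 6 hours of m10 data when no external store is used) could be sketched roughly as below. This is only a sketch under stated assumptions and not JStorm code: the class name RetentionWindow, its methods, and the in-memory deque are all hypothetical stand-ins (the thread mentions RocksDB as the local store), and timestamps are assumed to arrive in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch (not part of JStorm or Storm): keeps the data points of
 * one metric window, e.g. m1 or m10, and drops points older than a fixed age.
 */
class RetentionWindow {

    /** One aggregated data point: a timestamp in milliseconds and its value. */
    private static final class Point {
        final long timestampMs;
        final double value;

        Point(long timestampMs, double value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final long maxAgeMs;
    private final Deque<Point> points = new ArrayDeque<>();

    /** @param maxAgeMs how far back to keep data, relative to the newest point */
    RetentionWindow(long maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
    }

    /**
     * Appends a point, then evicts every point that is more than maxAgeMs older
     * than it. Assumes callers add points in non-decreasing timestamp order.
     */
    void add(long timestampMs, double value) {
        points.addLast(new Point(timestampMs, value));
        while (timestampMs - points.peekFirst().timestampMs > maxAgeMs) {
            points.removeFirst();
        }
    }

    /** Number of points currently retained. */
    int size() {
        return points.size();
    }

    /** Timestamp of the oldest retained point; throws NoSuchElementException if empty. */
    long oldestTimestampMs() {
        return points.getFirst().timestampMs;
    }
}
```

Under these assumptions, an m1 series would use `new RetentionWindow(60L * 60 * 1000)` (1 hour) and an m10 series `new RetentionWindow(6L * 60 * 60 * 1000)` (6 hours). Data older than the window is simply lost, which matches the limitation Cody points out for running without an external system.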