@Harsha, we already use rocksdb to store time series data rather than just the latest window values.
@Bobby, I will think about HA and post a detailed document for review
(together with the MetricUploader interface) later.

On Wed, Mar 30, 2016 at 9:35 AM, Harsha <st...@harsha.io> wrote:

> Another thing to consider is to store time series data, not the current
> approach where we store 1min, 10min, 3hrs windows, and definitely
> not depend on external storage such as HDFS.
>
> On Fri, Mar 25, 2016, at 06:43 AM, Bobby Evans wrote:
> > My concern is really around how much time/effort it is to get to a final
> > solution, and to ultimately maintain/support that solution. If I was
> > doing this from scratch I would probably pull something off of the shelf
> > that is tested and has an entire community supporting it instead of
> > writing something ourselves from scratch. But in this case we have a
> > solution from JStorm that we know works. Because this is the backend
> > that we are talking about we can switch things out later on if we need
> > to. Like I said before I am fine with using the JStorm code initially.
> > I mostly want to be sure of a few things.
> > 1. The metrics interface we expose to end users is well thought out and
> >    can be extended in the future.
> > 2. The interfaces that connect this front end to the back end are
> >    thought out and we could replace the back end if needed.
> > 3. The solution offers some level of high availability. If Nimbus,
> >    a worker, etc. crash it is OK to lose some data, but we don't want to
> > - Bobby
> >
> > On Friday, March 25, 2016 6:26 AM, Cody Innowhere
> > <e.neve...@gmail.com> wrote:
> >
> > Bobby,
> > I understand your concern. Still, I think our metrics design in JStorm
> > can work without any external service. As I mentioned above, we can
> > store metrics in rocksdb on the nimbus server. A rough thought would be:
> > we store the latest 1 hour of 1-min window data, 10 hours of 10-min
> > window data, 5 days of 2-hour window data, 30 days of 1-day window
> > data, etc.
> > And if there's the need to sync metrics data between nimbus servers,
> > we can add a sync thread to handle nimbus fail-over. Since it's just
> > metrics data that doesn't really matter too much, we can use a plain,
> > simple sync model.
> >
> > The external service is another option for end users. If users feel
> > it's important (or maybe their business built on top of storm is very
> > important), they can use this external service to build their own
> > monitoring system, which can be more useful than the original solution
> > shipped with storm.
> >
> > On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans
> > <ev...@yahoo-inc.com.invalid> wrote:
> >
> > > The problem is that we want something for storm that can work out of
> > > the box, ideally without some other complicated external service
> > > (except zookeeper, which we already have and which is not actually
> > > that complex to set up and run).
> > > If we feel that we must have some external state store that is
> > > required for storm to run, then we need to make the decision
> > > carefully and deliberately.
> > > - Bobby
> > >
> > > On Wednesday, March 23, 2016 8:37 AM, John Fang
> > > <xiaojian....@alibaba-inc.com> wrote:
> > >
> > > Sorry, I misunderstood. We will make TopologyMaster H/A. Metrics
> > > meta will be stored in HDFS, so the metrics meta won't rely on
> > > nimbus. This can enhance the stability of the metrics system.
> > >
> > > -----Original Message-----
> > > From: Cody Innowhere [mailto:e.neve...@gmail.com]
> > > Sent: March 23, 2016 19:59
> > > To: dev@storm.apache.org
> > > Subject: Re: Question on Metrics Server to Alibaba team
> > >
> > > If we don't rely on any external system, our metrics system is still
> > > available but will store metrics meta/data in rocksdb on nimbus
> > > servers.
> > > There will be limits though. For example, we cannot store metrics
> > > data all through the topology lifecycle: rocksdb is only a KV
> > > store, it may not support efficient scan operations, and too much
> > > data on local disk may bring in extra IO overhead. So we may have to
> > > store the latest 1 hour of m1 data, 6 hours of m10 data, and so on
> > > (currently not implemented in JStorm, but quite easy to do).
> > >
> > > TopologyMaster is merely a channel for
> > > registering/computing/uploading metrics to nimbus, so if a TM goes
> > > down, the topology metrics will be unavailable for a while before it
> > > gets pulled up somewhere else (for a normal failover case, this
> > > should be very fast), while supervisor/nimbus metrics are unaffected
> > > as they're sent to nimbus via the thrift interface. As soon as the TM
> > > is back, the topology metrics will be available again.
> > >
> > > Currently JStorm does sync metrics meta, but metrics data is not
> > > synced between multiple nimbus servers. So under a nimbus failure,
> > > we may lose some metrics data.
> > >
> > > On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <kabh...@gmail.com>
> > > wrote:
> > >
> > > > John,
> > > >
> > > > My concern is H/A of metrics on Storm by default. (I'm not 100%
> > > > sure Bobby pointed out the same thing.)
> > > >
> > > > Since Apache Storm is used by a wide variety of users, we can't
> > > > assume that users have knowledge of external systems (including
> > > > the Hadoop ecosystem, in my personal opinion) and can operate them
> > > > smoothly. It reminds me of the importance of keeping the default
> > > > in mind.
> > > >
> > > > Therefore, I'm curious whether the new metrics feature of JStorm
> > > > can work smoothly without an external system (HBase / OTS). I'd
> > > > love to see it support H/A without other systems; otherwise users
> > > > have to tolerate loss of metrics in some scenarios.
> > > >
> > > > I guess these may be valid questions on H/A (as far as my
> > > > understanding of the design doc is right): How do metrics work
> > > > when TopologyMaster is down? And how do metrics work when a
> > > > failover of Nimbus occurs?
> > > >
> > > > Personally I don't mind losing metrics for short durations (I just
> > > > want to check the availability of H/A), but a failure shouldn't
> > > > mess up the whole metrics.
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > On Wed, Mar 23, 2016 at 3:39 PM, John Fang
> > > > <xiaojian....@alibaba-inc.com> wrote:
> > > >
> > > > > @Bobby Evans, the JStorm code has been through a lot of testing
> > > > > over the past few years, especially for HA and scalability. We
> > > > > have done a lot of optimization on metrics. The performance is
> > > > > better than Flink in my tests. In my personal opinion, the
> > > > > metrics in JStorm offer a great deal of information, and they
> > > > > can tell us where the bottleneck is when we run a topology. The
> > > > > performance bottleneck may be
> > > > > serialize/deserialize/netty/executor and so on. Of course, there
> > > > > is also other good monitoring in the world, so I hope we can
> > > > > choose the better monitoring before phase 2. I will start
> > > > > studying Atlas. If it is better, I would be pleased to redesign
> > > > > the metrics with Atlas.
> > > > >
> > > > > -----Original Message-----
> > > > > From: Bobby Evans [mailto:ev...@yahoo-inc.com.INVALID]
> > > > > Sent: March 22, 2016 22:36
> > > > > To: dev@storm.apache.org
> > > > > Subject: Re: Question on Metrics Server to Alibaba team
> > > > >
> > > > > My personal opinion is that we should not reinvent the wheel
> > > > > (aka distributed fault tolerant metrics) ourselves. The local
> > > > > file blobstore with nimbus HA was a big enough pain to write and
> > > > > it is relatively simple in comparison.
> > > > > If the JStorm code is simple and offers everything we need in
> > > > > terms of HA and scalability then I would be OK with it, but if
> > > > > it doesn't I would lean towards a different compatible open
> > > > > source solution.
> > > > >
> > > > > https://github.com/Netflix/atlas
> > > > > looks very promising as a default option. It is actively
> > > > > maintained by a group that I think has some of the best
> > > > > monitoring in the world. And it is both java and apache
> > > > > compatible. It has no histogram support that I could find, but I
> > > > > don't see that as being super critical. The biggest drawback is
> > > > > that there is little documentation on how to use it, so it is
> > > > > hard to really evaluate it for our needs.
> > > > > - Bobby
> > > > >
> > > > > On Monday, March 21, 2016 7:29 PM, Jungtaek Lim
> > > > > <kabh...@gmail.com> wrote:
> > > > >
> > > > > Harsha,
> > > > >
> > > > > That's why I think the new metrics feature of JStorm looks
> > > > > promising.
> > > > >
> > > > > According to the design doc on
> > > > > https://issues.apache.org/jira/browse/STORM-1329,
> > > > > there's no distinction between topology stats (which Apache
> > > > > Storm includes in the worker heartbeat) and built-in metrics
> > > > > (which should be handled by a separate consumer, as you stated).
> > > > > All metrics are passed to Nimbus and Nimbus caches them, which
> > > > > implies we can treat all metrics the same, and we can also
> > > > > provide built-in metrics (including custom metrics) to users via
> > > > > the REST API, too.
> > > > >
> > > > > I thought about a standalone metrics server process which
> > > > > handles all the metrics work (maybe TopologyMaster + Nimbus in
> > > > > the design doc), but if the current implementation of the
> > > > > metrics feature in JStorm can take care of what I'm assuming, I
> > > > > guess it's good enough.
> > > > >
> > > > > Since I don't know about TopologyMaster, I just wonder whether
> > > > > there are any SPOFs (including soft ones) and how metrics work
> > > > > when such a component goes down.
> > > > > Since Cody gave us a starting point to dig into, we can
> > > > > evaluate that feature before phase 2.
> > > > >
> > > > > Thanks,
> > > > > Jungtaek Lim (HeartSaVioR)
> > > > >
> > > > > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <st...@harsha.io> wrote:
> > > > >
> > > > > > One of the goals of this work, which can probably be addressed
> > > > > > in a separate jira, is how the topology metrics reporter
> > > > > > works. Today it's a bolt that's part of a topology graph,
> > > > > > which means it's another node in the topology DAG that needs
> > > > > > to be tuned for better performance. Some of our users took
> > > > > > performance hits by deploying a topology metrics reporter that
> > > > > > sends metrics to Ganglia. Ideally this collection should be
> > > > > > asynchronous and not be a node in the topology DAG.
> > > > > >
> > > > > > Shipping a default metrics server along with a pluggable
> > > > > > option for users who want to use graphite or other timeline
> > > > > > servers should be the goal.
> > > > > >
> > > > > > --Harsha
> > > > > >
> > > > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > > > @Cody - The design looks good. Does the design allow
> > > > > > > aggregating metrics at the task/executor level? Basically,
> > > > > > > the number of distinct metrics is proportional to the number
> > > > > > > of distinct tasks. Did you ever run into such a use case?
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
> > > > > > > <e.neve...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Also, you can read the code from our latest release,
> > > > > > > > JStorm 2.1.1.
> > > > > > > >
> > > > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
> > > > > > > > <e.neve...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > @Jungtaek,
> > > > > > > > > We did some tests on codahale metrics. Compared to
> > > > > > > > > meters/histograms, counters are quite fast. So we mainly
> > > > > > > > > focused on the optimization of meters and histograms
> > > > > > > > > (they are indeed very slow), including double sampling,
> > > > > > > > > changing the clock from ns (System.nanoTime) to ms, etc.
> > > > > > > > > You can take a look at the
> > > > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > > > class of our sequence-split-merge example code as the
> > > > > > > > > client code entry point to metrics.
> > > > > > > > > After that, you may dig into the TopologyMaster class,
> > > > > > > > > which is still part of a topology, then into
> > > > > > > > > TopologyMetricsRunnable, which is part of the nimbus
> > > > > > > > > server, and finally into the MetricUploader plugin,
> > > > > > > > > which is where the metrics interface with our "metrics
> > > > > > > > > server". There are still some nits in the code, but I
> > > > > > > > > think that should be no big problem.
> > > > > > > > >
> > > > > > > > > I'd also like to point out that our "metrics server" is
> > > > > > > > > not strictly a real metrics server. Since most of the
> > > > > > > > > duty lies with the nimbus server and topology master,
> > > > > > > > > it's more appropriate to call it metrics storage.
> > > > > > > > > The main reason for this is that we don't want to make a
> > > > > > > > > heavy-weight metrics server out of JStorm, and this
> > > > > > > > > makes it very easy for us to maintain (we have teams
> > > > > > > > > that specifically maintain HBase/OTS in Alibaba since
> > > > > > > > > they're so commonly used in production).
> > > > > > > > >
> > > > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
> > > > > > > > > <kabh...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > > > >>
> > > > > > > > >> Cody,
> > > > > > > > >> I took a look at the design doc and it looks promising,
> > > > > > > > >> especially that it doesn't do sampling when the metric
> > > > > > > > >> type is 'counter'. As far as I heard (I didn't try it)
> > > > > > > > >> it becomes a huge performance hit in Apache Storm when
> > > > > > > > >> we change the sample rate to 1.0.
> > > > > > > > >> Could you point me to the entry point of the metrics
> > > > > > > > >> feature in JStorm to dig into?
> > > > > > > > >>
> > > > > > > > >> And just out of curiosity, did you consider extracting
> > > > > > > > >> the metrics feature (which is done with TopologyMasters
> > > > > > > > >> and Nimbuses) into a separate component?
> > > > > > > > >> I understood your mention of 'metrics server' as a
> > > > > > > > >> separate component, but after seeing the design doc,
> > > > > > > > >> the feature seems to be implemented on Nimbus.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > > >>
> > > > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere
> > > > > > > > >> <e.neve...@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >> > JStorm provides a MetricUploader interface, which is
> > > > > > > > >> > similar to IMetricsConsumer in storm, and the
> > > > > > > > >> > underlying implementation is pluggable. You can use
> > > > > > > > >> > HBase, or any other KV store that supports timeline
> > > > > > > > >> > queries, or even a database (maybe for a small
> > > > > > > > >> > cluster). We provide model classes in jstorm-core; as
> > > > > > > > >> > to what kinds of metrics data need to be stored, it's
> > > > > > > > >> > totally up to the detailed implementation. Our
> > > > > > > > >> > internal implementation uses OTS, which is a product
> > > > > > > > >> > of aliyun (https://www.aliyun.com/product/ots/), but
> > > > > > > > >> > it's easy to adapt to other implementations.
> > > > > > > > >> >
> > > > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > > > > > > >> > <ev...@yahoo-inc.com.invalid> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Yes we originally wanted to try and use the Hadoop
> > > > > > > > >> > > Timeline Server for storm metrics feedback to
> > > > > > > > >> > > nimbus + UI + history like server. But it was not
> > > > > > > > >> > > stable at the time, so we stopped.
> > > > > > > > >> > > For the sake of playing nicely with the rest of the
> > > > > > > > >> > > big data ecosystem I would like to see us support
> > > > > > > > >> > > it as an option for metrics collection/query, but
> > > > > > > > >> > > not until the timeline server v2 is ready and
> > > > > > > > >> > > released. For me the important thing is that we
> > > > > > > > >> > > have a decent time series DB that comes with storm
> > > > > > > > >> > > by default and is pluggable so we can replace it
> > > > > > > > >> > > with something else that has similar capabilities
> > > > > > > > >> > > in the future.
> > > > > > > > >> > > - Bobby
> > > > > > > > >> > >
> > > > > > > > >> > > On Friday, March 18, 2016 10:39 AM, Cody Innowhere
> > > > > > > > >> > > <e.neve...@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > It's actually in Phase 2 of porting JStorm, but I'm
> > > > > > > > >> > > absolutely ok to discuss this in advance.
> > > > > > > > >> > >
> > > > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere
> > > > > > > > >> > > <e.neve...@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Yes it's already in production.
> > > > > > > > >> > > > The implementation basically follows the design
> > > > > > > > >> > > > document in
> > > > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329.
> > > > > > > > >> > > > You can take a look first and feel free to ask
> > > > > > > > >> > > > questions.
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim
> > > > > > > > >> > > > <kabh...@gmail.com> wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > >> Hi,
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> I have some work to do on metrics, so I'm
> > > > > > > > >> > > >> looking through the pull requests which address
> > > > > > > > >> > > >> metrics.
> > > > > > > > >> > > >> At #753
> > > > > > > > >> > > >> <https://github.com/apache/storm/pull/753> I
> > > > > > > > >> > > >> found Cody said we (maybe it means the Alibaba
> > > > > > > > >> > > >> team) are currently working on Metrics Server.
> > > > > > > > >> > > >> (I also found a comment which said there was
> > > > > > > > >> > > >> some talk a while ago around integrating the
> > > > > > > > >> > > >> Hadoop timeline server. It seems no one came up
> > > > > > > > >> > > >> with a result, and I prefer to avoid a big
> > > > > > > > >> > > >> dependency, so I'm in favor of Metrics Server
> > > > > > > > >> > > >> for now.)
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> I think that would improve the metrics feature
> > > > > > > > >> > > >> of Storm a lot, so I'd like to see how the work
> > > > > > > > >> > > >> is going. Of course, only if there's no issue
> > > > > > > > >> > > >> for you with working transparently. I just
> > > > > > > > >> > > >> would like to prevent duplication of work, and
> > > > > > > > >> > > >> would like to help if needed and possible.
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> Thanks,
> > > > > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Abhishek Agarwal
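
A note on the retention scheme discussed in this thread: Cody's rough proposal (the latest 1 hour of 1-min window data, 10 hours of 10-min, 5 days of 2-hour, 30 days of 1-day) can be sketched as a small tier table. This is only an illustration under assumptions, not JStorm code: the class RetentionTiers, its methods, and the tier names h2 and d1 are hypothetical (m1 and m10 are the names used in the thread). It shows which tier would serve a data point of a given age, and an upper bound on how many points per metric such a scheme would keep.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the tiered retention proposed in the thread. */
public class RetentionTiers {

    /** One tier: values aggregated over `window` are kept for `retention`. */
    public static final class Tier {
        public final String name;
        public final Duration window;
        public final Duration retention;

        Tier(String name, Duration window, Duration retention) {
            this.name = name;
            this.window = window;
            this.retention = retention;
        }
    }

    // Ordered from finest to coarsest window, using the numbers from the thread.
    static final List<Tier> TIERS = Arrays.asList(
        new Tier("m1", Duration.ofMinutes(1), Duration.ofHours(1)),
        new Tier("m10", Duration.ofMinutes(10), Duration.ofHours(10)),
        new Tier("h2", Duration.ofHours(2), Duration.ofDays(5)),
        new Tier("d1", Duration.ofDays(1), Duration.ofDays(30)));

    /** Finest tier that still retains a point of the given age, or null if expired. */
    public static Tier tierFor(Duration age) {
        for (Tier t : TIERS) {
            if (age.compareTo(t.retention) <= 0) {
                return t;
            }
        }
        return null;
    }

    /** Upper bound on the number of stored points per metric across all tiers. */
    public static long maxPointsPerMetric() {
        long total = 0;
        for (Tier t : TIERS) {
            total += t.retention.toMillis() / t.window.toMillis();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(tierFor(Duration.ofMinutes(30)).name); // m1
        System.out.println(tierFor(Duration.ofHours(3)).name);    // m10
        System.out.println(maxPointsPerMetric());                 // 210
    }
}
```

With these tiers a single metric needs at most 210 stored points (60 + 60 + 60 + 30), which suggests why a local KV store such as rocksdb could be enough for the out-of-the-box case, while longer histories would go to a pluggable backend.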