@Harsha If we do not depend on external storage such as HDFS, we can rely on RocksDB.
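
For illustration only, here is a minimal sketch of how tiered windows could be kept in a local KV store. It is not JStorm code: the class name TieredMetricStore, the key layout (metric, window, bucket start), and the mean aggregation are all assumptions, and a plain Python dict stands in for RocksDB. The tier sizes come from Cody's rough proposal quoted below (1 hour of 1-min data, 10 hours of 10-min data, 5 days of 2-hour data, 30 days of 1-day data).

```python
from collections import defaultdict

MINUTE, HOUR, DAY = 60, 3600, 86400

# (window length, retention) pairs, in seconds, from the proposal in the thread.
TIERS = [
    (1 * MINUTE, 1 * HOUR),
    (10 * MINUTE, 10 * HOUR),
    (2 * HOUR, 5 * DAY),
    (1 * DAY, 30 * DAY),
]


class TieredMetricStore:
    """Hypothetical in-memory stand-in for a local KV store such as RocksDB."""

    def __init__(self, tiers=TIERS):
        self.tiers = tiers
        # key: (metric, window, bucket_start) -> [sum, count]
        self.kv = defaultdict(lambda: [0.0, 0])

    def record(self, metric, ts, value):
        # Fold the sample into the bucket that covers ts in every tier.
        for window, _ in self.tiers:
            bucket_start = ts - ts % window
            cell = self.kv[(metric, window, bucket_start)]
            cell[0] += value
            cell[1] += 1

    def expire(self, now):
        # Drop buckets that ended more than one retention period ago.
        retention = dict(self.tiers)
        stale = [
            key for key in self.kv
            if key[2] + key[1] <= now - retention[key[1]]
        ]
        for key in stale:
            del self.kv[key]
        return len(stale)

    def query(self, metric, window):
        # Return (bucket_start, mean) pairs for one metric and window, oldest first.
        rows = [
            (start, total / count)
            for (m, w, start), (total, count) in self.kv.items()
            if m == metric and w == window
        ]
        return sorted(rows)
```

With keys ordered by (metric, window, bucket start), a sorted store could probably answer a time-range query with a single prefix scan, which might ease the scan concern raised in the thread; that would need to be checked against real data volumes.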

-----Original Message-----
From: Harsha [mailto:[email protected]]
Sent: March 30, 2016 9:36
To: [email protected]
Subject: Re: RE: Question on Metrics Server to Alibaba team

Another thing to consider is storing time series data rather than the
current windowed approach, where we store 1min, 10min, and 3hrs windows,
and definitely not depending on external storage such as HDFS.

On Fri, Mar 25, 2016, at 06:43 AM, Bobby Evans wrote:
> My concern is really around how much time/effort it is to get to a
> final solution, and to ultimately maintain/support that solution. If
> I was doing this from scratch I would probably pull something off of
> the shelf that is tested and has an entire community supporting it
> instead of writing something ourselves from scratch. But in this case
> we have a solution from JStorm, that we know works. Because this is
> the backend that we are talking about we can switch things out later
> on if we need to. Like I said before I am fine with using the JStorm
> code initially.
>
> I mostly want to be sure of a few things.
> 1. The metrics interface we expose to end users is well thought out
>    and can be extended in the future.
> 2. The interfaces that connect this front end to the back end are
>    thought out and we could replace the back end if needed.
> 3. The solution offers some level of high availability. If Nimbus, a
>    worker, etc. crash it is OK to lose some data, but we don't want to
> - Bobby
>
> On Friday, March 25, 2016 6:26 AM, Cody Innowhere
> <[email protected]> wrote:
>
> Bobby,
> I understand your concern. Still, I think our metrics design in JStorm
> can work without any external service; as I mentioned above, we can
> store metrics in rocksdb on nimbus server. A rough thought will be: we
> store the latest 1 hour of 1-min window data, 10 hours of 10-min
> window data, 5 days of 2-hour window data, 30 days of 1-day window,
> etc.
> And if there's the need to sync metrics data between nimbus
> servers, we can add a sync thread to handle nimbus fail-over; since
> it's just metrics data that doesn't really matter too much, we can use a
> plain simple sync model.
>
> The external service is another option for end users: if users feel
> it's important (or maybe their business built on top of storm is very
> important), they can use this external service to build their own
> monitoring system, which can be more useful than the original solution
> shipped with storm.
>
> On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans <[email protected]> wrote:
>
> > The problem is that we want something for storm that can work out of
> > the box, ideally without some other complicated external service
> > (except zookeeper which we already have, and is not actually that
> > complex to setup and run).
> > If we feel that we must have some external state store that is
> > required for storm to run, then we need to make the decision
> > carefully and deliberately.
> > - Bobby
> >
> > On Wednesday, March 23, 2016 8:37 AM, John Fang <[email protected]> wrote:
> >
> > Sorry, I misunderstood it. We will make TopologyMaster H/A.
> > And the metrics meta will be stored in HDFS, so the metrics meta won't
> > rely on the nimbus. It can enhance the stability of the metric system.
> >
> > -----Original Message-----
> > From: Cody Innowhere [mailto:[email protected]]
> > Sent: March 23, 2016 19:59
> > To: [email protected]
> > Subject: Re: Question on Metrics Server to Alibaba team
> >
> > If we don't rely on any external system, our metrics system is still
> > available but will store metrics meta/data in rocksdb on nimbus servers.
> > There will be limits though. For example, we cannot store metrics
> > data all through the topology lifecycle: because rocksdb is only a
> > KV storage, it may not support efficient scan operations, and too
> > much data on local disk may bring in extra IO overhead. So we may
> > have to store the latest 1 hour of m1 data, 6 hours of m10 data, and
> > so on (currently not implemented in JStorm, but quite easy to do).
> >
> > TopologyMaster is merely a channel for
> > registering/computing/uploading metrics to nimbus, so if a TM goes
> > down, the topology metrics will be unavailable for a while before it
> > gets pulled up somewhere else (for a normal failover case, this
> > should be very fast), while supervisor/nimbus metrics are unaffected
> > as they're sent to nimbus via thrift interface. As soon as TM is back,
> > the topology metrics will be available again.
> >
> > Currently JStorm does sync metrics meta, but metrics data between
> > multiple nimbus servers is not synced. So under a nimbus failure,
> > we may lose some metrics data.
> >
> > On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <[email protected]> wrote:
> >
> > > John,
> > >
> > > My concern is H/A of metrics on Storm by default. (I'm not 100%
> > > sure Bobby pointed out the same things.)
> > >
> > > Since Apache Storm is used by various users, we can't
> > > assume that users have knowledge of external systems (including
> > > the Hadoop ecosystem, personal opinion) and operate them smoothly.
> > > It reminds me of the importance of keeping the default in mind.
> > >
> > > Therefore, I'm curious whether the new metrics feature of JStorm can
> > > work smoothly without an external system (HBase / OTS). And I'd love
> > > to see it support H/A without other systems, or else users have to
> > > tolerate loss of metrics in some scenarios.
> > >
> > > I guess these may be valid questions on H/A (as far as my
> > > understanding of the design doc is right): How do metrics work when
> > > TopologyMaster is down?
> > > And how do metrics work when failover of Nimbus occurs?
> > >
> > > Personally I don't mind losing metrics for short durations (just
> > > want to check availability of H/A), but failure shouldn't mess up
> > > whole metrics.
> > >
> > > Thanks,
> > > Jungtaek Lim (HeartSaVioR)
> > >
> > > On Wed, Mar 23, 2016 at 3:39 PM, John Fang <[email protected]> wrote:
> > >
> > > > @Bobby Evans JStorm code has gone through a lot of testing over
> > > > the past few years, especially for HA and scalability. We have
> > > > done a lot of optimization on metrics. The performance is better
> > > > than Flink in my tests. In my personal opinion, the metrics in
> > > > JStorm offer a lot of information, and they can tell us where the
> > > > bottleneck is when we run a topology. The performance bottleneck
> > > > may be serialize/deserialize/netty/executor and so on. Of course,
> > > > there is also other good monitoring in the world. So I hope we
> > > > can choose the better monitoring before phase 2. And I will start
> > > > to study Atlas. If it is better, I will be happy to redesign the
> > > > metrics with Atlas.
> > > >
> > > > -----Original Message-----
> > > > From: Bobby Evans [mailto:[email protected]]
> > > > Sent: March 22, 2016 22:36
> > > > To: [email protected]
> > > > Subject: Re: Question on Metrics Server to Alibaba team
> > > >
> > > > My personal opinion is that we should not reinvent the wheel
> > > > (aka distributed fault tolerant metrics) ourselves. The local
> > > > file blobstore with nimbus HA was a big enough pain to write and
> > > > it is relatively simple in comparison.
> > > > If the JStorm code is simple and offers everything we need in
> > > > terms of HA and scalability then I would be OK with it, but if
> > > > it doesn't I would lean towards a different compatible open
> > > > source solution.
> > > >
> > > > https://github.com/Netflix/atlas looks very promising as a
> > > > default option. It is actively maintained by a group that I
> > > > think has some of the best monitoring in the world. And it is
> > > > both java and apache compatible. It has no histogram support
> > > > that I could find, but that I don't see as being super critical.
> > > > The biggest drawback is there is little documentation on how to
> > > > use it, to really be able to evaluate it for our needs.
> > > > - Bobby
> > > >
> > > > On Monday, March 21, 2016 7:29 PM, Jungtaek Lim <[email protected]> wrote:
> > > >
> > > > Harsha,
> > > >
> > > > That's why I think the new metric feature of JStorm looks promising.
> > > >
> > > > According to the design doc on
> > > > https://issues.apache.org/jira/browse/STORM-1329,
> > > > there's no distinction between topology stats (which Apache Storm
> > > > includes in the worker heartbeat) and built-in metrics (which
> > > > should be handled with a separate consumer, as you stated).
> > > > All metrics are passed to Nimbus and Nimbus caches metrics,
> > > > which implies we can treat all metrics the same, and we can also
> > > > provide built-in metrics (including custom metrics) to users via
> > > > REST API, too.
> > > >
> > > > I thought about a standalone metrics server process which handles
> > > > all metric work (maybe TopologyMaster + Nimbus in the design
> > > > doc), but if the current implementation of the metric feature in
> > > > JStorm can take care of what I'm assuming, I guess it's great
> > > > enough.
> > > >
> > > > Since I don't know about TopologyMaster, I just wonder whether
> > > > there are any SPOFs (including soft ones) and how metrics work
> > > > when a component that is a SPOF goes down.
> > > > Since Cody gave a starting point to dig into, we can
> > > > evaluate that feature before phase 2.
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <[email protected]> wrote:
> > > >
> > > > > One of the goals of this work, which can probably be addressed
> > > > > in a separate jira, is how the topology metrics reporter works.
> > > > > Today it's a bolt that's part of a topology graph, which means
> > > > > it's another node in the topology DAG that needs to be tuned
> > > > > for better performance. Some of our users took performance hits
> > > > > by deploying a topology metrics reporter that sends metrics to
> > > > > Ganglia. Ideally this collection should be asynchronous and not
> > > > > be a node in the topology DAG.
> > > > >
> > > > > Shipping a default metrics server along with a pluggable
> > > > > option for users who want Graphite or other timeline
> > > > > servers should be the goal.
> > > > >
> > > > > --Harsha
> > > > >
> > > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > > @Cody - The design looks good. Does the design allow
> > > > > > aggregating metrics at the task/executor level? Basically,
> > > > > > the number of distinct metrics is proportional to the number
> > > > > > of distinct tasks; did you ever run into such a use case?
> > > > > >
> > > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere <[email protected]> wrote:
> > > > > >
> > > > > > > Also, you can read the code from our latest release JStorm 2.1.1.
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere <[email protected]> wrote:
> > > > > > >
> > > > > > > > @Jungtaek,
> > > > > > > > We did some tests on codahale metrics; compared to
> > > > > > > > meters/histograms, counters are quite fast. So we mainly
> > > > > > > > focused on the optimization of meters and histograms
> > > > > > > > (they are indeed very slow), including double sampling,
> > > > > > > > changing the clock from ns (System.nanoTime) to ms, etc.
> > > > > > > > You can take a look at the
> > > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > > class of our sequence-split-merge example code, as the
> > > > > > > > client code entry to metrics.
> > > > > > > > After that, you may dig into the TopologyMaster class,
> > > > > > > > which is still part of a topology, and then into
> > > > > > > > TopologyMetricsRunnable, which is a part of the nimbus
> > > > > > > > server, and finally into the MetricUploader plugin; this
> > > > > > > > is where the metrics interface with our "metrics server".
> > > > > > > > Still, there are some nits in the code, but I think that
> > > > > > > > should be no big problem.
> > > > > > > >
> > > > > > > > I'd also like to point out that our "metrics server" is
> > > > > > > > not strictly a real metrics server: since most of the
> > > > > > > > duty lies on the nimbus server and topology master, it's
> > > > > > > > more appropriate to call it metrics storage.
> > > > > > > > The main reason for this is that we don't want to make a
> > > > > > > > heavy-weight metrics server out of JStorm, and this makes
> > > > > > > > it very easy for us to maintain (we have teams that
> > > > > > > > specifically maintain HBase/OTS in Alibaba since they're
> > > > > > > > so commonly used in production).
> > > > > > > >
> > > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim <[email protected]> wrote:
> > > > > > > >
> > > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > > >>
> > > > > > > >> Cody,
> > > > > > > >> I took a look at the design doc and it looks promising,
> > > > > > > >> especially that it doesn't do sampling when the metric
> > > > > > > >> type is 'counter'. As far as I heard (I didn't try it),
> > > > > > > >> it becomes a huge performance hit in Apache Storm when
> > > > > > > >> we change the sample rate to 1.0.
> > > > > > > >> Could you point me to the entry point of the metric
> > > > > > > >> feature in JStorm to dig into?
> > > > > > > >>
> > > > > > > >> And just a curiosity, did you consider extracting the
> > > > > > > >> metric feature (which is done with TopologyMasters and
> > > > > > > >> Nimbuses) into a separate component?
> > > > > > > >> I understood your mention of 'metrics server' as a
> > > > > > > >> separate component, but after seeing the design doc, the
> > > > > > > >> feature seems to be implemented on Nimbus.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > >>
> > > > > > > >> On Sat, Mar 19, 2016 at 1:25 AM, Cody Innowhere <[email protected]> wrote:
> > > > > > > >>
> > > > > > > >> > JStorm has provided a MetricUploader interface, which is similar to
> > > > > > > >> > IMetricsConsumer in storm, and the underlying implementation is pluggable:
> > > > > > > >> > you can use HBase, or any other KV store that supports timeline queries, or
> > > > > > > >> > even a database (maybe if it's a small cluster). We provide model classes
> > > > > > > >> > in jstorm-core; as to what kinds of metrics data need to be stored, it's
> > > > > > > >> > totally up to the detailed implementation. Our internal implementation uses
> > > > > > > >> > OTS, which is a product of aliyun (https://www.aliyun.com/product/ots/),
> > > > > > > >> > but it's easy to adapt to other implementations.
> > > > > > > >> >
> > > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans <[email protected]> wrote:
> > > > > > > >> >
> > > > > > > >> > > Yes we originally wanted to try and use the Hadoop Timeline Server for
> > > > > > > >> > > storm metrics feedback to nimbus + UI + history like server. But it was
> > > > > > > >> > > not stable at the time, so we stopped.
> > > > > > > >> > > For the sake of playing nicely with
> > > > > > > >> > > the rest of the big data ecosystem I would like to see us support it as an
> > > > > > > >> > > option for metrics collection/query, but not until the timeline server v2 is
> > > > > > > >> > > ready and released. For me the important thing is that we have a decent
> > > > > > > >> > > time series DB that comes with storm by default and is pluggable so we can
> > > > > > > >> > > replace it with something else that has similar capabilities in the future.
> > > > > > > >> > > - Bobby
> > > > > > > >> > >
> > > > > > > >> > > On Friday, March 18, 2016 10:39 AM, Cody Innowhere <[email protected]> wrote:
> > > > > > > >> > >
> > > > > > > >> > > It's actually in Phase 2 of porting JStorm, but I'm absolutely ok to
> > > > > > > >> > > discuss this in advance.
> > > > > > > >> > >
> > > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <[email protected]> wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Yes it's already in production.
> > > > > > > >> > > > The implementation basically follows the design document in
> > > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you can take a look
> > > > > > > >> > > > first and feel free to ask questions.
> > > > > > > >> > > >
> > > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <[email protected]> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > >> Hi,
> > > > > > > >> > > >>
> > > > > > > >> > > >> I have some work to do on metrics, so I'm looking through the pull requests
> > > > > > > >> > > >> which address metrics.
> > > > > > > >> > > >> And at #753 <https://github.com/apache/storm/pull/753> I found Cody said we
> > > > > > > >> > > >> (maybe it means Alibaba team) are currently working on Metrics Server.
> > > > > > > >> > > >> (I also found a comment which said there was some talk a while ago around
> > > > > > > >> > > >> integrating Hadoop timeline server. Seems like no one came up with a
> > > > > > > >> > > >> result, and I prefer to avoid a big dependency so I'm in favor of Metrics
> > > > > > > >> > > >> Server for now.)
> > > > > > > >> > > >>
> > > > > > > >> > > >> I think that would improve the metrics feature of Storm a lot, so I'd
> > > > > > > >> > > >> like to see how the work is going. Sure it's only when there's no issue for
> > > > > > > >> > > >> you to work transparently. I just would like to prevent duplication of
> > > > > > > >> > > >> work, and would like to help if needed and possible.
> > > > > > > >> > > >>
> > > > > > > >> > > >> Thanks,
> > > > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Abhishek Agarwal
