Re: Reply: Question on Metrics Server to Alibaba team

Cody Innowhere Tue, 29 Mar 2016 19:18:13 -0700

@Harsha,
We already use RocksDB to store time series data rather than only the
latest window values.
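
To illustrate the idea (a minimal sketch only, not the JStorm code): in an
ordered key-value store such as RocksDB, a common way to keep a time series is
to put the timestamp into the key, so that the points of one metric sort by
time and a time range becomes a key range. The key layout, the
`TimeSeriesStore` class, and the in-memory sorted list standing in for the
store are all assumptions made for this example. It also assumes metric names
contain no NUL byte.

```python
import bisect
import struct


def make_key(metric: str, ts_ms: int) -> bytes:
    """Build a key whose byte order matches (metric, timestamp) order.

    Hypothetical layout: metric name, a 0x00 separator, then the timestamp
    as an 8-byte big-endian unsigned integer.
    """
    return metric.encode("utf-8") + b"\x00" + struct.pack(">Q", ts_ms)


class TimeSeriesStore:
    """In-memory stand-in for an ordered KV store (not a RocksDB binding)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, metric: str, ts_ms: int, value: float) -> None:
        key = make_key(metric, ts_ms)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, metric: str, start_ms: int, end_ms: int):
        """Return [(ts_ms, value)] with start_ms <= ts_ms < end_ms."""
        lo = bisect.bisect_left(self._keys, make_key(metric, start_ms))
        hi = bisect.bisect_left(self._keys, make_key(metric, end_ms))
        out = []
        for key in self._keys[lo:hi]:
            # The last 8 bytes of every key are the timestamp.
            ts_ms = struct.unpack(">Q", key[-8:])[0]
            out.append((ts_ms, self._values[key]))
        return out
```

With this layout, a query such as "the last hour of one metric" is a single
range read over adjacent keys instead of a lookup per window.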


@Bobby,
I will think about HA and post a detailed document for review (together
with MetricUploader interface) later.

On Wed, Mar 30, 2016 at 9:35 AM, Harsha <[email protected]> wrote:

> Another thing to consider is storing time series data instead of the
> current approach of storing 1min, 10min, and 3hr windows, and definitely
> not depending on external storage such as HDFS.
>
> On Fri, Mar 25, 2016, at 06:43 AM, Bobby Evans wrote:
> > My concern is really around how much time/effort it is to get to a final
> > solution, and to ultimately maintain/support that solution.  If I was
> > doing this from scratch I would probably pull something off of the shelf
> > that is tested and has an entire community supporting it instead of
> > writing something ourselves from scratch.  But in this case we have a
> > solution from JStorm, that we know works.  Because this is the backend
> > that we are talking about we can switch things out later on if we need
> > to.  Like I said before I am fine with using the JStorm code initially.
> > I mostly want to be sure of a few things.
> > 1. The metrics interface we expose to end users is well thought out and
> >    can be extended in the future.
> > 2. The interfaces that connect this front end to the back end are thought
> >    out and we could replace the back end if needed.
> > 3. The solution offers some level of high availability.  If Nimbus, a
> >    worker, etc. crash it is OK to lose some data, but we don't want to
> >  - Bobby
> >
> >     On Friday, March 25, 2016 6:26 AM, Cody Innowhere
> >     <[email protected]> wrote:
> >
> >
> >  Bobby,
> > I understand your concern. Still, I think our metrics design in JStorm can
> > work without any external service. As I mentioned above, we can store
> > metrics in RocksDB on the nimbus server. A rough idea: we store the latest
> > 1 hour of 1-min window data, 10 hours of 10-min window data, 5 days of
> > 2-hour window data, 30 days of 1-day window data, etc. If there is a need
> > to sync metrics data between nimbus servers, we can add a sync thread to
> > handle nimbus fail-over. Since it is just metrics data, which doesn't
> > matter too much, we can use a plain, simple sync model.
> >
> > The external service is another option for end users: if users feel it's
> > important (or maybe their business built on top of storm is very
> > important), they can use this external service to build their own
> > monitoring system, which can be more useful than the original solution
> > shipped with storm.
> >
> > On Fri, Mar 25, 2016 at 2:09 AM, Bobby Evans
> > <[email protected]>
> > wrote:
> >
> > > The problem is that we want something for storm that can work out of the
> > > box, ideally without some other complicated external service (except
> > > zookeeper which we already have, and is not actually that complex to set
> > > up and run).
> > > If we feel that we must have some external state store that is required
> > > for storm to run, then we need to make the decision carefully and
> > > deliberately.
> > >  - Bobby
> > >
> > >    On Wednesday, March 23, 2016 8:37 AM, John Fang <
> > > [email protected]> wrote:
> > >
> > >
> > >  Sorry, I misunderstood. We will provide H/A for TopologyMaster, and the
> > > metrics meta will be stored in HDFS, so the metrics meta won't rely on
> > > nimbus. This can improve the stability of the metrics system.
> > >
> > > -----Original Message-----
> > > From: Cody Innowhere [mailto:[email protected]]
> > > Sent: March 23, 2016 19:59
> > > To: [email protected]
> > > Subject: Re: Question on Metrics Server to Alibaba team
> > >
> > > If we don't rely on any external system, our metrics system is still
> > > available but will store metrics meta/data in RocksDB on the nimbus
> > > servers. There will be limits though. For example, we cannot store
> > > metrics data for the whole topology lifecycle: RocksDB is only a KV
> > > store, it may not support efficient scan operations, and too much data on
> > > local disk may add extra IO overhead. So we may have to store only the
> > > latest 1 hour of m1 data, 6 hours of m10 data, and so on (currently not
> > > implemented in JStorm, but quite easy to do).
> > >
> > > TopologyMaster is merely a channel for registering/computing/uploading
> > > metrics to nimbus, so if a TM goes down, the topology metrics will be
> > > unavailable for a while until it is brought up somewhere else (for a
> > > normal failover case, this should be very fast), while supervisor/nimbus
> > > metrics are unaffected as they're sent to nimbus via the thrift
> > > interface. As soon as the TM is back, the topology metrics will be
> > > available again.
> > >
> > > Currently JStorm syncs metrics meta, but metrics data is not synced
> > > between multiple nimbus servers. So under a nimbus failure, we may lose
> > > some metrics data.
> > >
> > >
> > > On Wed, Mar 23, 2016 at 3:19 PM, Jungtaek Lim <[email protected]> wrote:
> > >
> > > > John,
> > > >
> > > > My concern is H/A of metrics on Storm by default. (I'm not 100% sure
> > > > Bobby pointed out the same thing.)
> > > >
> > > > Apache Storm is used by a wide variety of users, so we can't assume
> > > > that users know external systems (including the Hadoop ecosystem, in
> > > > my personal opinion) and can operate them smoothly.
> > > > It reminds me how important it is to keep the default setup in mind.
> > > >
> > > > Therefore, I'm curious whether the new metrics feature of JStorm can
> > > > work smoothly without an external system (HBase / OTS). I'd love to see
> > > > it support H/A without other systems, or else users have to tolerate
> > > > loss of metrics in some scenarios.
> > > >
> > > > I guess these may be valid questions on H/A (as far as my
> > > > understanding of the design doc is right): How do metrics work when
> > > > TopologyMaster is down? And how do metrics work when a Nimbus failover
> > > > occurs?
> > > >
> > > > Personally I don't mind losing metrics for short durations (I just want
> > > > to check the availability of H/A), but a failure shouldn't mess up the
> > > > whole metrics system.
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > > On Wed, Mar 23, 2016 at 3:39 PM, John Fang <[email protected]> wrote:
> > > >
> > > > > @Bobby Evans The JStorm code has been through a lot of testing over
> > > > > the past few years, especially for HA and scalability. We have done a
> > > > > lot of optimization on metrics. The performance is better than Flink
> > > > > in my tests. In my personal opinion, the metrics in JStorm offer a
> > > > > great deal of information, and they can tell us where the bottleneck
> > > > > is when we run a topology. The performance bottleneck may be
> > > > > serialization/deserialization/netty/executor and so on. Of course,
> > > > > there is also other good monitoring in the world, so I hope we can
> > > > > choose the better monitoring before phase 2. I will start studying
> > > > > Atlas. If it is better, I will be pleased to redesign the metrics
> > > > > with Atlas.
> > > > > -----Original Message-----
> > > > > From: Bobby Evans [mailto:[email protected]]
> > > > > Sent: March 22, 2016 22:36
> > > > > To: [email protected]
> > > > > Subject: Re: Question on Metrics Server to Alibaba team
> > > > >
> > > > > My personal opinion is that we should not reinvent the wheel (aka
> > > > > distributed fault tolerant metrics) ourselves.  The local file
> > > > > blobstore with nimbus HA was a big enough pain to write and it is
> > > > > relatively simple in comparison.
> > > > > If the JStorm code is simple and offers everything we need in terms
> > > > > of HA and scalability then I would be OK with it, but if it doesn't
> > > > > I would lean towards a different compatible open source solution.
> > > > >
> > > > > https://github.com/Netflix/atlas
> > > > > looks very promising as a default option.  It is actively maintained
> > > > > by a group that I think has some of the best monitoring in the
> > > > > world.  And it is both java and apache compatible.  It has no
> > > > > histogram support that I could find, but that I don't see as being
> > > > > super critical.  The biggest drawback is there is little
> > > > > documentation on how to use it, to really be able to evaluate it for
> > > > > our needs. - Bobby
> > > > >
> > > > >    On Monday, March 21, 2016 7:29 PM, Jungtaek Lim
> > > > > <[email protected]>
> > > > > wrote:
> > > > >
> > > > >
> > > > >  Harsha,
> > > > >
> > > > > That's why I think the new metrics feature of JStorm looks promising.
> > > > >
> > > > > According to the design doc on
> > > > > https://issues.apache.org/jira/browse/STORM-1329,
> > > > > there's no distinction between topology stats (which Apache Storm
> > > > > includes in the worker heartbeat) and built-in metrics (which should
> > > > > be handled with a separate consumer, as you stated).
> > > > > All metrics are passed to Nimbus and Nimbus caches them, which
> > > > > implies we can treat all metrics the same, and we can also provide
> > > > > built-in metrics (including custom metrics) to users via the REST
> > > > > API, too.
> > > > >
> > > > > I thought about a standalone metrics server process which handles
> > > > > all the metrics work (maybe TopologyMaster + Nimbus in the design
> > > > > doc), but if the current implementation of the metrics feature in
> > > > > JStorm can take care of what I'm assuming, I guess it's good enough.
> > > > >
> > > > > Since I don't know about TopologyMaster, I just wonder whether there
> > > > > are any SPOFs (including soft ones) and how metrics work when a SPOF
> > > > > component goes down.
> > > > > Since Cody gave us a starting point to dig into, we can evaluate
> > > > > that feature before phase 2.
> > > > >
> > > > > Thanks,
> > > > > Jungtaek Lim (HeartSaVioR)
> > > > >
> > > > > On Tue, Mar 22, 2016 at 1:36 AM, Harsha <[email protected]> wrote:
> > > > >
> > > > > > One of the goals of this work, which can probably be addressed in
> > > > > > a separate jira, is how the topology metrics reporter works. Today
> > > > > > it's a bolt that's part of a topology graph, which means it's
> > > > > > another node in the topology DAG that needs to be tuned for better
> > > > > > performance. Some of our users took performance hits by deploying
> > > > > > a topology metrics reporter that sends metrics to Ganglia.
> > > > > > Ideally this collection should be asynchronous and not be a node in
> > > > > > the topology DAG.
> > > > > >
> > > > > > Shipping a default metrics server along with a pluggable option
> > > > > > for users who want to use graphite or other timeline servers should
> > > > > > be the goal.
> > > > > >
> > > > > > --Harsha
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 21, 2016, at 08:49 AM, Abhishek Agarwal wrote:
> > > > > > > @Cody - The design looks good. Does the design allow aggregating
> > > > > > > metrics at the task/executor level? Basically, the number of
> > > > > > > distinct metrics is proportional to the number of distinct
> > > > > > > tasks. Did you ever run into such a use case?
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Mar 21, 2016 at 8:46 PM, Cody Innowhere
> > > > > > > <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Also, you can read the code from our latest release, JStorm
> > > > > > > > 2.1.1.
> > > > > > > >
> > > > > > > > On Mon, Mar 21, 2016 at 11:10 PM, Cody Innowhere
> > > > > > > > <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > @Jungtaek,
> > > > > > > > > We did some tests on codahale metrics. Compared to
> > > > > > > > > meters/histograms, counters are quite fast, so we mainly
> > > > > > > > > focused on optimizing meters and histograms (they are
> > > > > > > > > indeed very slow), including double sampling, changing the
> > > > > > > > > clock from ns (System.nanoTime) to ms, etc.
> > > > > > > > > You can take a look at the
> > > > > > > > > "com.alipay.dw.jstorm.example.sequence.bolt.TotalCount"
> > > > > > > > > class of our sequence-split-merge example code as the
> > > > > > > > > client code entry point to metrics.
> > > > > > > > > After that, you may dig into the TopologyMaster class,
> > > > > > > > > which is still part of a topology, then into
> > > > > > > > > TopologyMetricsRunnable, which is part of the nimbus
> > > > > > > > > server, and finally into the MetricUploader plugin, which
> > > > > > > > > is where the metrics interface with our "metrics server".
> > > > > > > > > There are still some nits in the code, but I think that
> > > > > > > > > should be no big problem.
> > > > > > > > >
> > > > > > > > > I'd also like to point out that our "metrics server" is not
> > > > > > > > > strictly a real metrics server. Since most of the duty lies
> > > > > > > > > on the nimbus server and topology master, it's more
> > > > > > > > > appropriate to call it metrics storage. The main reason for
> > > > > > > > > this is that we don't want to build a heavy-weight metrics
> > > > > > > > > server out of JStorm, and this makes it very easy for us to
> > > > > > > > > maintain (we have teams that specifically maintain
> > > > > > > > > HBase/OTS in Alibaba since they're so commonly used in
> > > > > > > > > production).
> > > > > > > > >
> > > > > > > > > On Mon, Mar 21, 2016 at 10:54 PM, Jungtaek Lim
> > > > > > > > > <[email protected]>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Thanks Cody and Bobby for the explanation.
> > > > > > > > >>
> > > > > > > > >> Cody,
> > > > > > > > > >> I took a look at the design doc and it looks promising,
> > > > > > > > > >> especially that it doesn't do sampling when the metric
> > > > > > > > > >> type is 'counter'. As far as I've heard (I didn't try it),
> > > > > > > > > >> it becomes a huge performance hit in Apache Storm when we
> > > > > > > > > >> change the sample rate to 1.0.
> > > > > > > > > >> Could you point me to the entry point of the metrics
> > > > > > > > > >> feature in JStorm to dig into?
> > > > > > > > > >>
> > > > > > > > > >> And just out of curiosity, did you consider extracting the
> > > > > > > > > >> metrics feature (which is done with TopologyMasters and
> > > > > > > > > >> Nimbuses) into a separate component?
> > > > > > > > > >> I understood your mention of 'metrics server' as a
> > > > > > > > > >> separate component, but after seeing the design doc, the
> > > > > > > > > >> feature seems to be implemented on Nimbus.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > > >>
> > > > > > > > >> 2016년 3월 19일 (토) 오전 1:25, Cody Innowhere
> > > > > > > > >> <[email protected]>님이
> > > > > > 작성:
> > > > > > > > >>
> > > > > > > > > >> > JStorm provides a MetricUploader interface, which is
> > > > > > > > > >> > similar to IMetricsConsumer in storm, and the underlying
> > > > > > > > > >> > implementation is pluggable: you can use HBase, any
> > > > > > > > > >> > other KV store that supports timeline queries, or even a
> > > > > > > > > >> > database (maybe if it's a small cluster). We provide
> > > > > > > > > >> > model classes in jstorm-core; as to what kinds of
> > > > > > > > > >> > metrics data need to be stored, it's totally up to the
> > > > > > > > > >> > detailed implementation. Our internal implementation
> > > > > > > > > >> > uses OTS, which is a product of aliyun
> > > > > > > > > >> > (https://www.aliyun.com/product/ots/), but it's easy to
> > > > > > > > > >> > adapt to other implementations.
> > > > > > > > >> >
> > > > > > > > >> > On Fri, Mar 18, 2016 at 11:52 PM, Bobby Evans
> > > > > > > > >> <[email protected]
> > > > > > > > >> > >
> > > > > > > > >> > wrote:
> > > > > > > > >> >
> > > > > > > > > >> > > Yes we originally wanted to try and use the Hadoop
> > > > > > > > > >> > > Timeline Server for storm metrics feedback to nimbus +
> > > > > > > > > >> > > UI + history like server.  But it was not stable at
> > > > > > > > > >> > > the time, so we stopped.  For the sake of playing
> > > > > > > > > >> > > nicely with the rest of the big data ecosystem I would
> > > > > > > > > >> > > like to see us support it as an option for metrics
> > > > > > > > > >> > > collection/query, but until the timeline server v2 is
> > > > > > > > > >> > > ready and released.  For me the important thing is
> > > > > > > > > >> > > that we have a decent time series DB that comes with
> > > > > > > > > >> > > storm by default and is pluggable so we can replace it
> > > > > > > > > >> > > with something else that has similar capabilities in
> > > > > > > > > >> > > the future.
> > > > > > > > >> > >  - Bobby
> > > > > > > > >> > >
> > > > > > > > >> > >    On Friday, March 18, 2016 10:39 AM, Cody Innowhere
> <
> > > > > > > > >> > >[email protected]> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > > >> > >  It's actually in Phase 2 of porting JStorm, but I'm
> > > > > > > > > >> > > absolutely ok to discuss this in advance.
> > > > > > > > >> > >
> > > > > > > > >> > > On Fri, Mar 18, 2016 at 11:31 PM, Cody Innowhere <
> > > > > > > > [email protected]
> > > > > > > > >> >
> > > > > > > > >> > > wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Yes it's already in production.
> > > > > > > > > >> > > > The implementation basically follows the design
> > > > > > > > > >> > > > document in
> > > > > > > > > >> > > > https://issues.apache.org/jira/browse/STORM-1329, you
> > > > > > > > > >> > > > can take a look first and feel free to ask questions.
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Fri, Mar 18, 2016 at 10:19 PM, Jungtaek Lim <
> > > > > > [email protected]
> > > > > > > > >
> > > > > > > > >> > > wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > >> Hi,
> > > > > > > > >> > > >>
> > > > > > > > > >> > > >> I have some work to do on metrics, so I've been
> > > > > > > > > >> > > >> looking for pull requests that address metrics.
> > > > > > > > > >> > > >> At #753
> > > > > > > > > >> > > >> <https://github.com/apache/storm/pull/753> I found
> > > > > > > > > >> > > >> Cody said "we" (maybe meaning the Alibaba team) are
> > > > > > > > > >> > > >> currently working on a Metrics Server.
> > > > > > > > > >> > > >> (I also found a comment which said there was some
> > > > > > > > > >> > > >> talk a while ago about integrating Hadoop timeline
> > > > > > > > > >> > > >> server. It seems no one came up with a result, and
> > > > > > > > > >> > > >> I prefer to avoid a big dependency, so I'm in favor
> > > > > > > > > >> > > >> of Metrics Server for now.)
> > > > > > > > > >> > > >>
> > > > > > > > > >> > > >> I think that would greatly improve the metrics
> > > > > > > > > >> > > >> feature of Storm, so I'd like to see how the work is
> > > > > > > > > >> > > >> going. Of course, that is only if there's no issue
> > > > > > > > > >> > > >> with you working in the open. I just would like to
> > > > > > > > > >> > > >> prevent duplication of work, and would like to help
> > > > > > > > > >> > > >> if needed and possible.
> > > > > > > > >> > > >>
> > > > > > > > >> > > >> Thanks,
> > > > > > > > >> > > >> Jungtaek Lim (HeartSaVioR)
> > > > > > > > >> > > >>
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Abhishek Agarwal
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
>
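
The tiered retention idea discussed in the thread (1 hour of 1-min windows,
10 hours of 10-min windows, 5 days of 2-hour windows, 30 days of 1-day
windows) could be sketched roughly as follows. This is an illustration only,
not JStorm code: the `TIERS` table, the `TieredMetrics` class, the function
names, and the choice to sum values within a window are all assumptions.

```python
from collections import defaultdict

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

# (window size in seconds, retention in seconds). The numbers come from the
# proposal in the thread; the data structure itself is hypothetical.
TIERS = [
    (1 * MINUTE, 1 * HOUR),
    (10 * MINUTE, 10 * HOUR),
    (2 * HOUR, 5 * DAY),
    (1 * DAY, 30 * DAY),
]


def window_start(ts: int, window: int) -> int:
    """Align a timestamp (seconds) down to the start of its window."""
    return ts - ts % window


class TieredMetrics:
    def __init__(self, tiers=TIERS):
        self.tiers = tiers
        # One {window_start: aggregated value} map per tier.
        self.data = [defaultdict(float) for _ in tiers]

    def record(self, ts: int, value: float) -> None:
        """Add a value to the matching window of every tier (summing)."""
        for (window, _), buckets in zip(self.tiers, self.data):
            buckets[window_start(ts, window)] += value

    def prune(self, now: int) -> None:
        """Drop windows that ended at or before the tier's retention cutoff."""
        for (window, retention), buckets in zip(self.tiers, self.data):
            cutoff = now - retention
            for start in [s for s in buckets if s + window <= cutoff]:
                del buckets[start]
```

Because every tier keeps a bounded number of windows (60 one-minute windows,
60 ten-minute windows, and so on), the amount of data per metric stays
roughly constant no matter how long a topology runs.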
