Re: [PROPOSAL] External Join with KV Stores

2017-08-28 Thread JingsongLee
Yes, the runner can hold the entire side input in the right way. But it will be
somewhat wasteful in the case of large amounts of data.
Best, Jingsong Lee

------------------------------------------------------------------
From:Lukasz Cwik <lc...@google.com.INVALID>
Time:2017 Aug 25 (Fri) 23:26
To:dev <dev@beam.apache.org>
Cc:JingsongLee <jingsongl...@gmail.com>
Subject:Re: [PROPOSAL] External Join with KV Stores
Jingsong, what do you mean by "the batch data is too large"?

To my knowledge, nothing requires an SDK/runner to hold the entire side
input in memory. Lists, maps, iterables, ... can all be broken up into
smaller segments which can be loaded, cached and discarded separately.

On Thu, Aug 24, 2017 at 5:10 PM, Mingmin Xu <mingm...@gmail.com> wrote:

> I want to bring up this thread, as we're looking for a similar feature in SQL.
> Please point me to it if something already exists; I didn't find any JIRA task.
>
> Now the streaming+batch/batch+batch join is implemented with sideInput.
> It's not a one-size-fits-all rule: as Jingsong mentioned, the batch data may be
> too large, and it may change periodically. A userland PTransform
> sounds like a more straightforward option, as it doesn't require support at the
> runner level.
>
> Mingmin
>
> On Mon, Jul 17, 2017 at 8:56 PM, JingsongLee <lzljs3620...@aliyun.com>
> wrote:
>
> > Sorry for taking so long to reply.
> > Hi Aljoscha, I think the Async I/O operator and Batch are much the same, and
> > Async is the better interface. All IO-related operations may be more
> > appropriate for asynchronous use. Just like you said, at the beginning there
> > would be no special support from the runners.
> > I really like Luke's idea: let the user see a SeekableRead + SideInput
> > interface, and the runner layer will optimize it into direct access to the
> > external store. This requires a suitable SeekableRead interface and more
> > efficient compiler optimization.
> > Kenn's idea is exciting. If we can have an interface similar to FileSystem
> > (maybe like SeekableRead) that abstracts and unifies an interface over
> > multiple KV stores, we can let users see only the concepts of Beam rather
> > than a specific KV store.
> > Best, Jingsong Lee
> > 
> > ------------------------------------------------------------------
> > From:Kenneth Knowles <k...@google.com.INVALID>
> > Time:2017 Jul 7 (Fri) 11:43
> > To:dev <dev@beam.apache.org>
> > Cc:JingsongLee <lzljs3620...@aliyun.com>
> > Subject:Re: [PROPOSAL] External Join with KV Stores
> > In the streams/tables way of talking, side inputs are tables. External KV
> > stores are basically also [globally windowed] tables. Both
> > are time-varying.
> >
> > I think it makes perfect sense to access an external KV store in userland
> > directly rather than listen to its changelog and reproduce the same table
> > as a multimap side input. I'm sure many users are already doing this. I'm
> > sure users will always do this. Providing a common interface (simpler than
> > Filesystem) and helpful transform(s) in an extension module seems nice.
> > Does it require any support in the core SDK?
> >
> > If I understand, Luke & Robert, you favor adding metadata to Read/SDF so
> > that a user _does_ write it as a changelog listener that is observed as a
> > multimap side input, and each runner optimizes it if they can to just
> > directly access the KV store? A runner is free to use any kind of storage
> > they like to materialize a side input anyhow, so this is surely possible,
> > but it is a "sufficiently smart compiler" issue. As for semantics, I'm not
> > worried about availability - it is globally windowed and always available.
> > But I think this requires retractions to be correctly equivalent to direct
> > access.
> >
> > I think we can have a userland PTransform in much less time than a model
> > concept, so I favor it.
> >
> > Kenn
> >
> >
>
>
> --
> 
> Mingmin
>



Re: [ANNOUNCEMENT] New committers, August 2017 edition!

2017-08-17 Thread JingsongLee
Thank you all! And congratulations to other new committers.

Best, Jingsong Lee
------------------------------------------------------------------
From:Mark Liu
Time:2017 Aug 16 (Wed) 02:18
To:dev
Subject:Re: [ANNOUNCEMENT] New committers, August 2017 edition!
Congrats! Excellent works!

On Mon, Aug 14, 2017 at 11:50 PM, Aviem Zur  wrote:

> Congrats!
>
> On Mon, Aug 14, 2017 at 6:43 PM Tyler Akidau wrote:
>
> > Congrats and thanks all around!
> >
> > On Sat, Aug 12, 2017 at 12:09 AM Aljoscha Krettek wrote:
> >
> > > Congrats, everyone! It's well deserved.
> > >
> > > Best,
> > > Aljoscha
> > >
> > > > On 12. Aug 2017, at 08:06, Pei HE  wrote:
> > > >
> > > > Congratulations to all!
> > > > --
> > > > Pei
> > > >
> > > > On Sat, Aug 12, 2017 at 10:50 AM, James wrote:
> > > >
> > > >> Thank you guys, glad to contribute to this great project, and
> > > >> congratulations to all the new committers!
> > > >>
> > > >> On Sat, Aug 12, 2017 at 8:36 AM Manu Zhang wrote:
> > > >>
> > > >>> Thanks everyone !!! It's a great journey.
> > > >>> Congrats to other new committers !
> > > >>>
> > > >>> Thanks,
> > > >>> Manu
> > > >>>
> > > >>> On Sat, Aug 12, 2017 at 5:23 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > >>>
> > >  Congrats and welcome !
> > > 
> > >  Regards
> > >  JB
> > > 
> > >  On 08/11/2017 07:40 PM, Davor Bonaci wrote:
> > > > Please join me and the rest of the Beam PMC in welcoming the following
> > > > contributors as our newest committers. They have significantly contributed
> > > > to the project in different ways, and we look forward to many more
> > > > contributions in the future.
> > > >
> > > > * Reuven Lax
> > > > Reuven has been with the project since the very beginning, contributing
> > > > mostly to the core SDK and the GCP IO connectors. He accumulated 52 commits
> > > > (19,824 ++ / 12,039 --). Most recently, Reuven re-wrote several IO
> > > > connectors that significantly expanded their functionality. Additionally,
> > > > Reuven authored important new design documents relating to update and
> > > > snapshot functionality.
> > > >
> > > > * Jingsong Lee
> > > > Jingsong has been contributing to Apache Beam since the beginning of the
> > > > year, particularly to the Flink runner. He has accumulated 34 commits
> > > > (11,214 ++ / 6,314 --) of deep, fundamental changes that significantly
> > > > improved the quality of the runner. Additionally, Jingsong has contributed
> > > > to the project in other ways too -- reviewing contributions, and
> > > > participating in discussions on the mailing list, design documents, and
> > > > JIRA issue tracker.
> > > >
> > > > * Mingmin Xu
> > > > Mingmin started the SQL DSL effort, and has driven it to the point of
> > > > merging to the master branch. In this effort, he extended the project to
> > > > a significant new user community.
> > > >
> > > > * Mingming (James) Xu
> > > > James joined the SQL DSL effort, contributing some of the trickier parts,
> > > > such as the Join functionality. Additionally, he's consistently shown
> > > > himself to be an insightful code reviewer, significantly impacting the
> > > > project’s code quality and ensuring the success of the new major component.
> > > >
> > > > * Manu Zhang
> > > > Manu initiated and developed a runner for the Apache Gearpump (incubating)
> > > > engine, and has driven it to the point of merging to the master branch. In
> > > > this effort, he accumulated 65 commits (7,812 ++ / 4,882 --) and extended
> > > > the project to a new user community.
> > > >
> > > > Congratulations to all five! Welcome!
> > > >
> > > > Davor
> > > >
> > > 
> > >  --
> > >  Jean-Baptiste Onofré
> > >  jbono...@apache.org
> > >  http://blog.nanthrax.net
> > >  Talend - http://www.talend.com
> > > 
> > > >>>
> > > >>
> > >
> > >
> >
>


[PROPOSAL] External Join with KV Stores

2017-07-02 Thread JingsongLee
Hi all:
In some scenarios, the user needs to query some information from an external KV
store in the pipeline. I think we can have a good abstraction that lets users
deal as little as possible with the underlying details. Here is a doc for this
proposal; I would like to receive your feedback.
https://docs.google.com/document/d/1B-XnUwXh64lbswRieckU0BxtygSV58hysqZbpZmk03A/edit?usp=sharing
Best, Jingsong Lee



Re: [DISCUSS] Apache Beam 2.1.0 release next week ?

2017-06-21 Thread JingsongLee
Very nice to see this release. Will it include the merge of DSL_SQL?
Pleased to see BEAM-1612 can be completed. (Not a blocker)
https://issues.apache.org/jira/browse/BEAM-1612
Best, JingsongLee
------------------------------------------------------------------
From:Kenneth Knowles <k...@google.com.INVALID>
Time:2017 Jun 22 (Thu) 10:43
To:dev <dev@beam.apache.org>
Subject:Re: [DISCUSS] Apache Beam 2.1.0 release next week ?
This sounds good to me. I have tagged some of my JIRAs with 2.1.0 as they
are known issues users have hit since the release that should just take a
minute to address. This is to pressure me, not to block the release :-)

On Wed, Jun 21, 2017 at 7:23 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi guys,
>
> As we released 2.0.0 (first stable release) last month during ApacheCon,
> and to maintain our release pace, I would like to release 2.1.0 next week.
>
> This release would include a lot of bug fixes and some new features:
>
> https://issues.apache.org/jira/projects/BEAM/versions/12340528
>
> I volunteer to be release manager for this one.
>
> Thoughts ?
>
> Thanks,
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



Re: Reduced Availability from 17.6. - 24.6

2017-06-21 Thread JingsongLee
Have a good time~
Best, JingsongLee
------------------------------------------------------------------
From:Jean-Baptiste Onofré <j...@nanthrax.net>
Time:2017 Jun 22 (Thu) 10:21
To:dev <dev@beam.apache.org>
Subject:Re: Reduced Availability from 17.6. - 24.6
Enjoy !! Well deserved !

Regards
JB

On 06/21/2017 11:37 PM, Davor Bonaci wrote:
> To tag along on the same thread, I'm out through the Independence Day
> holiday. Back on 7/5.
> 
> On Fri, Jun 16, 2017 at 10:03 PM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
> 
>> Hi,
>>
>> I’ll be on vacation next week, just in case anyone is wondering why I’m
>> not responding. :-)
>>
>> Best,
>> Aljoscha
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Fwd: [Report] Eagle - June 2017

2017-06-21 Thread JingsongLee
I'd like to participate and do some of this work.

Best, JingsongLee
------------------------------------------------------------------
From:Davor Bonaci <da...@apache.org>
Time:2017 Jun 22 (Thu) 06:10
To:dev <dev@beam.apache.org>
Subject:Fwd: [Report] Eagle - June 2017
From this month's Eagle report:
"Community is working on Alert engine based Apache Beam, led by YHD.com.
https://github.com/1haodian/eagle/tree/alertenginebeam"

This could be an opportunity for someone in the Beam project to reach out,
and help out with the cross-project integration.

Any volunteers? If so, please reach out to their dev@ list.

Thanks!

-- Forwarded message --
From: Edward Zhang <yonzhang2...@apache.org>
Date: Wed, Jun 14, 2017 at 10:30 PM
Subject: [Report] Eagle - June 2017
To: bo...@apache.org, priv...@eagle.apache.org


## Description:
 Apache Eagle is an open source analytics solution for identifying security
and performance issues instantly on big data platforms.

## Issues:
 There are no issues requiring board attention at this time

## Activity:
 1. In Beijing DTCC conference on May 13, 2017, Qingwen Zhao presented
"Apache Eagle - Analyze Big Data Platforms For Security and Performance",
http://dtcc.it168.com/jiabin.html
 2. In Shenzhen GOPS conference on April 22, 2017, Hao Chen presented
"Apache Eagle - Architecture evolvement and new features",
http://www.bagevent.com/event/gops2017-shenzhen
 3. In Shanghai OSC conference on May 13, 2017, Hao Chen presented "Apache
Eagle - Architecture evolvement and new features",
https://www.oschina.net/event/2236961

## Health report:
 Community is working on Alert engine based Apache Beam, led by YHD.com.
https://github.com/1haodian/eagle/tree/alertenginebeam
 Branch 0.5 is cut and release activity is based on this branch

## PMC changes:
 - Currently 16 PMC members.
 - Last PMC addition: Mon May 08 2017 (Deng Lingang)

## Committer base changes:
 - Currently 18 committers
 - Lin Gang Deng was added as a committer on Wed Mar 15 2017
 - Jay Sen was added as a committer on Thu Mar 16 2017
 - Last committer addition: Thu Mar 16 2017 (Jay Sen)

## Releases:
 - No release in last 3 months

## Mailing list activity:
- d...@eagle.apache.org:
- 74 subscribers (up 1 in the last 3 months):
- 198 emails sent to list (211 in previous quarter)

 - iss...@eagle.apache.org:
- 17 subscribers (up 1 in the last 3 months):
- 999 emails sent to list (1282 in previous quarter)

 - u...@eagle.apache.org:
- 50 subscribers (up 3 in the last 3 months):
- 33 emails sent to list (15 in previous quarter)


## JIRA activity:
 - 85 JIRA tickets created in the last 3 months
 - 83 JIRA tickets closed/resolved in the last 3 months



Re: [DISCUSS] Bundle in Flink Runner

2017-06-14 Thread JingsongLee
Thanks Aljoscha, 
I will add these related links to the docs.
Best, JingsongLee
--
From:Aljoscha Krettek <aljos...@apache.org>
Time:2017 Jun 14 (Wed) 20:56
To:dev <dev@beam.apache.org>; JingsongLee <lzljs3620...@aliyun.com>
Subject:Re: [DISCUSS] Bundle in Flink Runner
Hi,
Thanks for summarising everything that was discussed so far and also for coming
up with a good implementation plan! The plan looks good; it will unblock proper
bundle support in Beam, because we cannot wait for Flink to support bundles the
way we would like.
I also want to quickly highlight these two issues, because you mentioned
bundle-processing and caching for GroupByKey/ReduceFnRunner as well:
- https://issues.apache.org/jira/browse/BEAM-956: Execute ReduceFnRunner Directly in Flink Runner
- https://issues.apache.org/jira/browse/BEAM-1850: Improve interplay between PushbackSideInputRunner and GroupAlsoByWindowViaWindowSetDoFn
They are not prerequisites for the bundling work, but they are somewhat related.
Best,
Aljoscha
On 13. Jun 2017, at 08:38, JingsongLee <lzljs3620...@aliyun.com> wrote:
Hi everyone,
I'd like to start a discussion on the implementation of real bundles in the Flink Runner.
https://docs.google.com/document/d/1UzELM4nFu8SIeu-QJkbs0sv7Uzd1Ux4aXXM3cw4s7po/edit?usp=sharing
Feel free to comment/edit it.
Best, JingsongLee




[DISCUSS] Bundle in Flink Runner

2017-06-13 Thread JingsongLee
Hi everyone,
I'd like to start a discussion on the implementation of real bundles in the Flink Runner.
https://docs.google.com/document/d/1UzELM4nFu8SIeu-QJkbs0sv7Uzd1Ux4aXXM3cw4s7po/edit?usp=sharing
Feel free to comment/edit it.
Best, JingsongLee


Re: [DISCUSS] Source Watermark Metrics

2017-06-08 Thread JingsongLee
Hi @Ben Chambers, @Aljoscha Krettek, @Aviem Zur, and all,
I've written this up as a proposal found here: 
https://docs.google.com/document/d/1ykjjG97DjVQP73jGbotGRbtK38hGvFbokNEOuNO4DAo/edit?usp=sharing
Feel free to comment/edit it.
Best, JingsongLee


--
From:Ben Chambers <bchamb...@google.com.INVALID>
Time:2017 Jun 6 (Tue) 22:46
To:JingsongLee <lzljs3620...@aliyun.com>; dev <dev@beam.apache.org>
Subject:Re: [DISCUSS] Source Watermark Metrics

The existing metrics allow a user to report additional values to the
runner. For something like the watermark, which the runner already knows
about, it doesn't need to fit into the set of metrics. Since the runner
already tracks the low watermark for each operator, it can just report that
as it sees fit.

This means it shouldn't matter whether the current metric types can
express it, since they are for getting user numbers into the runner.

On Sun, Jun 4, 2017, 8:27 PM JingsongLee <lzljs3620...@aliyun.com> wrote:

> I feel reporting the current low watermark for each operator is better
> than just reporting the source watermark, after seeing the Flink 1.3 web frontend.
>
> We want the smallest watermark across all splits. But some runners, like
> FlinkRunner, don't have a way to get the global smallest watermark, and
> the metric's type (Counter, Gauge, Distribution) cannot express it.
>
> Best, JingsongLee
> --
> From:Ben Chambers <bchamb...@google.com.INVALID>
> Time:2017 Jun 2 (Fri) 21:46
> To:dev <dev@beam.apache.org>; JingsongLee <lzljs3620...@aliyun.com>
> Cc:Aviem Zur <aviem...@gmail.com>; Ben Chambers
> <bchamb...@google.com.invalid>
> Subject:Re: [DISCUSS] Source Watermark Metrics
> I think having runners report important, general properties such as the
> source watermark is great. It is much easier than requiring every source to
> expose it.
>
> I'm not sure how we would require this or do so in a general way. Each
> runner has separate code for handling the watermark as well as different
> ways information should be reported.
>
> Where would the runner do this? Where would the runner put these values?
> Maybe this is just part of the documentation about what we would like
> runners to do?
>
> On Fri, Jun 2, 2017, 3:09 AM Aljoscha Krettek <aljos...@apache.org> wrote:
>
> > Hi,
> >
> > Thanks for reviving this thread. I think having the watermark is very
> > good. Some runners, for example Dataflow and Flink, have their own internal
> > metric for the watermark, but having it cross-runner seems beneficial (if
> > maybe a bit wasteful).
> >
> > Best,
> > Aljoscha
> >
> > > On 2. Jun 2017, at 03:52, JingsongLee <lzljs3620...@aliyun.com> wrote:
> > >
> > > @Aviem Zur @Ben Chambers What do you think about the value of
> > METRIC_MAX_SPLITS?
> > >
> > >
>
> > ------------------------------------------------------------------
> > From:JingsongLee <lzljs3620...@aliyun.com>
> > Time:2017 May 11 (Thu) 16:37
> > To:dev@beam.apache.org <dev@beam.apache.org>
> > Subject:[DISCUSS] Source Watermark Metrics
> > > Hi everyone,
> > >
> > > The source watermark metrics show the consumer latency of the source.
> > > It allows the user to know the health of the job, or it can be used
> > > for monitoring and alerting.
> > > We should have the runner report the watermark metrics rather than
> > > having the source report it using metrics. This addresses the fact that
> > > even if the source has advanced to 8:00, the runner may still know about
> > > buffered elements at 7:00, and so not advance the watermark all the way to 8:00.
> > > The metrics include:
> > > 1.Source watermark (`min` amongst all splits):
> > > type = Gauge, namespace = io, name = source_watermark
> > > 2.Source watermark per split:
> > > type = Gauge, namespace = io.splits, name = .source_watermark
> > >
> > > The min source watermark amongst all splits seems difficult to implement,
> > > since some runners (like FlinkRunner) can't access all the splits to
> > > aggregate, and there is no such AggregatorMetric.
> > >
> > > So we could report the watermark per split, and users could use a `min`
> > > aggregation on this in their metrics backends. However, as was mentioned
> > > in the IO metrics proposal by several people, this could be problematic in
> > > sources with many splits.
> > >
> > > So we do a check when reporting metrics to solve the problem of too many
> > > splits.
> > > {code}
> > > if (splitsNum <= METRIC_MAX_SPLITS) {
> > >   // set the sourceWatermarkOfSplit
> > > }
> > > {code}
> > >
> > > So I'd like to open a discussion on the implementation of source watermark
> > > metrics, and specifically how many splits is too many (the value of
> > > METRIC_MAX_SPLITS).
> > >
> > > JIRA:
> > > IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
> > > Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)
> > >
> >
> >
>
>



Re: [DISCUSS] Source Watermark Metrics

2017-06-04 Thread JingsongLee
I feel reporting the current low watermark for each operator is better than
just reporting the source watermark, after seeing the Flink 1.3 web frontend.

We want the smallest watermark across all splits. But some runners, like
FlinkRunner, don't have a way to get the global smallest watermark, and the
metric's type (Counter, Gauge, Distribution) cannot express it.

Best, JingsongLee
--
From:Ben Chambers <bchamb...@google.com.INVALID>
Time:2017 Jun 2 (Fri) 21:46
To:dev <dev@beam.apache.org>; JingsongLee <lzljs3620...@aliyun.com>
Cc:Aviem Zur <aviem...@gmail.com>; Ben Chambers <bchamb...@google.com.invalid>
Subject:Re: [DISCUSS] Source Watermark Metrics
I think having runners report important, general properties such as the
source watermark is great. It is much easier than requiring every source to
expose it.

I'm not sure how we would require this or do so in a general way. Each
runner has separate code for handling the watermark as well as different
ways information should be reported.

Where would the runner do this? Where would the runner put these values?
Maybe this is just part of the documentation about what we would like
runners to do?

On Fri, Jun 2, 2017, 3:09 AM Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi,
>
> Thanks for reviving this thread. I think having the watermark is very
> good. Some runners, for example Dataflow and Flink have their own internal
> metric for the watermark but having it cross-runner seems beneficial (if
> maybe a bit wasteful).
>
> Best,
> Aljoscha
>
> > On 2. Jun 2017, at 03:52, JingsongLee <lzljs3620...@aliyun.com> wrote:
> >
> > @Aviem Zur @Ben Chambers What do you think about the value of
> METRIC_MAX_SPLITS?
> >
> >
> ------------------------------------------------------------------
> From:JingsongLee <lzljs3620...@aliyun.com>
> Time:2017 May 11 (Thu) 16:37
> To:dev@beam.apache.org <dev@beam.apache.org>
> Subject:[DISCUSS] Source Watermark Metrics
> > Hi everyone,
> >
> > The source watermark metrics show the consumer latency of the source.
> > It allows the user to know the health of the job, or it can be used
> > for monitoring and alerting.
> > We should have the runner report the watermark metrics rather than
> > having the source report it using metrics. This addresses the fact that
> > even if the source has advanced to 8:00, the runner may still know about
> > buffered elements at 7:00, and so not advance the watermark all the way to 8:00.
> > The metrics include:
> > 1.Source watermark (`min` amongst all splits):
> > type = Gauge, namespace = io, name = source_watermark
> > 2.Source watermark per split:
> > type = Gauge, namespace = io.splits, name = .source_watermark
> >
> > The min source watermark amongst all splits seems difficult to implement,
> > since some runners (like FlinkRunner) can't access all the splits to
> > aggregate, and there is no such AggregatorMetric.
> >
> > So we could report the watermark per split, and users could use a `min`
> > aggregation on this in their metrics backends. However, as was mentioned
> > in the IO metrics proposal by several people, this could be problematic in
> > sources with many splits.
> >
> > So we do a check when reporting metrics to solve the problem of too many
> > splits.
> > {code}
> > if (splitsNum <= METRIC_MAX_SPLITS) {
> >   // set the sourceWatermarkOfSplit
> > }
> > {code}
> >
> > So I'd like to open a discussion on the implementation of source watermark
> > metrics, and specifically how many splits is too many (the value of
> > METRIC_MAX_SPLITS).
> >
> > JIRA:
> > IO metrics (https://issues.apache.org/jira/browse/BEAM-1919)
> > Source watermark (https://issues.apache.org/jira/browse/BEAM-1941)
> >
>
>
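As a rough illustration of the reporting rule discussed in this thread (per-split gauges capped by METRIC_MAX_SPLITS, with the overall source watermark derived as the `min` across splits), a plain-Java sketch might look like the following. `WatermarkReporter`, the map-based "gauges", and the cap value of 100 are all hypothetical stand-ins, not Beam's metrics API or a decided value:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the reporting rule from the thread: emit a per-split watermark
 * gauge only when the split count is small enough, and derive the overall
 * source watermark as the min across splits. The returned map stands in for
 * real gauges; this is not Beam's metrics API.
 */
class WatermarkReporter {
  // Placeholder value; the thread is asking what this should actually be.
  static final int METRIC_MAX_SPLITS = 100;

  /** Returns the gauge values that would be reported for the given per-split watermarks. */
  static Map<String, Long> report(Map<Integer, Long> watermarkBySplit) {
    Map<String, Long> gauges = new HashMap<>();
    // Per-split gauges are skipped entirely when there are too many splits.
    if (watermarkBySplit.size() <= METRIC_MAX_SPLITS) {
      for (Map.Entry<Integer, Long> e : watermarkBySplit.entrySet()) {
        gauges.put("io.splits." + e.getKey() + ".source_watermark", e.getValue());
      }
    }
    // The overall source watermark is the `min` amongst all splits.
    watermarkBySplit.values().stream()
        .min(Long::compare)
        .ifPresent(min -> gauges.put("io.source_watermark", min));
    return gauges;
  }
}
```

A metrics backend would apply the same `min` aggregation over the per-split gauges; the sketch just makes the capping behavior concrete.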



Re: [DISCUSS] Remove TimerInternals.deleteTimer(*) and Timer.cancel()

2017-05-08 Thread JingsongLee
+1 to remove this; I have not encountered such a strong case.
Best, JingsongLee

--
From:Kenneth Knowles <k...@google.com.INVALID>
Time:2017 May 9 (Tue) 05:45
To:dev <dev@beam.apache.org>
Subject:Re: [DISCUSS] Remove TimerInternals.deleteTimer(*) and Timer.cancel()
Interesting!

I believe the only thing we need to change to remove it for FSR is
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/state/Timer.java#L60

Here is how I might summarize the possibilities, at the risk of having
something quite wrong:

 - Cancel as-is: some runners must manage a tombstone and/or timestamp in
state or may choose to do so for performance.

 - Cancel requires a timestamp on the backend: runners must track the
timestamp in state in order to cancel. Some runners may already need to
track this for other reasons. For special cases like the end of the window
or GC time it can be guessed, but those aren't user timers.

 - Cancel requires a timestamp from the user: Really weird. IMO implies a
timer can be set multiple times and each one is independent. This is
actually an increase in capability in perhaps an interesting way, but I
thought it was too confusing. Also might have a wacky spec around the same
timer set multiple times for the same timestamp (probably fine/idempotent)

Technically, timers are marked `@Experimental`. But, given the interest in
state and timers, making changes here will be very hard on users.

Unless someone objects with a strong case, I am comfortable removing this
from userland and potentially adding it later if there is demand.

Kenn


On Mon, May 8, 2017 at 3:26 AM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> I wanted to bring this up before the First Stable release and see what
> other people think. The methods I’m talking about are:
>
> void deleteTimer(StateNamespace namespace, String timerId, TimeDomain
> timeDomain);
>
> @Deprecated
> void deleteTimer(StateNamespace namespace, String timerId);
>
> @Deprecated
> void deleteTimer(TimerData timerKey);
>
> The last two are slated for deletion. Notice that the first one doesn’t
> take the timestamp of the timer to delete, which implies that there is only
> one active timer per namespace/timerId/timeDomain.
>
> These are my motivations for removal:
>  - Timer removal is difficult and has varying levels of support on
> different Runners and varying levels of cost.
>  - Removing the interface now and adding it back later is easy. Having it
> in the FSR and later removing it is quite hard.
>
> I can only speak about the internal Flink implementation where Timers are
> on a Heap (Java Priority Queue). Here removal is quite expensive, see, for
> example this ML thread:
> https://lists.apache.org/thread.html/ac4d8e36360779a9b5047cf21303222980015720aac478e8cf632216@%3Cuser.flink.apache.org%3E
> Especially this part:
>
> I thought I would drop my opinion here maybe it is relevant.
>
> We have used the Flink internal timer implementation in many of our
> production applications, this supports the Timer deletion but the deletion
> actually turned out to be a huge performance bottleneck because of the bad
> deletion performance of the Priority queue.
>
> In many of our cases deletion could have been avoided by some more clever
> registration/firing logic and we also ended up completely avoiding deletion
> and instead using "tombstone" markers by setting a flag in the state which
> timers not to fire when they actually trigger.
>
> Gyula
>
> Note that in Flink it’s not possible to delete a timer if you don’t know
> the timestamp. Other systems might store timers in more elaborate data
> structures (hierarchical timer wheels come to mind) where it’s also hard to
> delete a timer without knowing the timestamp.
>
> If we decide to keep timer removal in the user API there’s the possibility
> to simulate timer removal by keeping the timestamp of the currently active
> timer for a given timerID and checking a firing timer against that.
>
> Best,
> Aljoscha
>
>
>
>
>
>
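The "tombstone" workaround Gyula and Aljoscha describe can be sketched without any framework: keep the timestamp of the only timer considered active in state, and ignore any firing whose timestamp does not match. All names below are illustrative, not Beam or Flink API:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of tombstone-based timer cancellation: instead of deleting a timer
 * from the runner's priority queue (which can be expensive), remember the
 * timestamp of the only timer that is considered active, and ignore any
 * firing whose timestamp does not match.
 */
class TombstoneTimers {
  // timerId -> timestamp of the currently active timer (kept in user state).
  private final Map<String, Long> activeTimestamp = new HashMap<>();

  /** "Setting" a timer overwrites the active timestamp; older queued timers become tombstones. */
  void setTimer(String timerId, long timestamp) {
    activeTimestamp.put(timerId, timestamp);
  }

  /** "Cancelling" clears the active timestamp; the queued timer still fires but is ignored. */
  void cancelTimer(String timerId) {
    activeTimestamp.remove(timerId);
  }

  /** Called when the underlying engine fires a queued timer; returns true if user logic should run. */
  boolean onTimerFired(String timerId, long firedTimestamp) {
    Long active = activeTimestamp.get(timerId);
    return active != null && active == firedTimestamp;
  }
}
```

This is exactly the simulation mentioned at the end of the thread: cancellation becomes a cheap state update, at the cost of letting stale timers fire and be filtered out.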



Re: Congratulations Davor!

2017-05-04 Thread JingsongLee
Congratulations!
--
From:Jesse Anderson 
Time:2017 May 4 (Thu) 21:36
To:dev 
Subject:Re: Congratulations Davor!
Congrats!

On Thu, May 4, 2017, 6:20 AM Aljoscha Krettek  wrote:

> Congrats! :-)
> > On 4. May 2017, at 14:34, Kenneth Knowles 
> wrote:
> >
> > Awesome!
> >
> > On Thu, May 4, 2017 at 1:19 AM, Ted Yu  wrote:
> >
> >> Congratulations, Davor!
> >>
> >> On Thu, May 4, 2017 at 12:45 AM, Aviem Zur  wrote:
> >>
> >>> Congrats Davor! :)
> >>>
> >>> On Thu, May 4, 2017 at 10:42 AM Jean-Baptiste Onofré 
> >>> wrote:
> >>>
>  Congrats ! Well deserved ;)
> 
>  Regards
>  JB
> 
>  On 05/04/2017 09:30 AM, Jason Kuster wrote:
> > Hi all,
> >
> > The ASF has just published a blog post[1] welcoming new members of
> >> the
> > Apache Software Foundation, and our own Davor Bonaci is among them!
> > Congratulations and thank you to Davor for all of your work for the
> >>> Beam
> > community, and the ASF at large. Well deserved.
> >
> > Best,
> >
> > Jason
> >
> > [1] https://blogs.apache.org/foundation/entry/the-apache-software-foundation-welcomes
> >
> > P.S. I dug through the list to make sure I wasn't missing any other
> >>> Beam
> > community members; if I have, my sincerest apologies and please
> >>> recognize
> > them on this or a new thread.
> >
> 
>  --
>  Jean-Baptiste Onofré
>  jbono...@apache.org
>  http://blog.nanthrax.net
>  Talend - http://www.talend.com
> 
> >>>
> >>
>
> --
Thanks,

Jesse



Re: Community hackathon

2017-04-24 Thread JingsongLee
+1
Best,
JingsongLee
------------------------------------------------------------------
From:Ted Yu <yuzhih...@gmail.com>
Time:2017 Apr 24 (Mon) 17:29
To:dev <dev@beam.apache.org>
Subject:Re: Community hackathon
+1

> On Apr 24, 2017, at 12:51 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> 
> That's a wonderful idea !
> 
> I think the easiest way to organize this event is using the Slack channels to 
> discuss, help each other, and sync together.
> 
> Regards
> JB
> 
>> On 04/24/2017 09:48 AM, Davor Bonaci wrote:
>> We've been working as a community towards the first stable release for a
>> while now, and I think we made a ton of progress across the board over the
>> last few weeks.
>> 
>> We could try to organize a community-wide hackathon to identify and fix
>> those last few issues, as well as to get a better sense of the overall
>> project quality as it stands right now.
>> 
>> This could be a self-organized event, and coordinated via the Slack
>> channel. For example, we (as a community and participants) can try out the
>> project in various ways -- quickstart, examples, different runners,
>> different platforms -- immediately fixing issues as we run into them. It
>> could last, say, 24 hours, with people from different time zones
>> participating at the time of their choosing.
>> 
>> Thoughts?
>> 
>> Davor
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Join to external table

2017-04-13 Thread JingsongLee
Hi all,


I've repeatedly seen the following pattern.
Consider a SQL query (joining a stream to a table, from Calcite):
SELECT STREAM o.rowtime, o.productId, o.orderId, o.units,
  p.name, p.unitPrice
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId;
A stream-to-table join is straightforward if the contents of the table are not
changing (or are slowly changing). This query enriches a stream of orders with
each product’s list price.

This table is mostly in HBase, MySQL, or Redis. Most of our users think that
we should use SideInputs to implement it. But there are some difficulties here:
1. Maybe this table is very large! AFAIK, SideInputs will load all data into
memory. We cannot load it all, but we can do some caching work.
2. This table may be updated periodically, as mentioned in
https://issues.apache.org/jira/browse/BEAM-1197
3. Sometimes users want to update this table, in some scenarios that don't
need high accuracy. (The reads and writes to the external storage can't
guarantee Exactly-Once.)

So we developed a component called DimState (maybe the name is not right).
It uses a cache (a LoadingCache at the moment) or loads everything; both have a
Time-To-Live mechanism. An abstract interface is called ExternalState, with
implementations HBaseState, JDBCState, and RedisState. It is accessed by key
and namespace, and provides bulk access to the external table for performance.
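For illustration, the caching side of this design can be sketched in plain Java (no Beam or Guava dependency). `TtlCache` and its `loader` hook are hypothetical names standing in for the LoadingCache-backed DimState described above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Minimal sketch of the caching idea behind "DimState": look up keys in an
 * external KV store through a loader function, and keep entries for a
 * time-to-live. The real component uses a Guava LoadingCache and also
 * supports bulk loading; this sketch shows only the TTL lookup path.
 */
class TtlCache<K, V> {
  private static final class Entry<V> {
    final V value;
    final long loadedAtMillis;
    Entry(V value, long loadedAtMillis) {
      this.value = value;
      this.loadedAtMillis = loadedAtMillis;
    }
  }

  private final Map<K, Entry<V>> cache = new HashMap<>();
  private final Function<K, V> loader; // stands in for the HBase/JDBC/Redis read
  private final long ttlMillis;

  TtlCache(Function<K, V> loader, long ttlMillis) {
    this.loader = loader;
    this.ttlMillis = ttlMillis;
  }

  /** Returns the cached value, reloading from the external store on a miss or expired entry. */
  V get(K key) {
    long now = System.currentTimeMillis();
    Entry<V> e = cache.get(key);
    if (e == null || now - e.loadedAtMillis > ttlMillis) {
      V v = loader.apply(key); // cache miss or TTL expired: hit the external store
      cache.put(key, new Entry<>(v, now));
      return v;
    }
    return e.value;
  }
}
```

In the enrichment join above, a DoFn would consult such a cache per order key, so only hot product rows are held in memory rather than the whole table.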

Is there a better way to implement it? Can we add some abstractions to the
Beam model?

What do you think?

Best,
JingsongLee


Re: Renaming SideOutput

2017-04-11 Thread JingsongLee
strong +1
Best,
JingsongLee
------------------------------------------------------------------
From:Tang Jijun (Shanghai_Tech Dept_Data Platform) <tangji...@yhd.com>
Time:2017 Apr 12 (Wed) 10:39
To:dev@beam.apache.org <dev@beam.apache.org>
Subject:Re: Renaming SideOutput
+1, much clearer


-----Original Message-----
From: Ankur Chauhan [mailto:an...@malloc64.com] 
Sent: 2017-04-12 10:36
To: dev@beam.apache.org
Subject: Re: Renaming SideOutput

+1 this is pretty much the topmost things that I found odd when starting with 
the beam model. It would definitely be more intuitive to have a consistent 
name. 

Sent from my iPhone

> On Apr 11, 2017, at 18:29, Aljoscha Krettek <aljos...@apache.org> wrote:
> 
> +1
> 
>> On Wed, Apr 12, 2017, at 02:34, Thomas Groh wrote:
>> I think that's a good idea. I would call the outputs of a ParDo the 
>> "Main Output" and "Additional Outputs" - it seems like an easy way to 
>> make it clear that there's one output that is always expected, and 
>> there may be more.
>> 
>> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw < 
>> rober...@google.com.invalid> wrote:
>> 
>>> We should do some renaming in Python too. Right now we have 
>>> SideOutputValue which I'd propose naming TaggedOutput or something 
>>> like that.
>>> 
>>> Should the docs change too?
>>> https://beam.apache.org/documentation/programming-guide/#transforms-
>>> sideio
>>> 
>>> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles 
>>> <k...@google.com.invalid>
>>> wrote:
>>>> +1 ditto about sideInput and sideOutput not actually being related
>>>> 
>>>> On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw < 
>>>> rober...@google.com.invalid> wrote:
>>>> 
>>>>> +1, I think this is a lot clearer.
>>>>> 
>>>>> On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk 
>>>>> <s...@google.com.invalid>
>>>>> wrote:
>>>>>> strong +1 for changing the name away from sideOutput - the fact 
>>>>>> that sideInput and sideOutput are not really related was 
>>>>>> definitely a
>>> source
>>>>> of
>>>>>> confusion for me when learning beam.
>>>>>> 
>>>>>> S
>>>>>> 
>>>>>> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh 
>>>>>> <tg...@google.com.invalid
>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hey everyone:
>>>>>>> 
>>>>>>> I'd like to rename DoFn.Context#sideOutput to #output (in the 
>>>>>>> Java
>>> SDK).
>>>>>>> 
>>>>>>> Having two methods, both named output, one which takes the "main
>>> output
>>>>>>> type" and one that takes a tag to specify the type more clearly 
>>>>>>> communicates the actual behavior - sideOutput isn't a "special" 
>>>>>>> way
>>> to
>>>>>>> output, it's the same as output(T), just to a specified PCollection.
>>>>> This
>>>>>>> will help pipeline authors understand the actual behavior of
>>> outputting
>>>>> to
>>>>>>> a tag, and detangle it from "sideInput", which is a special way 
>>>>>>> to
>>>>> receive
>>>>>>> input. Giving them the same name means that it's not even 
>>>>>>> strange to
>>>>> call
>>>>>>> output and provide the main output type, which is what we want -
>>> it's a
>>>>>>> more specific way to output, but does not have different
>>> restrictions or
>>>>>>> capabilities.
>>>>>>> 
>>>>>>> This is also a pretty small change within the SDK - it touches 
>>>>>>> about
>>> 20
>>>>>>> files, and the changes are pretty automatic.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Thomas
>>>>>>> 
>>>>> 
>>> 


Re: StatefulDoFnRunner

2017-04-06 Thread JingsongLee
There is no suitable way to get the CurrentKey.
 I think using StepContext.timerInternals() and StepContext.stateInternals() is 
better.
best,
JingsongLee
--------------------------------------------------
From: Thomas Weise <t...@apache.org>
Time: 2017 Apr 6 (Thu) 12:45
To: dev <dev@beam.apache.org>
Subject: StatefulDoFnRunner
Hi,

While working on the support for splittable DoFn, I see a few cases where
the DoFn runner classes slightly complicate reuse across elements (or make
it a bit awkward to implement for the runner).

StateInternalsStateCleaner and TimeInternalsCleanupTimer take xxxInternals
instances. But since those are key-specific, the runner writer has to
perform acrobatics to flip the key on these internal instances on a
per-element basis (to avoid having to recreate the other objects that refer
to them).

Would it be possible to instead use a factory and retrieve the internals
by key? The runner then has the choice to optimize as needed. In general, I
think it would be nice if the processing-context-related classes were
designed so that they promote reuse of object instances across elements and
bundles and minimize object creation on a per-key basis.
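A hypothetical shape for such a factory — stand-in types for illustration, not the real Beam interfaces — might look like this, with the caching policy left entirely to the runner:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of the proposed factory shape: the runner asks for the internals of
// a given key instead of mutating the key on one shared instance.
// `Object` stands in for StateInternals/TimerInternals.
interface StateInternalsFactory<K> {
    Object stateInternalsForKey(K key);
}

// One possible runner-side implementation: cache instances per key so they
// are reused across elements and bundles, minimizing per-key allocation.
class CachingStateInternalsFactory<K> implements StateInternalsFactory<K> {
    private final Function<K, Object> create; // runner-specific construction
    private final Map<K, Object> cache = new HashMap<>();

    CachingStateInternalsFactory(Function<K, Object> create) {
        this.create = create;
    }

    @Override
    public Object stateInternalsForKey(K key) {
        return cache.computeIfAbsent(key, create);
    }
}
```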

Thanks,
Thomas



Re: Update of Pei in Alibaba

2017-04-02 Thread JingsongLee
Hi Ismaël,
We have a streaming computing platform at Alibaba.
Galaxy is an internal system, so you can't find much information about it on Google.
It is becoming more like StreamScope (you can search for the paper). 
Non-global checkpointing makes failure recovery quick and makes streaming 
applications easier to develop and debug.

But as far as I know, each engine has its own tradeoffs and its own good use cases,
so we also developed an open-source platform, which has Spark, Flink, and so on.
We hope we can use Apache Beam to unify the user programming model. This keeps 
users' learning costs low and application migration costs low 
(not only from batch to streaming, but also for migration from one 
streaming engine to another).


--------------------------------------------------
From: Ismaël Mejía <ieme...@gmail.com>
Time: 2017 Apr 2 (Sun) 03:18
To: dev <dev@beam.apache.org>
Subject: Re: Update of Pei in Alibaba
Excellent news,

Pei, it would be great to have a new runner. I am curious about how
different the implementations of Storm are among them, considering that
there are already three 'versions': Storm, JStorm and Heron. I wonder
if one runner could translate to an API that would cover all of them (of
course maybe I am super naive, I really don't know much about JStorm or
Heron and how much they differ from the original Storm).

Jingsong, I am super curious about this Galaxy project, is there any
public information about it? Is this related to the previous Blink
Alibaba project? I already looked a bit, but searching "Alibaba galaxy"
is a recipe for a myriad of telephone sellers :)

Nice to see that you are going to keep contributing to the project Pei.

Regards,
Ismaël



On Sat, Apr 1, 2017 at 4:59 PM, Tibor Kiss <tibor.k...@gmail.com> wrote:
> Exciting times, looking forward to try it out!
>
> I shall mention that Taylor Goetz also started creating a BEAM runner using
> Storm.
> His work is available in the storm repo:
> https://github.com/apache/storm/commits/beam-runner
> Maybe it's worth while to take a peek and see if something is reusable from
> there.
>
> - Tibor
>
> On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee <lzljs3620...@aliyun.com> wrote:
>
>> Wow, very glad to see JStorm also started building BeamRunner.
>> I am working in Galaxy (Another streaming process engine in Alibaba).
>> I hope that we can work together to promote the use of Apache Beam
>> in Alibaba and China.
>>
>> best,
>> JingsongLee
>> --------------------------------------------------
>> From: Pei HE <pei...@gmail.com>
>> Time: 2017 Apr 1 (Sat) 09:24
>> To: dev <dev@beam.apache.org>
>> Subject: Update of Pei in Alibaba
>> Hi all,
>> In February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
>> And, I want to give an update of things in here.
>>
>> A colleague and I have been working on JStorm
>> <https://github.com/alibaba/jstorm> runner. We have a prototype that works
>> with WordCount and PAssert. (I am going to start a separate email thread
>> about how to get it reviewed and merged in Apache Beam.)
>> We also have Spark clusters, and are very interested in
>> using Spark runner.
>>
>> Last Saturday, I went to China Hadoop Summit, and gave a talk about the
>> Apache Beam model. While many companies gave talks about their in-house
>> solutions for unified batch and unified SQL, there is also lots of
>> interest in and enthusiasm for Beam.
>>
>> Looking forward to chat more.
>> --
>> Pei
>>
>>
>
>
> --
> Kiss Tibor
>
> +36 70 275 9863
> tibor.k...@gmail.com


Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-28 Thread JingsongLee
Hi Aljoscha,
I would like to work on the Flink runner with you.
Best,
JingsongLee

--------------------------------------------------
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Time: 2017 Mar 28 (Tue) 14:04
To: dev <dev@beam.apache.org>
Subject: Re: Call for help: let's add Splittable DoFn to 
Spark, Flink and Apex runners
Hi Aljoscha,

do you need some help on this ?

Regards
JB

On 03/28/2017 08:00 AM, Aljoscha Krettek wrote:
> Hi,
> sorry for being so slow but I’m currently traveling.
>
> The Flink code works but I think it could benefit from some refactoring
> to make the code nice and maintainable.
>
> Best,
> Aljoscha
>
> On Tue, Mar 28, 2017, at 07:40, Jean-Baptiste Onofré wrote:
>> I add myself on the Spark runner.
>>
>> Regards
>> JB
>>
>> On 03/27/2017 08:18 PM, Eugene Kirpichov wrote:
>>> Hi all,
>>>
>>> Let's continue the ~bi-weekly sync-ups about state of SDF support in
>>> Spark/Flink/Apex runners.
>>>
>>> Spark:
>>> Amit, Aviem, Ismaël - when would be a good time for you; does same time
>>> work (8am PST this Friday)? Who else would like to join?
>>>
>>> Flink:
>>> I pinged the PR, but - Aljoscha, do you think it's worth discussing
>>> anything there over a videocall?
>>>
>>> Apex:
>>> Thomas - how about same time next Monday? (9:30am PST) Who else would like
>>> to join?
>>>
>>> On Mon, Mar 20, 2017 at 9:59 AM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Meeting notes:
>>>> Me and Thomas had a video call and we pretty much walked through the
>>>> implementation of SDF in the runner-agnostic part and in the direct runner.
>>>> Flink and Apex are pretty similar, so likely
>>>> https://github.com/apache/beam/pull/2235 (the Flink PR) will give a very
>>>> good guideline as to how to do this in Apex.
>>>> Will talk again in ~2 weeks; and will involve +David Yan
>>>> <david...@google.com> who is also on Apex and currently conveniently
>>>> works on the Google Dataflow team and, from in-person conversation, was
>>>> interested in being involved :)
>>>>
>>>> On Mon, Mar 20, 2017 at 7:34 AM Eugene Kirpichov <kirpic...@google.com>
>>>> wrote:
>>>>
>>>> Thomas - yes, 9:30 works, shall we do that?
>>>>
>>>> JB - excellent! You can start experimenting already, using direct runner!
>>>>
>>>> On Mon, Mar 20, 2017, 2:26 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>> Hi Eugene,
>>>>
>>>> Thanks for the meeting notes !
>>>>
>>>> I will be in the next call and Ismaël also provided to me some updates.
>>>>
>>>> I will sync with Amit on Spark runner and start to experiment and test SDF
>>>> on
>>>> the JMS IO.
>>>>
>>>> Thanks !
>>>> Regards
>>>> JB
>>>>
>>>> On 03/17/2017 04:36 PM, Eugene Kirpichov wrote:
>>>>> Meeting notes from today's call with Amit, Aviem and Ismaël:
>>>>>
>>>>> Spark has 2 types of stateful operators; a cheap one intended for
>>>> updating
>>>>> elements (works with state but not with timers) and an expensive one.
>>>> I.e.
>>>>> there's no efficient direct counterpart to Beam's keyed state model. In
>>>>> implementation of Beam State & Timers API, Spark runner will use the
>>>>> cheaper one for state and the expensive one for timers. So, for SDF,
>>>> which
>>>>> in the runner-agnostic SplittableParDo expansion needs both state and
>>>>> timers, we'll need the expensive one - but this should be fine since with
>>>>> SDF the bottleneck should be in the ProcessElement call itself, not in
>>>>> splitting/scheduling it.
>>>>>
>>>>> For Spark batch runner, implementing SDF might be still simpler: runner
>>>>> will just not request any checkpointing. Hard parts about SDF/batch are
>>>>> dynamic rebalancing and size estimation APIs - they will be refined this
>>>>> quarter, but it's ok to initially not have them.
>>>>>
>>>>> Spark runner might use a different expansion of SDF not involving
>>>>> KeyedWorkItem's (i.e. not overriding the GBKIntoKeyedWorkItems
>>>> transform),
>>>>> though still strivi

Re: [ANNOUNCEMENT] New committers, March 2017 edition!

2017-03-18 Thread JingsongLee
Congratulations to all!

--------------------------------------------------
From: Stas Levin
Date: 2017-03-18 23:09:55
To: dev@beam.apache.org
Subject: Re: [ANNOUNCEMENT] New committers, March 2017 edition!

Congrats to the new committers!

On Sat, Mar 18, 2017 at 3:44 PM Aviem Zur  wrote:

Thanks all! Very excited to join.
Congratulations to other new committers!

On Sat, Mar 18, 2017 at 2:17 AM Thomas Weise  wrote:

> Congrats!
>
>
> On Fri, Mar 17, 2017 at 4:28 PM, Chamikara Jayalath 
> wrote:
>
> > Thanks all. Congrats to other new committers !!
> >
> > I'm very excited to join.
> >
> > - Cham
> >
> > On Fri, Mar 17, 2017 at 3:02 PM Mark Liu 
> > wrote:
> >
> > > Congrats to all of them!
> > >
> > > On Fri, Mar 17, 2017 at 2:24 PM, Neelesh Salian <
> > neeleshssal...@gmail.com>
> > > wrote:
> > >
> > > > Congratulations!
> > > >
> > > > On Fri, Mar 17, 2017 at 2:16 PM, Kenneth Knowles
> >  > > >
> > > > wrote:
> > > >
> > > > > Congrats all!
> > > > >
> > > > > On Fri, Mar 17, 2017 at 2:13 PM, Davor Bonaci 
> > > wrote:
> > > > >
> > > > > > Please join me and the rest of Beam PMC in welcoming the
> following
> > > > > > contributors as our newest committers. They have significantly
> > > > > contributed
> > > > > > to the project in different ways, and we look forward to many
> more
> > > > > > contributions in the future.
> > > > > >
> > > > > > * Chamikara Jayalath
> > > > > > Chamikara has been contributing to Beam since inception, and
> > > previously
> > > > > to
> > > > > > Google Cloud Dataflow, accumulating a total of 51 commits (8,301
> > ++ /
> > > > > 3,892
> > > > > > --) since February 2016 [1]. He contributed broadly to the
> project,
> > > but
> > > > > > most significantly to the Python SDK, building the IO framework
> in
> > > this
> > > > > SDK
> > > > > > [2], [3].
> > > > > >
> > > > > > * Eugene Kirpichov
> > > > > > Eugene has been contributing to Beam since inception, and
> > previously
> > > to
> > > > > > Google Cloud Dataflow, accumulating a total of 95 commits
(22,122
> > ++
> > > /
> > > > > > 18,407 --) since February 2016 [1]. In recent months, he’s been
> > > driving
> > > > > the
> > > > > > Splittable DoFn effort [4]. A true expert on IO subsystem,
Eugene
> > has
> > > > > > reviewed nearly every IO contributed to Beam. Finally, Eugene
> > > > contributed
> > > > > > the Beam Style Guide, and is championing it across the project.
> > > > > >
> > > > > > * Ismaël Mejia
> > > > > > Ismaël has been contributing to Beam since mid-2016,
> accumulating a
> > > > total
> > > > > > of 35 commits (3,137 ++ / 1,328 --) [1]. He authored the HBaseIO
> > > > > connector,
> > > > > > helped on the Spark runner, and contributed in other areas as
> well,
> > > > > > including cross-project collaboration with Apache Zeppelin.
> Ismaël
> > > > > reported
> > > > > > 24 Jira issues.
> > > > > >
> > > > > > * Aviem Zur
> > > > > > Aviem has been contributing to Beam since early fall,
> accumulating
> > a
> > > > > total
> > > > > > of 49 commits (6,471 ++ / 3,185 --) [1]. He reported 43 Jira
> > issues,
> > > > and
> > > > > > resolved ~30 issues. Aviem improved the stability of the Spark
> > > runner a
> > > > > > lot, and introduced support for metrics. Finally, Aviem is
> > > championing
> > > > > > dependency management across the project.
> > > > > >
> > > > > > Congratulations to all four! Welcome!
> > > > > >
> > > > > > Davor
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/apache/beam/graphs/contributors?from=2016-02-01&to=2017-03-17&type=c
> > > > > > [2]
> > > > > > https://github.com/apache/beam/blob/v0.6.0/sdks/python/
> > > > > > apache_beam/io/iobase.py#L70
> > > > > > [3]
> > > > > > https://github.com/apache/beam/blob/v0.6.0/sdks/python/
> > > > > > apache_beam/io/iobase.py#L561
> > > > > > [4] https://s.apache.org/splittable-do-fn
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Neelesh S. Salian
> > > >
> > >
> >
>


Re: How to implement Timer in runner

2017-01-27 Thread JingsongLee
Thanks for the reply. Maybe we need an external priority queue. Happy Chinese 
New Year!

--------------------------------------------------
From: Aljoscha Krettek
Sent: 2017-01-25 (Wed) 18:38
To: dev; lzljs3620320; Kenneth Knowles
Subject: Re: How to implement Timer in runner
Hi Jingsong,

You're right, it is indeed somewhat tricky to find a good data structure for 
out-of-core timers. That's why we have them in memory in Flink for now, and 
that's also why I'm afraid I don't have any good advice for you right now. 
We're aware of the problem in Flink but we're not yet working on a concrete 
solution.

Cheers,
Aljoscha
On Tue, 24 Jan 2017 at 21:42 Dan Halperin  wrote:
Hi Jingsong,

Sorry for the delayed response; this email ended up being misclassified by
my mail server and I missed it. Maybe Kenn or Aljoscha has suggestions on
how runners can best implement timers?

Dan

On Thu, Jan 19, 2017 at 9:55 PM, lzljs3620320 
wrote:

> Hi there,
> I'm working on the Beam integration for an internal system at Alibaba. Now
> most of the runners keep timers in memory, such as Flink, Apex, etc. (I do not
> know the implementation of Google Dataflow.) But in our scenario, unbounded data
> has a large number of keys, which will lead to OOM with timers in memory. So
> we want to store timers in state (RocksDB on disk). The problem is how to
> extract fired event-time timers when refreshing the input
> watermark. Do we have to scan all keys and timers (a timer is currently composed
> of key, id, namespace, timestamp, domain)? Is there a better
> implementation? I'm wondering if you could give me some advice on how to implement
> timers in state efficiently. Thank you!
> Best,
> Jingsong Lee
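One common answer (my assumption, not code from any runner) is to keep a secondary index of timers sorted by (timestamp, key, id), so that advancing the watermark fires a prefix of that order via a range scan instead of a scan over all keys. In RocksDB, the same idea becomes a key layout with big-endian timestamp bytes first, making fired timers a contiguous key range. An in-memory model of the idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Timers ordered by (timestamp, key, id): firing on watermark advance is a
// prefix pop, not a scan over all keys. A RocksDB-backed version would
// encode this ordering into the byte layout of the state key.
class TimerQueue {
    record Timer(long timestamp, String key, String id) implements Comparable<Timer> {
        @Override
        public int compareTo(Timer o) {
            int c = Long.compare(timestamp, o.timestamp);
            if (c != 0) return c;
            c = key.compareTo(o.key);
            return c != 0 ? c : id.compareTo(o.id);
        }
    }

    private final TreeSet<Timer> byTime = new TreeSet<>();

    void set(Timer t) {
        byTime.add(t);
    }

    // Pop every timer whose timestamp is <= the new input watermark,
    // in firing order.
    List<Timer> advanceWatermark(long watermark) {
        List<Timer> fired = new ArrayList<>();
        while (!byTime.isEmpty() && byTime.first().timestamp() <= watermark) {
            fired.add(byTime.pollFirst());
        }
        return fired;
    }
}
```

Deleting a timer needs the full (timestamp, key, id) triple under this layout, so runners typically also keep a per-key map from timer id to timestamp.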