Re: BeamSQL status and merge to master

2017-07-21 Thread Tyler Akidau
There are still open items in the merge to master doc. We're close to being
ready, but let's please address those first.

-Tyler


On Thu, Jul 20, 2017 at 9:34 PM Mingmin Xu  wrote:

> Quick update:
>
> The final PR[1] is open for review now. Please leave your comments or
> create a sub-task in [2] for any questions.
>
> [1]. https://github.com/apache/beam/pull/3606
> [2]. https://issues.apache.org/jira/browse/BEAM-2651
>
>
> On Wed, Jul 5, 2017 at 3:34 PM, Jesse Anderson 
> wrote:
>
> > So excited to start using this!
> >
> > On Wed, Jul 5, 2017, 3:34 PM Mingmin Xu  wrote:
> >
> > > Thanks for everybody's effort; we're very close to finishing the
> > > existing tasks. Here's a status update of the SQL DSL. Feel free to
> > > have a try and share any comments:
> > >
> > > *1. what's done*
> > >   DSL feature is done, with basic filter/project/aggregation/union/join
> > > and built-in functions/UDF/UDAF (pending on #3491)
> > >
> > > *2. what's on-going*
> > >   more unit tests, and documentation of README/Beam web.
> > >
> > > *3. open questions*
> > >   BEAM-2441 : we want to see any suggestions on the proper module name
> > > for the SQL work. As mentioned in the task, '*dsl/sql* is for the Java
> > > SDK and also prevents alternative language implementations; however,
> > > there's another SQL client and it is not good to be included as a Java
> > > SDK extension'.
> > >
> > > ---
> > > *How to run the example* beam/dsls/sql/example/BeamSqlExample.java
> > > <https://github.com/apache/beam/blob/DSL_SQL/dsls/sql/src/main/java/org/apache/beam/dsls/sql/example/BeamSqlExample.java>
> > > 1. run 'mvn install' to avoid the error in #3439
> > > 
> > > 2. run 'mvn -pl dsls/sql compile exec:java
> > > -Dexec.mainClass=org.apache.beam.dsls.sql.example.BeamSqlExample
> > > -Dexec.args="--runner=DirectRunner" -Pdirect-runner'
> > >
> > > FYI:
> > > 1. burn-down list in google doc
> > >
> > > https://docs.google.com/document/d/1EHZgSu4Jd75iplYpYT_K_JwSZxL2DWG8kv_EmQzNXFc/edit?usp=sharing
> > > 2. JIRA tasks with label 'dsl_sql_merge'
> > >
> > > https://issues.apache.org/jira/browse/BEAM-2555?jql=labels%20%3D%20dsl_sql_merge
> > >
> > >
> > > Mingmin
> > >
> > > On Tue, Jun 13, 2017 at 8:51 AM, Lukasz Cwik  >
> > > wrote:
> > >
> > > > Nevermind, I merged it into #2 about usability.
> > > >
> > > > On Tue, Jun 13, 2017 at 8:50 AM, Lukasz Cwik 
> wrote:
> > > >
> > > > > I added a section about maven module structure/packaging (#6).
> > > > >
> > > > > On Tue, Jun 13, 2017 at 8:30 AM, Tyler Akidau
> > >  > > > >
> > > > > wrote:
> > > > >
> > > > >> Thanks Mingmin. I've copied your list into a doc[1] to make it
> > easier
> > > to
> > > > >> collaborate on comments and edits.
> > > > >>
> > > > >> [1] https://s.apache.org/beam-dsl-sql-burndown
> > > > >>
> > > > >> -Tyler
> > > > >>
> > > > >>
> > > > >> On Mon, Jun 12, 2017 at 10:09 PM Jean-Baptiste Onofré <
> > > j...@nanthrax.net>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Mingmin
> > > > >> >
> > > > >> > Sorry, the meeting was in the middle of the night for me and I
> > > wasn't
> > > > >> able
> > > > >> > to
> > > > >> > make it.
> > > > >> >
> > > > >> > The timing and checklist look good to me.
> > > > >> >
> > > > >> > We plan to do a Beam release end of June, so, merging in July
> > means
> > > we
> > > > >> can
> > > > >> > include it in the next release.
> > > > >> >
> > > > >> > Thanks !
> > > > >> > Regards
> > > > >> > JB
> > > > >> >
> > > > >> > On 06/13/2017 03:06 AM, Mingmin Xu wrote:
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > Thanks for joining the meeting. As discussed, we're planning to
> > merge
> > > > >> DSL_SQL
> > > > >> > > branch back to master, targeted in the middle of July. A tag
> > > > >> > > 'dsl_sql_merge'[1] is created to track all todo tasks.
> > > > >> > >
> > > > >> > > *What's added in Beam SQL?*
> > > > >> > > BeamSQL provides the capability to execute SQL queries with
> > > > >> > > the Beam Java SDK, either by translating SQL to a PTransform
> > > > >> > > or by running a standalone CLI client.
> > > > >> > >
> > > > >> > > *Checklist for merge:*
> > > > >> > > 1. functionality
> > > > >> > >1.1. SQL grammar:
> > > > >> > >  1.1.1. basic query with SELECT/FILTER/PROJECT;
> > > > >> > >  1.1.2. AGGREGATION with global window;
> > > > >> > >  1.1.3. AGGREGATION with FIX_TIME/SLIDING_TIME/SESSION
> > window;
> > > > >> > >  1.1.4. JOIN
> > > > >> > >1.2. UDF/UDAF support;
> > > > >> > >1.3. support predefined String/Math/Date functions, see[2];
> > > > >> > >
> > > > >> > > 2. DSL interface to convert SQL as PTransform;
> > > > >> > >
> > > > >> > > 3. junit test;
> > > > >> > >
> > > > >> > > 4. Java document;
> > > > >> > 

Re: [PROPOSAL] Connectors for memcache and Couchbase

2017-07-21 Thread Madhusudan Borkar
As suggested by @Ismaël, we are planning to start development next week,
using the original proposal as the first version. We would appreciate any
further ideas about changing the proposal; please let us know, otherwise
we will go ahead.
Thanks

Madhu Borkar

On Thu, Jul 20, 2017 at 8:54 PM, Seshadri Raghunathan <
sraghunat...@etouch.net> wrote:

> Yes, that is correct !
>
> Regards,
> Seshadri
>
> -Original Message-
> From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
> Sent: Thursday, July 20, 2017 5:05 PM
> To: dev@beam.apache.org
> Subject: Re: [PROPOSAL] Connectors for memcache and Couchbase
>
> Hi,
>
> So, in short, the plan is:
> - Implement write() that writes a PCollection<KV<byte[], byte[]>> to
> memcached. Can be done as a simple ParDo with some batching to take
> advantage of the multi-put operation.
> - Implement lookup() that converts a PCollection<byte[]> (keys) to a
> PCollection<KV<byte[], byte[]>> (keys paired with values). Can also be
> done as a simple ParDo with some batching to take advantage of the
> multi-get operation.
> spymemcached takes care of everything else (e.g. distributing a batched
> get/put onto the proper servers), so the code of the transform will be
> basically trivial - which is great.
>
> Correct?
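The batching described in the plan above can be sketched as a small buffer
that flushes once it reaches a batch size, mirroring what a DoFn's
processElement/finishBundle pair would do. This is an illustrative,
self-contained sketch: the class name, the batch size, and the flush()
stand-in are assumptions, not Beam or spymemcached APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the batching a write() ParDo could do before issuing one
// spymemcached multi-put per batch. flush() is a stand-in for the real
// client call; the batch size is an arbitrary illustrative choice.
class BatchingWriter {
    private final int batchSize;
    private final Map<String, byte[]> buffer = new HashMap<>();
    private final List<Map<String, byte[]>> flushed = new ArrayList<>();

    BatchingWriter(int batchSize) {
        this.batchSize = batchSize;
    }

    // Called once per element, like DoFn.processElement.
    void process(String key, byte[] value) {
        buffer.put(key, value);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Called at bundle end, like DoFn.finishBundle, to push out a partial batch.
    void finishBundle() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        // Real code would issue a single multi-put to memcached here.
        flushed.add(new HashMap<>(buffer));
        buffer.clear();
    }

    List<Map<String, byte[]>> batches() {
        return flushed;
    }
}
```

In the real transform, flush() would hand the buffered entries to the
spymemcached client in one batched operation instead of recording them.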
>
> On Thu, Jul 20, 2017 at 2:54 PM Seshadri Raghunathan <
> sraghunat...@etouch.net> wrote:
>
> > Thanks Lukasz, Eugene & Ismaël for your inputs.
> >
> > Please find below my comments on various aspects of this proposal -
> >
> > A. read / lookup - takes in a PCollection<byte[]> (keys) and transforms
> > it into a PCollection<KV<byte[], byte[]>>
> >
> > ----------------------------------------------------------------------
> >
> >This is a simple lookup rather than a full read / scan.
> >
> >Splits -
> >-
> >Our idea is similar to Eugene's and Ismaël's on splits.
> > There is no concept of a split for a 'get' operation; internally, the
> > client API (spymemcached) calculates a hash for the key, and the
> > memcache server node mapping to that hash value is probed for lookup.
> > spymemcached API supports a multi-get/lookup operation which takes in
> > a bunch of keys, identifies specific server node (from server farm)
> > for each of these keys and groups them by the server node they map to.
> > The API also provides a way to enable consistent hashing. Each of
> > these {server node - keys list} is grouped as an 'Operation' and
> > enqueued to appropriate server nodes and the lookup is done in an
> > asynchronous manner. reference -
> > https://github.com/couchbase/spymemcached/blob/master/src/main/java/net/spy/memcached/MemcachedClient.java#L1274
> > All this is done under the hood by the spymemcached API. One way to
> > achieve splitting explicitly would be to instantiate a separate
> > spymemcached client for each of the server nodes and treat each of them
> as a separate split.
> > However, in this case the split doesn't make sense, as for a given
> > key/hash value we need not probe all the servers; simply probing the
> > server node that the hash value maps to should suffice. Instead,
> > considering a more granular split at a 'slab' level (per Lukasz's
> > inputs) by using lru_crawler metadump operations is another way to
> > look at it. That approach may not be ideal for this operation, as we
> > could end up reading all the slabs as overkill.
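The multi-get routing described above (hash each key, map the hash to a
server node, group keys by node) can be modeled with a toy router. Modulo
hashing here is a stand-in assumption for spymemcached's real, optionally
consistent, hashing scheme; none of these names are real library APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of what the client does under the hood for a multi-get:
// hash each key, map the hash to a server node, and group keys by node
// so one batched lookup per node can be issued.
class KeyRouter {
    static Map<Integer, List<String>> groupByNode(List<String> keys, int numNodes) {
        Map<Integer, List<String>> byNode = new HashMap<>();
        for (String key : keys) {
            // floorMod keeps the node index non-negative even for negative hash codes.
            int node = Math.floorMod(key.hashCode(), numNodes);
            byNode.computeIfAbsent(node, n -> new ArrayList<>()).add(key);
        }
        return byNode;
    }
}
```

Each resulting {node, keys} group corresponds to one batched 'Operation'
enqueued to that node, as described for spymemcached above.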
> >
> >Consistency -
> >--
> >We concur with Ismaël's thoughts here: 'get' is a
> > point-in-time operation and will transparently reflect the value that
> > is bound with a particular key at a given point of time in the
> > memcache keystore. This is similar to reading a FileSystem or querying
> > a database etc at a specific time and returning the contents / resultset.
> >
> > B. write - takes in a PCollection<KV<byte[], byte[]>> and writes it to
> > memcache
> >
> > ----------------------------------------------------------------------
> >
> > C. Other operations / mutations :
> > ---
> > Other operations that can be supported in a subsequent iteration: add,
> > cas, delete, replace, gets (get with CAS). There are a few more
> > operations, such as incr, decr, append, and prepend, which need a
> > broader discussion on whether to implement them in the transform.
> >
> > A few points on other proposals -
> >
> > Full read Vs Key based read -
> > --
> > We think that a key-based read makes more sense here, as it seems to be
> > the primary use case for memcache. Most applications using memcache use
> > it as a key-value lookup store, and hence it makes sense to build on the
> > same principles while developing a connector in Apache Beam. Also please
> > note that key-value set/lookup is what all memcache
> > implementations do best, though there are other 

Re: [PROPOSAL] Merge gearpump-runner to master

2017-07-21 Thread Kenneth Knowles
+1 to this!

I really want to call out the longevity of contribution behind this,
following many changes in both Beam and Gearpump for over a year. Here's
the first commit on the branch:

commit 9478f4117de3a2d0ea40614ed4cb801918610724 (github/pr/323)
Author: manuzhang 
Date:   Tue Mar 15 16:15:16 2016 +0800

And here are some numbers, FWIW: 163 non-merge commits, 203 total. So
that's a PR and review every couple of weeks.

The ValidatesRunner capability coverage is very good. The only skipped
tests are state/timers, metrics, and TestStream, which many runners have
partial or no support for.

I'll save practical TODOs like moving ValidatesRunner execution to
postcommit, etc. Pending the results of this discussion, of course.

Kenn


On Fri, Jul 21, 2017 at 12:02 AM, Manu Zhang 
wrote:

> Guys,
>
> On behalf of the gearpump team, I'd like to propose to merge the
> gearpump-runner branch into master, which will give it more visibility to
> other contributors and users. The runner satisfies the following criteria
> outlined in contribution guide [1].
>
>
>1. Have at least 2 contributors interested in maintaining it, and 1
>committer interested in supporting it: *Both Huafeng and I have been
>making contributions[2] and we will continue to maintain it. Kenn and JB
>have been supporting the runner (Thank you, guys!)*
>2. Provide both end-user and developer-facing documentation*: They are
>already on the website ([3] and [4]).*
>3. Have at least a basic level of unit test coverage: *We do.* *[5]*
>4. Run all existing applicable integration tests with other Beam
>components and create additional tests as appropriate: *gearpump-runner
>passes ValidatesRunner tests.*
>
>
> Additionally, as a runner,
>
>
>1. Be able to handle a subset of the model that addresses a significant
>set of use cases (aka. ‘traditional batch’ or ‘processing time
> streaming’): *gearpump
>runner is able to handle event time streaming *
>2. Update the capability matrix with the current status: *[4]*
>3. Add a webpage under documentation/runners: *[3]*
>
>
> The PR for the merge: https://github.com/apache/beam/pull/3611
>
> Thanks,
> Manu
>
>
> [1] http://beam.apache.org/contribute/contribution-guide/#feature-branches
> [2] https://issues.apache.org/jira/browse/BEAM-79
> [3] https://beam.apache.org/documentation/runners/gearpump/
> [4] https://beam.apache.org/documentation/runners/capability-matrix/
> [5]
> https://github.com/apache/beam/tree/gearpump-runner/runners/gearpump/src/test/java/org/apache/beam/runners/gearpump
>


Re: Should Pipeline wait till all processing time timers fire before exit?

2017-07-21 Thread Kenneth Knowles
I think the best answer is "yes" we should fire all timers before exit.

This is the subject of https://issues.apache.org/jira/browse/BEAM-2535 which
is a fairly significant enhancement to the model. In this proposal, every
timer is treated like an input with a timestamp and that is independent of
the specification of when to deliver the input.

Right now, processing time timers have no event time timestamp associated
with them, nor any watermark hold. So the window expires and they are
dropped as late data eventually. This is correct according to the current
situation, but we should change it.

However, I don't think a pipeline should necessarily actually wait in
processing time. One of the main uses of the unified batch/streaming model
is to do historical re-processing using the same logic that you used for
real-time processing. So in a historical "batch" query, you want all the
same callbacks, but you should call them as fast as possible. Semantically,
it is the same as a fast clock / slow computation anyhow.
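That "fire everything, but as fast as possible" behavior can be sketched
with a priority queue of pending timers that is drained in timestamp order
at shutdown, without waiting on the wall clock. All names here are
illustrative assumptions, not Beam APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of draining pending processing-time timers before a pipeline
// reaches DONE: fire them in timestamp order, ignoring the real clock.
class TimerDrain {
    static class Timer implements Comparable<Timer> {
        final long fireAtMillis;
        final String tag;

        Timer(long fireAtMillis, String tag) {
            this.fireAtMillis = fireAtMillis;
            this.tag = tag;
        }

        @Override
        public int compareTo(Timer other) {
            return Long.compare(fireAtMillis, other.fireAtMillis);
        }
    }

    private final PriorityQueue<Timer> pending = new PriorityQueue<>();

    void setTimer(long fireAtMillis, String tag) {
        pending.add(new Timer(fireAtMillis, tag));
    }

    // At shutdown: invoke every remaining callback in order, as fast as
    // possible, rather than sleeping until each wall-clock deadline.
    List<String> drain() {
        List<String> fired = new ArrayList<>();
        while (!pending.isEmpty()) {
            fired.add(pending.poll().tag);
        }
        return fired;
    }
}
```

In a historical "batch" run this is exactly the fast-clock semantics Kenn
describes: the same callbacks fire, just without real-time waiting.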

Kenn

On Fri, Jul 21, 2017 at 6:38 AM, Shen Li  wrote:

> If max watermarks arrive at all transforms before some processing time
> timers fire, should the Pipeline wait till all timers fire before turning
> to DONE state?
>
> Thanks,
> Shen
>


Jenkins build is unstable: beam_Release_NightlySnapshot #484

2017-07-21 Thread Apache Jenkins Server
See