[jira] [Created] (BEAM-126) KafkaWindowedWordCountExample fails with JobName invalid

2016-03-18 Thread William McCarthy (JIRA)
William McCarthy created BEAM-126:
-

 Summary: KafkaWindowedWordCountExample fails with JobName invalid
 Key: BEAM-126
 URL: https://issues.apache.org/jira/browse/BEAM-126
 Project: Beam
  Issue Type: Bug
  Components: runner-flink
Reporter: William McCarthy
Assignee: Maximilian Michels


I get the following error when I try to run the KafkaWindowedWordCountExample.

I'm able to fix it by changing line 106 of that file to:
options.setJobName("kafkawindowword" + options.getWindowSize() + "seconds");
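For illustration, here is a minimal, self-contained sketch of the kind of check that rejects the job name (the actual check lives in FlinkPipelineRunner#fromOptions and is shown in the commit below); the sample name used here is hypothetical:

    import com.google.common.base.Preconditions;

    public class JobNameCheckSketch {
      public static void main(String[] args) {
        // Hypothetical job name containing characters outside [-a-z0-9] (spaces, a colon).
        String jobName = "Kafka Example: 10 seconds".toLowerCase();
        // Mirrors the validation in FlinkPipelineRunner: only [-a-z0-9] is allowed,
        // starting with a letter and ending with a letter or number, so this call
        // throws the IllegalArgumentException seen in the stack trace below.
        Preconditions.checkArgument(jobName.matches("[a-z]([-a-z0-9]*[a-z0-9])?"),
            "JobName invalid; the name must consist of only the characters [-a-z0-9], "
                + "starting with a letter and ending with a letter or number");
      }
    }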


flink run -c \
  org.apache.beam.runners.flink.examples.streaming.KafkaWindowedWordCountExample \
  target/beam-1.0-SNAPSHOT.jar \
  test cl-mdgy:2181 cl-pu4p:9092,cl-y06o:9093 mygroup


 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:520)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
    at org.apache.flink.client.program.Client.runBlocking(Client.java:248)
    at org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:866)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1189)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1239)
Caused by: java.lang.RuntimeException: Failed to construct instance from factory method FlinkPipelineRunner#fromOptions(interface com.google.cloud.dataflow.sdk.options.PipelineOptions)
    at com.google.cloud.dataflow.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:233)
    at com.google.cloud.dataflow.sdk.util.InstanceBuilder.build(InstanceBuilder.java:162)
    at com.google.cloud.dataflow.sdk.runners.PipelineRunner.fromOptions(PipelineRunner.java:57)
    at com.google.cloud.dataflow.sdk.Pipeline.create(Pipeline.java:134)
    at org.apache.beam.runners.flink.examples.streaming.KafkaWindowedWordCountExample.main(KafkaWindowedWordCountExample.java:114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:505)
    ... 6 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.dataflow.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:222)
    ... 15 more
Caused by: java.lang.IllegalArgumentException: JobName invalid; the name must consist of only the characters [-a-z0-9], starting with a letter and ending with a letter or number
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
    at org.apache.beam.runners.flink.FlinkPipelineRunner.fromOptions(FlinkPipelineRunner.java:92)
    ... 20 more




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


incubator-beam git commit: [BEAM-126] remove strict job name check

2016-03-18 Thread mxm
Repository: incubator-beam
Updated Branches:
  refs/heads/master 5b5c0e28f -> 81d5ff5a5


[BEAM-126] remove strict job name check


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/81d5ff5a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/81d5ff5a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/81d5ff5a

Branch: refs/heads/master
Commit: 81d5ff5a561ebcf323caea5bdc4363353e5e60dd
Parents: 5b5c0e2
Author: Maximilian Michels 
Authored: Fri Mar 18 15:46:36 2016 +0100
Committer: Maximilian Michels 
Committed: Fri Mar 18 16:01:48 2016 +0100

--
 .../runners/flink/FlinkPipelineExecutionEnvironment.java | 4 ++--
 .../org/apache/beam/runners/flink/FlinkPipelineRunner.java   | 8 
 2 files changed, 2 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/81d5ff5a/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java
--
diff --git a/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java b/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java
index 8825ed3..6f93478 100644
--- a/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java
+++ b/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java
@@ -141,7 +141,7 @@ public class FlinkPipelineExecutionEnvironment {
   if (this.flinkPipelineTranslator == null) {
 throw new RuntimeException("FlinkPipelineTranslator not initialized.");
   }
-  return this.flinkStreamEnv.execute();
+  return this.flinkStreamEnv.execute(options.getJobName());
 } else {
   if (this.flinkBatchEnv == null) {
 throw new RuntimeException("FlinkPipelineExecutionEnvironment not 
initialized.");
@@ -149,7 +149,7 @@ public class FlinkPipelineExecutionEnvironment {
   if (this.flinkPipelineTranslator == null) {
 throw new RuntimeException("FlinkPipelineTranslator not initialized.");
   }
-  return this.flinkBatchEnv.execute();
+  return this.flinkBatchEnv.execute(options.getJobName());
 }
   }
 

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/81d5ff5a/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java
--
diff --git a/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java b/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java
index fe773d9..4f53e35 100644
--- a/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java
+++ b/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java
@@ -87,14 +87,6 @@ public class FlinkPipelineRunner extends PipelineRunner {
   LOG.debug("Classpath elements: {}", flinkOptions.getFilesToStage());
 }
 
-    // Verify jobName according to service requirements.
-    String jobName = flinkOptions.getJobName().toLowerCase();
-    Preconditions.checkArgument(jobName.matches("[a-z]([-a-z0-9]*[a-z0-9])?"), "JobName invalid; " +
-        "the name must consist of only the characters " + "[-a-z0-9], starting with a letter " +
-        "and ending with a letter " + "or number");
-    Preconditions.checkArgument(jobName.length() <= 40,
-        "JobName too long; must be no more than 40 characters in length");
-
 // Set Flink Master to [auto] if no option was specified.
 if (flinkOptions.getFlinkMaster() == null) {
   flinkOptions.setFlinkMaster("[auto]");



[jira] [Created] (BEAM-124) Testing -- End to End WordCount Batch and Streaming Tests

2016-03-18 Thread Steve Wheeler (JIRA)
Steve Wheeler created BEAM-124:
--

 Summary: Testing -- End to End WordCount Batch and Streaming Tests
 Key: BEAM-124
 URL: https://issues.apache.org/jira/browse/BEAM-124
 Project: Beam
  Issue Type: New Feature
  Components: testing
Reporter: Steve Wheeler
Assignee: Davor Bonaci


Set up testing infrastructure so that an end to end test for WordCount (both 
batch and streaming) will be run periodically. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[2/2] incubator-beam git commit: This closes #54

2016-03-18 Thread kenn
This closes #54


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/6ba288d6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/6ba288d6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/6ba288d6

Branch: refs/heads/master
Commit: 6ba288d674ea046786cbbe4a057e0d6b71deba54
Parents: a13dd29 328a147
Author: Kenneth Knowles 
Authored: Wed Mar 16 19:17:29 2016 -0700
Committer: Kenneth Knowles 
Committed: Wed Mar 16 19:17:29 2016 -0700

--
 .../cloud/dataflow/sdk/transforms/windowing/AfterWatermark.java  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--




[3/3] incubator-beam-site git commit: This closes pull request 2

2016-03-18 Thread takidau
This closes pull request 2


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/b04b002c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/b04b002c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/b04b002c

Branch: refs/heads/asf-site
Commit: b04b002caf3c2552a5afa8d674e0a3899fe58ab7
Parents: a8ebbad 931d7f5
Author: Tyler Akidau 
Authored: Thu Mar 17 15:50:38 2016 -0700
Committer: Tyler Akidau 
Committed: Thu Mar 17 15:50:38 2016 -0700

--
 _data/authors.yml   |   8 +
 _data/capability-matrix.yml | 561 +
 _includes/authors-list.md   |   1 +
 _includes/capability-matrix-common.md   |   7 +
 _includes/capability-matrix-row-blog.md |   1 +
 _includes/capability-matrix-row-full.md |   1 +
 _includes/capability-matrix-row-summary.md  |   1 +
 _includes/capability-matrix.md  |  28 +
 _includes/header.html   |  13 +-
 _layouts/post.html  |   4 +-
 _pages/blog.md  |   5 +-
 _pages/capability-matrix.md |  41 +
 _posts/2016-02-22-beam-has-a-logo.markdown  |   3 +-
 _posts/2016-02-22-beam-has-a-logo0.markdown |   3 +-
 _posts/2016-03-17-compatability-matrix.md   | 596 ++
 _sass/capability-matrix.scss| 127 +++
 .../python/sdk/2016/02/25/beam-has-a-logo0.html |   8 +-
 .../website/2016/02/22/beam-has-a-logo.html |  10 +-
 content/blog/index.html | 795 +-
 content/feed.xml| 807 ++-
 content/getting_started/index.html  |   1 +
 content/index.html  |   3 +
 content/issue_tracking/index.html   |   1 +
 content/mailing_lists/index.html|   1 +
 content/privacy_policy/index.html   |   1 +
 content/source_repository/index.html|  10 +-
 content/styles/site.css | 107 +++
 content/team/index.html |   1 +
 styles/site.scss|   1 +
 29 files changed, 3118 insertions(+), 28 deletions(-)
--




[jira] [Commented] (BEAM-133) Test flakiness in the Spark runner

2016-03-18 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201831#comment-15201831
 ] 

Amit Sela commented on BEAM-133:


I guess our Jenkins uses multiple instances/machines, doesn't it?
If so, can we check whether the failures occur on the same one?

Anyway, it's a cache issue.

> Test flakiness in the Spark runner
> --
>
> Key: BEAM-133
> URL: https://issues.apache.org/jira/browse/BEAM-133
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Davor Bonaci
>Assignee: Amit Sela
>
> Jenkins shows some flakiness in the Spark runner in the context of an 
> unrelated pre-commit test.
> {code}
> Results :
> Tests in error: 
>   AvroPipelineTest.testGeneric:75 » Runtime java.io.IOException: Could 
> not creat...
>   NumShardsTest.testText:77 » Runtime java.io.IOException: Could not 
> create File...
>   HadoopFileFormatPipelineTest.testSequenceFile:83 » Runtime 
> java.io.IOException...
>   
> TransformTranslatorTest.testTextIOReadAndWriteTransforms:76->runPipeline:96 » 
> Runtime
>   KafkaStreamingTest.testRun:121 » Runtime java.io.IOException: failure 
> to login...
> Tests run: 27, Failures: 0, Errors: 5, Skipped: 0
> [ERROR] There are test failures.
> {code}
> https://builds.apache.org/job/beam_PreCommit/98/console
> Amit, does this sound like a test code issue or an infrastructure issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-123) Skip header row from csv

2016-03-18 Thread Davin Pidoto (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200315#comment-15200315
 ] 

Davin Pidoto commented on BEAM-123:
---

Hi Daniel,

We are not using any CSV reader libraries. We simply read the file from GCS as 
outlined in the Dataflow docs, then split the string in a ParDo.

String[] event = context.element().toString().split(",");

https://cloud.google.com/dataflow/model/text-io
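For context, here is a minimal sketch of that approach (the bucket path and class names are hypothetical): TextIO reads the file line by line and a ParDo splits each line on commas. Nothing skips a header row, which is the gap this issue asks to close.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class CsvSplitSketch {
      // Splits one CSV line into fields and emits each field downstream.
      static class SplitLineFn extends DoFn<String, String> {
        @Override
        public void processElement(ProcessContext context) {
          String[] event = context.element().split(",");
          for (String field : event) {
            context.output(field);
          }
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(TextIO.Read.from("gs://my-bucket/events.csv"))  // hypothetical input path
         .apply(ParDo.of(new SplitLineFn()));
        p.run();
      }
    }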

> Skip header row from csv 
> -
>
> Key: BEAM-123
> URL: https://issues.apache.org/jira/browse/BEAM-123
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Reporter: Davin Pidoto
>Priority: Minor
>
> Add functionality to skip header rows when reading from a csv file.
> http://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[1/3] incubator-beam-site git commit: Add content files missed in PR2 due to an overly aggressive .gitignore filter

2016-03-18 Thread davor
Repository: incubator-beam-site
Updated Branches:
  refs/heads/asf-site b04b002ca -> 2f7336f0e


http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/04e1fbbb/content/capability-matrix/index.html
--
diff --git a/content/capability-matrix/index.html b/content/capability-matrix/index.html
new file mode 100644
index 000..7047b1f
--- /dev/null
+++ b/content/capability-matrix/index.html
@@ -0,0 +1,1652 @@
[HTML page head, Google Analytics snippet, and site navigation markup elided from this excerpt]
+Apache Beam Capability Matrix
+
+Apache Beam (incubating) provides a portable API layer for building sophisticated data-parallel processing engines that may be executed across a diversity of execution engines, or runners. The core concepts of this layer are based upon the Beam Model (formerly referred to as the Dataflow Model, http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf), and implemented to varying degrees in each Beam runner. To help clarify the capabilities of individual runners, we've created the capability matrix below.
+
+Individual capabilities have been grouped by their corresponding What / Where / When / How question:
+
+  What results are being calculated?
+  Where in event time?
+  When in processing time?
+  How do refinements of results relate?
+
+For more details on the What / Where / When / How breakdown of concepts, we recommend reading through the Streaming 102 post on O'Reilly Radar (http://oreilly.com/ideas/the-world-beyond-batch-streaming-102).
+
+Note that in the future, we intend to add additional tables beyond the current set, for things like runtime characteristics (e.g. at-least-once vs exactly-once), performance, etc.
+
+  function ToggleTables(showDetails, anchor) {
+    document.getElementById("cap-summary").style.display = showDetails ? "none" : "block";
+    document.getElementById("cap-full").style.display = showDetails ? "block" : "none";
+    location.hash = anchor;
+  }
[capability matrix tables elided: summary and detail views comparing the Beam Model, Google Cloud Dataflow, Apache Flink, and Apache Spark across capabilities such as ParDo, GroupByKey, Flatten, Combine, Composite Transforms, Side Inputs, Source API, Aggregators, and Keyed State, grouped under "What is being computed?" and "Where in event time?"]

[3/8] incubator-beam git commit: Move Java 8 examples to their own module

2016-03-18 Thread kenn
http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/e5c7cf88/java8examples/src/main/java/com/google/cloud/dataflow/examples/complete/game/README.md
--
diff --git a/java8examples/src/main/java/com/google/cloud/dataflow/examples/complete/game/README.md b/java8examples/src/main/java/com/google/cloud/dataflow/examples/complete/game/README.md
new file mode 100644
index 000..79b55ce
--- /dev/null
+++ b/java8examples/src/main/java/com/google/cloud/dataflow/examples/complete/game/README.md
@@ -0,0 +1,113 @@
+
+# 'Gaming' examples
+
+
+This directory holds a series of example Dataflow pipelines in a simple 'mobile
+gaming' domain. They all require Java 8.  Each pipeline successively introduces
+new concepts, and gives some examples of using Java 8 syntax in constructing
+Dataflow pipelines. Other than usage of Java 8 lambda expressions, the concepts
+that are used apply equally well in Java 7.
+
+In the gaming scenario, many users play, as members of different teams, over
+the course of a day, and their actions are logged for processing. Some of the
+logged game events may be late-arriving, if users play on mobile devices and go
+transiently offline for a period.
+
+The scenario includes not only "regular" users, but "robot users", which have a
+higher click rate than the regular users, and may move from team to team.
+
+The first two pipelines in the series use pre-generated batch data samples. The
+second two pipelines read from a [PubSub](https://cloud.google.com/pubsub/)
+topic input.  For these examples, you will also need to run the
+`injector.Injector` program, which generates and publishes the gaming data to
+PubSub. The javadocs for each pipeline have more detailed information on how to
+run that pipeline.
+
+All of these pipelines write their results to BigQuery table(s).
+
+
+## The pipelines in the 'gaming' series
+
+### UserScore
+
+The first pipeline in the series is `UserScore`. This pipeline does batch
+processing of data collected from gaming events. It calculates the sum of
+scores per user, over an entire batch of gaming data (collected, say, for each
+day). The batch processing will not include any late data that arrives after
+the day's cutoff point.
+
+### HourlyTeamScore
+
+The next pipeline in the series is `HourlyTeamScore`. This pipeline also
+processes data collected from gaming events in batch. It builds on `UserScore`,
+but uses [fixed windows](https://cloud.google.com/dataflow/model/windowing), by
+default an hour in duration. It calculates the sum of scores per team, for each
+window, optionally allowing specification of two timestamps before and after
+which data is filtered out. This allows a model where late data collected after
+the intended analysis window can be included in the analysis, and any late-
+arriving data prior to the beginning of the analysis window can be removed as
+well.
+
+By using windowing and adding element timestamps, we can do finer-grained
+analysis than with the `UserScore` pipeline — we're now tracking scores for
+each hour rather than over the course of a whole day. However, our batch
+processing is high-latency, in that we don't get results from plays at the
+beginning of the batch's time period until the complete batch is processed.
+
+### LeaderBoard
+
+The third pipeline in the series is `LeaderBoard`. This pipeline processes an
+unbounded stream of 'game events' from a PubSub topic. The calculation of the
+team scores uses fixed windowing based on event time (the time of the game play
+event), not processing time (the time that an event is processed by the
+pipeline). The pipeline calculates the sum of scores per team, for each window.
+By default, the team scores are calculated using one-hour windows.
+
+In contrast — to demo another windowing option — the user scores are calculated
+using a global window, which periodically (every ten minutes) emits cumulative
+user score sums.
+
+In contrast to the previous pipelines in the series, which used static, finite
+input data, here we're using an unbounded data source, which lets us provide
+_speculative_ results, and allows handling of late data, at much lower latency.
+E.g., we could use the early/speculative results to keep a 'leaderboard'
+updated in near-realtime. Our handling of late data lets us generate correct
+results, e.g. for 'team prizes'. We're now outputting window results as they're
+calculated, giving us much lower latency than with the previous batch examples.
+
+### GameStats
+
+The fourth pipeline in the series is `GameStats`. This pipeline builds
+on the `LeaderBoard` functionality — supporting output of speculative and late
+data — and adds some "business intelligence" analysis: identifying abuse
+detection. The pipeline derives the Mean user score sum for a window, and uses
+that information to identify likely spammers/robots. (The injector is designed
+so that the "robots" have a higher 

[jira] [Commented] (BEAM-129) Support pubsub IO

2016-03-18 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201841#comment-15201841
 ] 

Kostas Kloudas commented on BEAM-129:
-

Hi Mark,

Thanks a lot for reporting this and for working on it!

We just migrated all the known issues of the Flink-runner 
from our previous repository to the BEAM one.

Kostas

> Support pubsub IO
> -
>
> Key: BEAM-129
> URL: https://issues.apache.org/jira/browse/BEAM-129
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Kostas Kloudas
>Assignee: Maximilian Michels
>
> Support pubsub IO



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request: Improve PipelineOptionsFactoryTest

2016-03-18 Thread tgroh
GitHub user tgroh opened a pull request:

https://github.com/apache/incubator-beam/pull/60

Improve PipelineOptionsFactoryTest

Currently the test uses the literal string for the default runner and
available runners. Instead, refer to the default runner class and
extract the simple name from that class.

Automatically figure out portions of the error message for unknown runners.
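
A minimal sketch of the idea (assuming DirectPipelineRunner is the SDK's default runner): derive the expected name from the class itself, so the test keeps working if the default ever changes.

    import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;

    public class DefaultRunnerNameSketch {
      public static void main(String[] args) {
        // The expected value is computed from the class, not hard-coded as a literal string.
        String expectedDefaultRunner = DirectPipelineRunner.class.getSimpleName();
        System.out.println(expectedDefaultRunner);  // prints "DirectPipelineRunner"
      }
    }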

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgroh/incubator-beam 
more_resilient_options_factory_test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/60.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #60


commit 2feabffb1281534f9d8e86e9502c97e4f4bdfdb5
Author: Thomas Groh 
Date:   2016-03-18T22:20:49Z

Improve PipelineOptionsFactoryTest

Currently the test uses the literal string for the default runner and
available runners. Instead, refer to the default runner class and
extract the simple name from that class.

Automatically figure out portions of the error message for unknown runners.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request: [BEAM-136] Look up a runner if it is ...

2016-03-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/61


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (BEAM-133) Test flakiness in the Spark runner

2016-03-18 Thread Jason Kuster (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202273#comment-15202273
 ] 

Jason Kuster commented on BEAM-133:
---

We've changed the Jenkins projects to use a Maven repository local to the 
workspace. This should fix the problem; please let us know if it does not.

> Test flakiness in the Spark runner
> --
>
> Key: BEAM-133
> URL: https://issues.apache.org/jira/browse/BEAM-133
> Project: Beam
>  Issue Type: Bug
>  Components: project-management
>Reporter: Davor Bonaci
>Assignee: Davor Bonaci
>
> Jenkins shows some flakiness in the Spark runner in the context of an 
> unrelated pre-commit test.
> {code}
> Results :
> Tests in error: 
>   AvroPipelineTest.testGeneric:75 » Runtime java.io.IOException: Could 
> not creat...
>   NumShardsTest.testText:77 » Runtime java.io.IOException: Could not 
> create File...
>   HadoopFileFormatPipelineTest.testSequenceFile:83 » Runtime 
> java.io.IOException...
>   
> TransformTranslatorTest.testTextIOReadAndWriteTransforms:76->runPipeline:96 » 
> Runtime
>   KafkaStreamingTest.testRun:121 » Runtime java.io.IOException: failure 
> to login...
> Tests run: 27, Failures: 0, Errors: 5, Skipped: 0
> [ERROR] There are test failures.
> {code}
> https://builds.apache.org/job/beam_PreCommit/98/console
> Amit, does this sound like a test code issue or an infrastructure issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request: [BEAM-22] Implement InProcessPipeline...

2016-03-18 Thread tgroh
GitHub user tgroh opened a pull request:

https://github.com/apache/incubator-beam/pull/62

[BEAM-22] Implement InProcessPipelineRunner#run

Appropriately construct an evaluation context and executor, and start
the pipeline when run is called.

Implement InProcessPipelineResult.

Apply PTransform overrides.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgroh/incubator-beam ippr_runner_run

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/62.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #62


commit bb767860e13b8311c6cd090e5aeb9c323396638b
Author: Thomas Groh 
Date:   2016-02-27T01:30:13Z

Implement InProcessPipelineRunner#run

Appropriately construct an evaluation context and executor, and start
the pipeline when run is called.

Implement InProcessPipelineResult.

Apply PTransform overrides.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (BEAM-22) DirectPipelineRunner: support for unbounded collections

2016-03-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202364#comment-15202364
 ] 

ASF GitHub Bot commented on BEAM-22:


GitHub user tgroh opened a pull request:

https://github.com/apache/incubator-beam/pull/62

[BEAM-22] Implement InProcessPipelineRunner#run

Appropriately construct an evaluation context and executor, and start
the pipeline when run is called.

Implement InProcessPipelineResult.

Apply PTransform overrides.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgroh/incubator-beam ippr_runner_run

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/62.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #62


commit bb767860e13b8311c6cd090e5aeb9c323396638b
Author: Thomas Groh 
Date:   2016-02-27T01:30:13Z

Implement InProcessPipelineRunner#run

Appropriately construct an evaluation context and executor, and start
the pipeline when run is called.

Implement InProcessPipelineResult.

Apply PTransform overrides.




> DirectPipelineRunner: support for unbounded collections
> ---
>
> Key: BEAM-22
> URL: https://issues.apache.org/jira/browse/BEAM-22
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-direct
>Reporter: Davor Bonaci
>Assignee: Thomas Groh
>
> DirectPipelineRunner currently runs over bounded PCollections only, and 
> implements only a portion of the Beam Model.
> We should improve it to faithfully implement the full Beam Model, such as adding
> the ability to run over unbounded PCollections, and to better resemble the
> execution model in a distributed system.
> This further enables features such as a testing source which may simulate 
> late data and test triggers in the pipeline. Finally, we may want to expose 
> an option to select between "debug" (single threaded), "chaos monkey" (test 
> as many model requirements as possible), and "performance" (multi-threaded).
> more testing (chaos monkey) 
> Once this is done, we should update this StackOverflow question:
> http://stackoverflow.com/questions/35350113/testing-triggers-with-processing-time/35401426#35401426



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-133) Test flakiness in the Spark runner

2016-03-18 Thread Davor Bonaci (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201834#comment-15201834
 ] 

Davor Bonaci commented on BEAM-133:
---

Acknowledged. I'll take this over.

> Test flakiness in the Spark runner
> --
>
> Key: BEAM-133
> URL: https://issues.apache.org/jira/browse/BEAM-133
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Davor Bonaci
>Assignee: Davor Bonaci
>
> Jenkins shows some flakiness in the Spark runner in the context of an 
> unrelated pre-commit test.
> {code}
> Results :
> Tests in error: 
>   AvroPipelineTest.testGeneric:75 » Runtime java.io.IOException: Could 
> not creat...
>   NumShardsTest.testText:77 » Runtime java.io.IOException: Could not 
> create File...
>   HadoopFileFormatPipelineTest.testSequenceFile:83 » Runtime 
> java.io.IOException...
>   
> TransformTranslatorTest.testTextIOReadAndWriteTransforms:76->runPipeline:96 » 
> Runtime
>   KafkaStreamingTest.testRun:121 » Runtime java.io.IOException: failure 
> to login...
> Tests run: 27, Failures: 0, Errors: 5, Skipped: 0
> [ERROR] There are test failures.
> {code}
> https://builds.apache.org/job/beam_PreCommit/98/console
> Amit, does this sound like a test code issue or an infrastructure issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-128) The transform BigQueryIO.Read is currently not supported.

2016-03-18 Thread Kostas Kloudas (JIRA)
Kostas Kloudas created BEAM-128:
---

 Summary: The transform BigQueryIO.Read is currently not supported.
 Key: BEAM-128
 URL: https://issues.apache.org/jira/browse/BEAM-128
 Project: Beam
  Issue Type: New Feature
  Components: runner-flink
Reporter: Kostas Kloudas
Assignee: Maximilian Michels






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-133) Test flakiness in the Spark runner

2016-03-18 Thread Davor Bonaci (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davor Bonaci closed BEAM-133.
-

> Test flakiness in the Spark runner
> --
>
> Key: BEAM-133
> URL: https://issues.apache.org/jira/browse/BEAM-133
> Project: Beam
>  Issue Type: Bug
>  Components: project-management
>Reporter: Davor Bonaci
>Assignee: Davor Bonaci
>
> Jenkins shows some flakiness in the Spark runner in the context of an 
> unrelated pre-commit test.
> {code}
> Results :
> Tests in error: 
>   AvroPipelineTest.testGeneric:75 » Runtime java.io.IOException: Could 
> not creat...
>   NumShardsTest.testText:77 » Runtime java.io.IOException: Could not 
> create File...
>   HadoopFileFormatPipelineTest.testSequenceFile:83 » Runtime 
> java.io.IOException...
>   
> TransformTranslatorTest.testTextIOReadAndWriteTransforms:76->runPipeline:96 » 
> Runtime
>   KafkaStreamingTest.testRun:121 » Runtime java.io.IOException: failure 
> to login...
> Tests run: 27, Failures: 0, Errors: 5, Skipped: 0
> [ERROR] There are test failures.
> {code}
> https://builds.apache.org/job/beam_PreCommit/98/console
> Amit, does this sound like a test code issue or an infrastructure issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[2/2] incubator-beam git commit: This closes #55

2016-03-18 Thread amitsela
This closes #55


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/659f0b87
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/659f0b87
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/659f0b87

Branch: refs/heads/master
Commit: 659f0b8779c05e160d7e2c031b18b7bceb00de79
Parents: ef1e32d b0db313
Author: Sela 
Authored: Thu Mar 17 18:58:54 2016 +0200
Committer: Sela 
Committed: Thu Mar 17 18:58:54 2016 +0200

--
 runners/spark/README.md | 112 ---
 1 file changed, 63 insertions(+), 49 deletions(-)
--




[2/3] incubator-beam-site git commit: Add content files missed in PR2 due to an overly aggressive .gitignore filter

2016-03-18 Thread davor
Add content files missed in PR2 due to an overly aggressive .gitignore filter

Fix baseurl used for local staging, and s/compatability/capability/ in blog 
post filename

Fix one more s/compatability/capability/ typo

Fix up capability rename for reals.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/04e1fbbb
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/04e1fbbb
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/04e1fbbb

Branch: refs/heads/asf-site
Commit: 04e1fbbb035d6bfd53268774daa85c7c98105d93
Parents: b04b002
Author: Tyler Akidau 
Authored: Thu Mar 17 15:54:47 2016 -0700
Committer: Davor Bonaci 
Committed: Thu Mar 17 16:34:05 2016 -0700

--
 _posts/2016-03-17-capability-matrix.md  |  596 +++
 _posts/2016-03-17-compatability-matrix.md   |  596 ---
 .../2016/03/17/capability-matrix.html   |  896 ++
 content/blog/index.html |2 +-
 content/capability-matrix/index.html| 1652 ++
 content/feed.xml|   10 +-
 content/index.html  |2 +-
 7 files changed, 3151 insertions(+), 603 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/04e1fbbb/_posts/2016-03-17-capability-matrix.md
--
diff --git a/_posts/2016-03-17-capability-matrix.md b/_posts/2016-03-17-capability-matrix.md
new file mode 100644
index 000..9ede7d9
--- /dev/null
+++ b/_posts/2016-03-17-capability-matrix.md
@@ -0,0 +1,596 @@
+---
+layout: post
+title:  "Clarifying & Formalizing Runner Capabilities"
+date:   2016-03-17 11:00:00 -0700
+excerpt_separator: 
+categories: beam capability
+authors:
+  - fjp
+  - takidau
+
+capability-matrix-snapshot:
+  columns:
+- class: model
+  name: Beam Model
+- class: dataflow
+  name: Google Cloud Dataflow
+- class: flink
+  name: Apache Flink
+- class: spark
+  name: Apache Spark
+  categories:
+- description: What is being computed?
+  anchor: what
+  color-b: 'ca1'
+  color-y: 'ec3'
+  color-p: 'fe5'
+  color-n: 'ddd'
+  rows:
+- name: ParDo
+  values:
+- class: model
+  l1: 'Yes'
+  l2: element-wise processing
+  l3: Element-wise transformation parameterized by a chunk of user code. Elements are processed in bundles, with initialization and termination hooks. Bundle size is chosen by the runner and cannot be controlled by user code. ParDo processes a main input PCollection one element at a time, but provides side input access to additional PCollections.
+- class: dataflow
+  l1: 'Yes'
+  l2: fully supported
+  l3: Batch mode uses large bundle sizes. Streaming uses smaller bundle sizes.
+- class: flink
+  l1: 'Yes'
+  l2: fully supported
+  l3: ParDo itself, as per-element transformation with UDFs, is fully supported by Flink for both batch and streaming.
+- class: spark
+  l1: 'Yes'
+  l2: fully supported
+  l3: ParDo applies per-element transformations as Spark FlatMapFunction.
+- name: GroupByKey
+  values:
+- class: model
+  l1: 'Yes'
+  l2: key grouping
+  l3: Grouping of key-value pairs per key, window, and pane. (See also other tabs.)
+- class: dataflow
+  l1: 'Yes'
+  l2: fully supported
+  l3: ''
+- class: flink
+  l1: 'Yes'
+  l2: fully supported
+  l3: "Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes) the Flink runner uses the Beam code. This guarantees support for all windowing and triggering mechanisms."
+- class: spark
+  l1: 'Partially'
+  l2: group by window in batch only
+  l3: "Uses Spark's groupByKey for grouping. Grouping by window is currently only supported in batch."
+- name: Flatten
+  values:
+- class: model
+  l1: 'Yes'
+  l2: collection concatenation
+  l3: Concatenates multiple homogeneously typed collections together.
+- class: dataflow
+  l1: 'Yes'
+  l2: fully supported
+  l3: ''
+- class: flink
+  l1: 'Yes'
+  l2: fully supported
+  l3: ''
+- class: spark
+  l1: 'Yes'
+  l2: fully supported
+  

[1/2] incubator-beam git commit: Look up a runner if it is not registered

2016-03-18 Thread lcwik
Repository: incubator-beam
Updated Branches:
  refs/heads/master 81d5ff5a5 -> a461e006a


Look up a runner if it is not registered

If a fully qualified runner is passed as the value of --runner, and it
is not present within the map of registered runners, attempts to look
up the runner using Class#forName, and uses the result class if the
result class is an instance of PipelineRunner. This brings the behavior
in line with the described behavior in PipelineOptions.
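
A hedged usage sketch of that behavior (the runner class name below is hypothetical): a fully qualified runner that is on the classpath but not registered via a PipelineRunnerRegistrar can now be selected with --runner.

    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

    public class RunnerLookupSketch {
      public static void main(String[] args) {
        // "com.example.MyCustomRunner" stands in for any PipelineRunner subclass on the
        // classpath; if the class cannot be found or does not extend PipelineRunner,
        // an IllegalArgumentException is thrown instead.
        String[] cliArgs = {"--runner=com.example.MyCustomRunner"};
        PipelineOptions options = PipelineOptionsFactory.fromArgs(cliArgs).create();
        System.out.println(options.getRunner());
      }
    }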


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/e9dd155a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/e9dd155a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/e9dd155a

Branch: refs/heads/master
Commit: e9dd155a8dcc127f2c3ca4ae522d00277bc1
Parents: 81d5ff5
Author: Thomas Groh 
Authored: Fri Mar 18 16:20:56 2016 -0700
Committer: Luke Cwik 
Committed: Fri Mar 18 19:19:43 2016 -0700

--
 .../sdk/options/PipelineOptionsFactory.java | 31 ++---
 .../sdk/options/PipelineOptionsFactoryTest.java | 36 
 2 files changed, 62 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/e9dd155a/sdk/src/main/java/com/google/cloud/dataflow/sdk/options/PipelineOptionsFactory.java
--
diff --git a/sdk/src/main/java/com/google/cloud/dataflow/sdk/options/PipelineOptionsFactory.java b/sdk/src/main/java/com/google/cloud/dataflow/sdk/options/PipelineOptionsFactory.java
index e77b89f..48cff6d 100644
--- a/sdk/src/main/java/com/google/cloud/dataflow/sdk/options/PipelineOptionsFactory.java
+++ b/sdk/src/main/java/com/google/cloud/dataflow/sdk/options/PipelineOptionsFactory.java
@@ -16,6 +16,8 @@
 
 package com.google.cloud.dataflow.sdk.options;
 
+import static com.google.common.base.Preconditions.checkArgument;
+
 import com.google.cloud.dataflow.sdk.options.Validation.Required;
 import com.google.cloud.dataflow.sdk.runners.PipelineRunner;
 import com.google.cloud.dataflow.sdk.runners.PipelineRunnerRegistrar;
@@ -1391,7 +1393,10 @@ public class PipelineOptionsFactory {
    * split up each string on ','.
    *
    * We special case the "runner" option. It is mapped to the class of the {@link PipelineRunner}
-   * based off of the {@link PipelineRunner}s simple class name or fully qualified class name.
+   * based off of the {@link PipelineRunner PipelineRunners} simple class name. If the provided
+   * runner name is not registered via a {@link PipelineRunnerRegistrar}, we attempt to obtain the
+   * class that the name represents using {@link Class#forName(String)} and use the result class if
+   * it subclasses {@link PipelineRunner}.
    *
    * If strict parsing is enabled, unknown options or options that cannot be converted to
    * the expected java type using an {@link ObjectMapper} will be ignored.
@@ -1442,10 +1447,26 @@ public class PipelineOptionsFactory {
         JavaType type = MAPPER.getTypeFactory().constructType(method.getGenericReturnType());
         if ("runner".equals(entry.getKey())) {
           String runner = Iterables.getOnlyElement(entry.getValue());
-          Preconditions.checkArgument(SUPPORTED_PIPELINE_RUNNERS.containsKey(runner),
-              "Unknown 'runner' specified '%s', supported pipeline runners %s",
-              runner, Sets.newTreeSet(SUPPORTED_PIPELINE_RUNNERS.keySet()));
-          convertedOptions.put("runner", SUPPORTED_PIPELINE_RUNNERS.get(runner));
+          if (SUPPORTED_PIPELINE_RUNNERS.containsKey(runner)) {
+            convertedOptions.put("runner", SUPPORTED_PIPELINE_RUNNERS.get(runner));
+          } else {
+            try {
+              Class<?> runnerClass = Class.forName(runner);
+              checkArgument(
+                  PipelineRunner.class.isAssignableFrom(runnerClass),
+                  "Class '%s' does not implement PipelineRunner. Supported pipeline runners %s",
+                  runner,
+                  Sets.newTreeSet(SUPPORTED_PIPELINE_RUNNERS.keySet()));
+              convertedOptions.put("runner", runnerClass);
+            } catch (ClassNotFoundException e) {
+              String msg =
+                  String.format(
+                      "Unknown 'runner' specified '%s', supported pipeline runners %s",
+                      runner,
+                      Sets.newTreeSet(SUPPORTED_PIPELINE_RUNNERS.keySet()));
+              throw new IllegalArgumentException(msg, e);
+            }
+          }
         } else if ((returnType.isArray() && (SIMPLE_TYPES.contains(returnType.getComponentType())
             || returnType.getComponentType().isEnum()))
             || Collection.class.isAssignableFrom(returnType)) {


[jira] [Updated] (BEAM-53) PubSubIO: reimplement in Java

2016-03-18 Thread Mark Shields (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Shields updated BEAM-53:
-
Component/s: (was: sdk-java-gcp)
 runner-core

> PubSubIO: reimplement in Java
> -
>
> Key: BEAM-53
> URL: https://issues.apache.org/jira/browse/BEAM-53
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Daniel Halperin
>Assignee: Mark Shields
>Priority: Minor
>
> PubSubIO is currently only partially implemented in Java: the 
> DirectPipelineRunner uses a non-scalable API in a single-threaded manner.
> In contrast, the DataflowPipelineRunner uses an entirely different code path 
> implemented in the Google Cloud Dataflow service.
> We need to reimplement PubSubIO in Java in order to support other runners in 
> a scalable way.
> Additionally, we can take this opportunity to add new features:
> * getting timestamp from an arbitrary lambda in arbitrary formats rather than 
> from a message attribute in only 2 formats.
> * exposing metadata and attributes in the elements produced by PubSubIO.Read
> * setting metadata and attributes in the messages written by PubSubIO.Write



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)