[jira] [Created] (BEAM-2819) Key only reads from Google Datastore

2017-08-28 Thread Ilnar Taichinov (JIRA)
Ilnar Taichinov created BEAM-2819:
-

 Summary: Key only reads from Google Datastore
 Key: BEAM-2819
 URL: https://issues.apache.org/jira/browse/BEAM-2819
 Project: Beam
  Issue Type: Wish
  Components: sdk-java-gcp
Reporter: Ilnar Taichinov
Assignee: Chamikara Jayalath


Currently there is no functionality for reading only keys from Google 
Datastore through the Datastore IO. In some cases users don't need to read the 
whole entity, e.g. when filtering by certain values in the ancestry. This seems 
to be an important feature: entity reads are expensive, which is why the native 
Datastore client/API allows running key-only queries. 
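For reference, the native client exposes this directly; below is a minimal 
sketch using the google-cloud-datastore Python client (the kind name {{Task}} 
is a made-up example):
{code}
# Sketch: a key-only query with the native Datastore client, not DatastoreIO.
# Assumes the google-cloud-datastore package; the kind 'Task' is hypothetical.
from google.cloud import datastore

client = datastore.Client()
query = client.query(kind='Task')
query.keys_only()  # the service returns only entity keys, not full entities

keys = [entity.key for entity in query.fetch()]
print(len(keys))
{code}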



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2126) Add JStorm runner to "Contribute > Technical References > Ongoing Projects"

2017-08-28 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He closed BEAM-2126.

   Resolution: Fixed
Fix Version/s: Not applicable

> Add JStorm runner to "Contribute > Technical References > Ongoing Projects"
> ---
>
> Key: BEAM-2126
> URL: https://issues.apache.org/jira/browse/BEAM-2126
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-jstorm, website
>Reporter: Kenneth Knowles
>Assignee: Pei He
> Fix For: Not applicable
>
>
> We should have this effort listed here: 
> https://beam.apache.org/contribute/work-in-progress/





Jenkins build is back to stable : beam_PostCommit_Java_ValidatesRunner_Dataflow #3860

2017-08-28 Thread Apache Jenkins Server
See 




[GitHub] beam pull request #3774: Test show stall

2017-08-28 Thread mariapython
Github user mariapython closed the pull request at:

https://github.com/apache/beam/pull/3774


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Jenkins build is back to normal : beam_PostCommit_Java_MavenInstall #4670

2017-08-28 Thread Apache Jenkins Server
See 




Build failed in Jenkins: beam_PostCommit_Java_MavenInstall #4669

2017-08-28 Thread Apache Jenkins Server
See 


Changes:

[altay] Use the same termination logic in different places

--
[...truncated 3.61 MB...]
[WARNING] Cannot calculate digest of mojo class, because mojo wasn't loaded 
from a jar, but from: 

2017-08-29T01:21:20.116 [INFO] 
2017-08-29T01:21:20.116 [INFO] --- maven-compiler-plugin:3.6.2:testCompile 
(default-testCompile) @ beam-runners-flink_2.10 ---
2017-08-29T01:21:20.125 [INFO] Changes detected - recompiling the module!
2017-08-29T01:21:20.126 [INFO] Compiling 17 source files to 

2017-08-29T01:21:20.636 [INFO] 
:
 Some input files use unchecked or unsafe operations.
2017-08-29T01:21:20.636 [INFO] 
:
 Recompile with -Xlint:unchecked for details.
[WARNING] Cannot calculate digest of mojo class, because mojo wasn't loaded 
from a jar, but from: 

2017-08-29T01:21:20.743 [INFO] 
2017-08-29T01:21:20.743 [INFO] --- maven-checkstyle-plugin:2.17:check (default) 
@ beam-runners-flink_2.10 ---
2017-08-29T01:21:22.034 [INFO] Starting audit...
Audit done.
[WARNING] Cannot calculate digest of mojo class, because mojo wasn't loaded 
from a jar, but from: 

2017-08-29T01:21:22.156 [INFO] 
2017-08-29T01:21:22.156 [INFO] >>> findbugs-maven-plugin:3.0.4:check (default) 
> :findbugs @ beam-runners-flink_2.10 >>>
2017-08-29T01:21:22.174 [INFO] 
2017-08-29T01:21:22.174 [INFO] --- findbugs-maven-plugin:3.0.4:findbugs 
(findbugs) @ beam-runners-flink_2.10 ---
2017-08-29T01:21:22.180 [INFO] Fork Value is true
[WARNING] Cannot calculate digest of mojo class, because mojo wasn't loaded 
from a jar, but from: 

[JENKINS] Archiving disabled
2017-08-29T01:21:23.279 [INFO]  
   
2017-08-29T01:21:23.280 [INFO] 

2017-08-29T01:21:23.280 [INFO] Skipping Apache Beam :: Parent
2017-08-29T01:21:23.280 [INFO] This project has been banned from the build due 
to previous failures.
2017-08-29T01:21:23.280 [INFO] 

[JENKINS] Archiving disabled

Jenkins build became unstable: beam_PostCommit_Java_MavenInstall #4667

2017-08-28 Thread Apache Jenkins Server
See 




[jira] [Updated] (BEAM-2817) Bigquery queries should allow options to run in batch mode or not

2017-08-28 Thread Chamikara Jayalath (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Jayalath updated BEAM-2817:
-
Labels: newbie starter  (was: )

> Bigquery queries should allow options to run in batch mode or not
> -
>
> Key: BEAM-2817
> URL: https://issues.apache.org/jira/browse/BEAM-2817
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.0.0
>Reporter: Lara Schmidt
>Assignee: Chamikara Jayalath
>  Labels: newbie, starter
>
> When the BigQuery read runs a query, it sets the mode to batch. A batch query 
> can be very slow to schedule, since it is batched with other queries, but it 
> does not count against the interactive query quota, which is better for some 
> cases. In other cases a fast query is better (especially in timed tests). It 
> would be a good idea to add a configuration to the BigQuery source to set 
> this per read.





[jira] [Assigned] (BEAM-2817) Bigquery queries should allow options to run in batch mode or not

2017-08-28 Thread Chamikara Jayalath (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Jayalath reassigned BEAM-2817:


Assignee: (was: Chamikara Jayalath)

> Bigquery queries should allow options to run in batch mode or not
> -
>
> Key: BEAM-2817
> URL: https://issues.apache.org/jira/browse/BEAM-2817
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.0.0
>Reporter: Lara Schmidt
>  Labels: newbie, starter
>
> When the BigQuery read runs a query, it sets the mode to batch. A batch query 
> can be very slow to schedule, since it is batched with other queries, but it 
> does not count against the interactive query quota, which is better for some 
> cases. In other cases a fast query is better (especially in timed tests). It 
> would be a good idea to add a configuration to the BigQuery source to set 
> this per read.





[jira] [Commented] (BEAM-2817) Bigquery queries should allow options to run in batch mode or not

2017-08-28 Thread Chamikara Jayalath (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144591#comment-16144591
 ] 

Chamikara Jayalath commented on BEAM-2817:
--

Looks like this will be a good starter/newbie issue.

Relevant links:

https://cloud.google.com/bigquery/docs/running-queries#bigquery-query-batch-api
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryQuerySource.java#L167
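For context, the underlying API exposes priority as a per-job setting; here is 
a minimal sketch of the knob such a configuration would toggle, using the 
google-cloud-bigquery Python client (the query string is a placeholder):
{code}
# Sketch: the per-job priority setting that a per-read configuration would
# expose. Uses the google-cloud-bigquery client; 'SELECT 1' is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)
# QueryPriority.INTERACTIVE schedules immediately but counts against the
# concurrent query limit; BATCH queues the job but does not.
job = client.query('SELECT 1', job_config=config)
print(list(job.result()))
{code}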

> Bigquery queries should allow options to run in batch mode or not
> -
>
> Key: BEAM-2817
> URL: https://issues.apache.org/jira/browse/BEAM-2817
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.0.0
>Reporter: Lara Schmidt
>Assignee: Chamikara Jayalath
>
> When the BigQuery read runs a query, it sets the mode to batch. A batch query 
> can be very slow to schedule, since it is batched with other queries, but it 
> does not count against the interactive query quota, which is better for some 
> cases. In other cases a fast query is better (especially in timed tests). It 
> would be a good idea to add a configuration to the BigQuery source to set 
> this per read.





[1/2] beam git commit: Updates Dataflow worker to 20170825

2017-08-28 Thread jkff
Repository: beam
Updated Branches:
  refs/heads/master 79a594f63 -> e33cc24a5


Updates Dataflow worker to 20170825


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/0eb8abc8
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/0eb8abc8
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/0eb8abc8

Branch: refs/heads/master
Commit: 0eb8abc8d132628835e6575371d0c0f22900c6ad
Parents: 79a594f
Author: Eugene Kirpichov 
Authored: Fri Aug 25 18:40:49 2017 -0700
Committer: Eugene Kirpichov 
Committed: Mon Aug 28 17:13:04 2017 -0700

--
 runners/google-cloud-dataflow-java/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/0eb8abc8/runners/google-cloud-dataflow-java/pom.xml
--
diff --git a/runners/google-cloud-dataflow-java/pom.xml 
b/runners/google-cloud-dataflow-java/pom.xml
index 46352fb..4d55209 100644
--- a/runners/google-cloud-dataflow-java/pom.xml
+++ b/runners/google-cloud-dataflow-java/pom.xml
@@ -33,7 +33,7 @@
   jar
 
   
-
beam-master-20170706
+
beam-master-20170825
 
1
 
6
   



[GitHub] beam pull request #3771: Updates Dataflow worker to 20170825

2017-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3771




[2/2] beam git commit: This closes #3771: Updates Dataflow worker to 20170825

2017-08-28 Thread jkff
This closes #3771: Updates Dataflow worker to 20170825


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/e33cc24a
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/e33cc24a
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/e33cc24a

Branch: refs/heads/master
Commit: e33cc24a577d049416451266f398e6b22f5c7058
Parents: 79a594f 0eb8abc
Author: Eugene Kirpichov 
Authored: Mon Aug 28 17:13:17 2017 -0700
Committer: Eugene Kirpichov 
Committed: Mon Aug 28 17:13:17 2017 -0700

--
 runners/google-cloud-dataflow-java/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--




[jira] [Commented] (BEAM-2816) Create SplittableDoFnTester

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144576#comment-16144576
 ] 

ASF GitHub Bot commented on BEAM-2816:
--

GitHub user jkff opened a pull request:

https://github.com/apache/beam/pull/3778

[BEAM-2816] Introduces SplittableDoFnTester

Not ready for review yet, though the approach is likely to remain the same. 
Will try to use this to test `Watch`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkff/incubator-beam sdf-testing

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3778.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3778


commit 6df3ddad532463a03dc63e8ba7198e084dede6fe
Author: Eugene Kirpichov 
Date:   2017-08-29T00:08:52Z

[BEAM-2816] Introduces SplittableDoFnTester




> Create SplittableDoFnTester
> ---
>
> Key: BEAM-2816
> URL: https://issues.apache.org/jira/browse/BEAM-2816
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> Should have something like SourceTestUtils, but for Splittable DoFn.
> Note: SourceTestUtils is focused only on bounded sources, but with SDF we
> cannot assume boundedness.





[GitHub] beam pull request #3775: Use the same termination logic in different places ...

2017-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3775




[1/2] beam git commit: Use the same termination logic in different places

2017-08-28 Thread altay
Repository: beam
Updated Branches:
  refs/heads/master bbca4f741 -> 79a594f63


Use the same termination logic in different places


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/16f11326
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/16f11326
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/16f11326

Branch: refs/heads/master
Commit: 16f11326b875cc6598123f17135f3908e0acf0cb
Parents: bbca4f7
Author: Ahmet Altay 
Authored: Mon Aug 28 13:32:32 2017 -0700
Committer: Ahmet Altay 
Committed: Mon Aug 28 17:06:44 2017 -0700

--
 sdks/python/apache_beam/runners/dataflow/dataflow_runner.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/16f11326/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py
--
diff --git a/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py 
b/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py
index 813759e..2b52f78 100644
--- a/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py
+++ b/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py
@@ -949,7 +949,9 @@ class DataflowPipelineResult(PipelineResult):
   while thread.isAlive():
 time.sleep(5.0)
 
-  terminated = self._is_in_terminal_state()
+  # TODO: Merge the termination code in poll_for_job_completion and
+  # _is_in_terminal_state.
+  terminated = (str(self._job.currentState) != 'JOB_STATE_RUNNING')
   assert duration or terminated, (
   'Job did not reach to a terminal state after waiting indefinitely.')
 



[2/2] beam git commit: This closes #3775

2017-08-28 Thread altay
This closes #3775


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/79a594f6
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/79a594f6
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/79a594f6

Branch: refs/heads/master
Commit: 79a594f638c6016836c24e6693a62b2074a78c67
Parents: bbca4f7 16f1132
Author: Ahmet Altay 
Authored: Mon Aug 28 17:06:47 2017 -0700
Committer: Ahmet Altay 
Committed: Mon Aug 28 17:06:47 2017 -0700

--
 sdks/python/apache_beam/runners/dataflow/dataflow_runner.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--




[jira] [Commented] (BEAM-2612) support variance builtin aggregation function

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144567#comment-16144567
 ] 

ASF GitHub Bot commented on BEAM-2612:
--

GitHub user vectorijk reopened a pull request:

https://github.com/apache/beam/pull/3568

[BEAM-2612] Variance builtin aggregation functions

implement var_pop, var_samp builtin aggregation functions


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vectorijk/beam beam-2612-variance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3568.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3568


commit 8fbb9006dd479b42a0798fe1c1eaf0e509649bc2
Author: Kai Jiang 
Date:   2017-08-16T23:13:26Z

var_pop and var_samp

two builtin aggregation functions

commit 53612b5f9e256248212c5d743ea5410a8eeffd5f
Author: Kai Jiang 
Date:   2017-08-17T00:26:04Z

fix checkstyle

commit 1256f63b2686c0e7d1d0c2a5863945cff9c79fea
Author: Kai Jiang 
Date:   2017-08-28T20:06:36Z

address comments

change type of sum in class VarAgg to BigDecimal
move isSamp field to Var
rename VarPop -> Var to make more generic
move logic to prepareOutput() both for Avg and Var
set MathContext to handle potential exception with BigDecimal divide.

commit 27b84cb4702e9d8960d1bfc5b59e48d8e5f446dd
Author: Kai Jiang 
Date:   2017-08-28T20:13:38Z

newlines

commit 7db3f490aeeb34f01b9064e274feacacbfa67092
Author: Kai Jiang 
Date:   2017-08-28T20:18:09Z

Merge branch 'DSL_SQL' into beam-2612-variance

commit fb0b05c163f6de2760c13af05abd9b41d9b57de9
Author: Kai Jiang 
Date:   2017-08-28T23:36:47Z

rebase issue




> support variance builtin aggregation function
> -
>
> Key: BEAM-2612
> URL: https://issues.apache.org/jira/browse/BEAM-2612
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> two builtin aggregate functions
> VAR_POP
> the population variance (square of the population standard deviation)
> VAR_SAMP
> the sample variance (square of the sample standard deviation)
> https://calcite.apache.org/docs/reference.html#aggregate-functions
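For reference, with n observations and sample mean \bar{x}, the two estimators 
quoted above are:
{code}
VAR_POP(x)  = \frac{1}{n}   \sum_{i=1}^{n} (x_i - \bar{x})^2
VAR_SAMP(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
{code}
A one-pass combiner can compute either from (count, sum, sum of squares); a 
minimal Beam sketch in Python (illustrative only, not the Java DSL SQL 
implementation under review):
{code}
# Sketch: one-pass variance as a Beam CombineFn, tracking count, sum and
# sum of squares. Illustrative; the PR under review is the Java DSL SQL code.
import apache_beam as beam

class VarianceFn(beam.CombineFn):
    def __init__(self, sample=True):
        self.sample = sample  # True -> VAR_SAMP, False -> VAR_POP

    def create_accumulator(self):
        return (0, 0.0, 0.0)  # count, sum, sum of squares

    def add_input(self, acc, x):
        n, s, ss = acc
        return n + 1, s + x, ss + x * x

    def merge_accumulators(self, accs):
        counts, sums, sumsqs = zip(*accs)
        return sum(counts), sum(sums), sum(sumsqs)

    def extract_output(self, acc):
        n, s, ss = acc
        d = n - 1 if self.sample else n
        if d <= 0:
            return float('nan')  # undefined for too few inputs
        return (ss - s * s / n) / d
{code}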





[jira] [Commented] (BEAM-2612) support variance builtin aggregation function

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144566#comment-16144566
 ] 

ASF GitHub Bot commented on BEAM-2612:
--

Github user vectorijk closed the pull request at:

https://github.com/apache/beam/pull/3568


> support variance builtin aggregation function
> -
>
> Key: BEAM-2612
> URL: https://issues.apache.org/jira/browse/BEAM-2612
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> two builtin aggregate functions
> VAR_POP
> the population variance (square of the population standard deviation)
> VAR_SAMP
> the sample variance (square of the sample standard deviation)
> https://calcite.apache.org/docs/reference.html#aggregate-functions





[GitHub] beam pull request #3568: [BEAM-2612] Variance builtin aggregation functions

2017-08-28 Thread vectorijk
GitHub user vectorijk reopened a pull request:

https://github.com/apache/beam/pull/3568

[BEAM-2612] Variance builtin aggregation functions

implement var_pop, var_samp builtin aggregation functions


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vectorijk/beam beam-2612-variance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3568.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3568


commit 8fbb9006dd479b42a0798fe1c1eaf0e509649bc2
Author: Kai Jiang 
Date:   2017-08-16T23:13:26Z

var_pop and var_samp

two builtin aggregation functions

commit 53612b5f9e256248212c5d743ea5410a8eeffd5f
Author: Kai Jiang 
Date:   2017-08-17T00:26:04Z

fix checkstyle

commit 1256f63b2686c0e7d1d0c2a5863945cff9c79fea
Author: Kai Jiang 
Date:   2017-08-28T20:06:36Z

address comments

change type of sum in class VarAgg to BigDecimal
move isSamp field to Var
rename VarPop -> Var to make more generic
move logic to prepareOutput() both for Avg and Var
set MathContext to handle potential exception with BigDecimal divide.

commit 27b84cb4702e9d8960d1bfc5b59e48d8e5f446dd
Author: Kai Jiang 
Date:   2017-08-28T20:13:38Z

newlines

commit 7db3f490aeeb34f01b9064e274feacacbfa67092
Author: Kai Jiang 
Date:   2017-08-28T20:18:09Z

Merge branch 'DSL_SQL' into beam-2612-variance

commit fb0b05c163f6de2760c13af05abd9b41d9b57de9
Author: Kai Jiang 
Date:   2017-08-28T23:36:47Z

rebase issue






[GitHub] beam pull request #3568: [BEAM-2612] Variance builtin aggregation functions

2017-08-28 Thread vectorijk
Github user vectorijk closed the pull request at:

https://github.com/apache/beam/pull/3568




[jira] [Commented] (BEAM-2814) test_as_singleton_with_different_defaults test is flaky

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144554#comment-16144554
 ] 

ASF GitHub Bot commented on BEAM-2814:
--

GitHub user aaltay opened a pull request:

https://github.com/apache/beam/pull/3777

[BEAM-2814] Update error message in AsSingleton

R: @robertwb 

The updated message would be useful for debugging the issue in BEAM-2814.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aaltay/beam sit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3777.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3777


commit 306a1dc4a1e072a65c739e179b5e1017e2069709
Author: Ahmet Altay 
Date:   2017-08-28T22:34:26Z

Add a log message to ValueError in AsSingleton




> test_as_singleton_with_different_defaults test is flaky
> ---
>
> Key: BEAM-2814
> URL: https://issues.apache.org/jira/browse/BEAM-2814
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Priority: Critical
>
> {{test_as_singleton_with_different_defaults}} is flaky and failed in the post 
> commit test 3013, but there is no related change to trigger this.
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3013/consoleFull
> (https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-08-28_11_08_56-17324181904913254210?project=apache-beam-testing)
> Dataflow error from the console:
>   (b4d390f9f9e033b4): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 582, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File "apache_beam/runners/worker/operations.py", line 294, in 
> apache_beam.runners.worker.operations.DoOperation.start 
> (apache_beam/runners/worker/operations.c:10607)
> def start(self):
>   File "apache_beam/runners/worker/operations.py", line 295, in 
> apache_beam.runners.worker.operations.DoOperation.start 
> (apache_beam/runners/worker/operations.c:10501)
> with self.scoped_start_state:
>   File "apache_beam/runners/worker/operations.py", line 323, in 
> apache_beam.runners.worker.operations.DoOperation.start 
> (apache_beam/runners/worker/operations.c:10322)
> self.dofn_runner = common.DoFnRunner(
>   File "apache_beam/runners/common.py", line 378, in 
> apache_beam.runners.common.DoFnRunner.__init__ 
> (apache_beam/runners/common.c:10018)
> self.do_fn_invoker = DoFnInvoker.create_invoker(
>   File "apache_beam/runners/common.py", line 154, in 
> apache_beam.runners.common.DoFnInvoker.create_invoker 
> (apache_beam/runners/common.c:5212)
> return PerWindowInvoker(
>   File "apache_beam/runners/common.py", line 219, in 
> apache_beam.runners.common.PerWindowInvoker.__init__ 
> (apache_beam/runners/common.c:7109)
> input_args, input_kwargs, [si[global_window] for si in side_inputs])
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/sideinputs.py",
>  line 63, in __getitem__
> _FilteringIterable(self._iterable, target_window), self._view_options)
>   File "/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.py", line 
> 332, in _from_runtime_iterable
> 'PCollection with more than one element accessed as '
> ValueError: PCollection with more than one element accessed as a singleton 
> view.
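The failure mode itself is easy to reproduce on purpose; a minimal sketch 
(illustrative only, not the flaky test from the report):
{code}
# Sketch: deliberately triggering the ValueError quoted above.
import apache_beam as beam

with beam.Pipeline() as p:
    main = p | 'Main' >> beam.Create([1, 2, 3])
    side = p | 'Side' >> beam.Create([10, 20])  # two elements, not a singleton
    # Materializing `side` as a singleton view fails at runtime with:
    # ValueError: PCollection with more than one element accessed as a
    # singleton view.
    _ = main | beam.Map(lambda x, s: x + s, s=beam.pvalue.AsSingleton(side))
{code}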





[GitHub] beam pull request #3768: Fix beam_job_api to conform to proto naming convent...

2017-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3768




[2/2] beam git commit: Fix beam_job_api to conform to proto naming conventions.

2017-08-28 Thread lcwik
Fix beam_job_api to conform to proto naming conventions.

This closes #3768


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/bbca4f74
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/bbca4f74
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/bbca4f74

Branch: refs/heads/master
Commit: bbca4f7418a7a47f9a59a858c534ac4f2ec1d825
Parents: aef89de 4764883
Author: Luke Cwik 
Authored: Mon Aug 28 16:39:55 2017 -0700
Committer: Luke Cwik 
Committed: Mon Aug 28 16:39:55 2017 -0700

--
 .../src/main/proto/beam_job_api.proto   | 38 ++--
 1 file changed, 19 insertions(+), 19 deletions(-)
--




[1/2] beam git commit: Fix beam_job_api to conform to proto naming conventions.

2017-08-28 Thread lcwik
Repository: beam
Updated Branches:
  refs/heads/master aef89dea8 -> bbca4f741


Fix beam_job_api to conform to proto naming conventions.


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/4764883a
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/4764883a
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/4764883a

Branch: refs/heads/master
Commit: 4764883a19174f0887ec091aa109e36881e102bf
Parents: aef89de
Author: Robert Bradshaw 
Authored: Mon Aug 28 10:10:35 2017 -0700
Committer: Robert Bradshaw 
Committed: Mon Aug 28 10:10:35 2017 -0700

--
 .../src/main/proto/beam_job_api.proto   | 38 ++--
 1 file changed, 19 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/4764883a/sdks/common/runner-api/src/main/proto/beam_job_api.proto
--
diff --git a/sdks/common/runner-api/src/main/proto/beam_job_api.proto 
b/sdks/common/runner-api/src/main/proto/beam_job_api.proto
index 8946d2a..5fa02ba 100644
--- a/sdks/common/runner-api/src/main/proto/beam_job_api.proto
+++ b/sdks/common/runner-api/src/main/proto/beam_job_api.proto
@@ -36,22 +36,22 @@ import "google/protobuf/struct.proto";
 service JobService {
   // Prepare a job for execution. The job will not be executed until a call is 
made to run with the
   // returned preparationId.
-  rpc prepare (PrepareJobRequest) returns (PrepareJobResponse);
+  rpc Prepare (PrepareJobRequest) returns (PrepareJobResponse);
 
   // Submit the job for execution
-  rpc run (RunJobRequest) returns (RunJobResponse);
+  rpc Run (RunJobRequest) returns (RunJobResponse);
 
   // Get the current state of the job
-  rpc getState (GetJobStateRequest) returns (GetJobStateResponse);
+  rpc GetState (GetJobStateRequest) returns (GetJobStateResponse);
 
   // Cancel the job
-  rpc cancel (CancelJobRequest) returns (CancelJobResponse);
+  rpc Cancel (CancelJobRequest) returns (CancelJobResponse);
 
   // Subscribe to a stream of state changes of the job, will immediately 
return the current state of the job as the first response.
-  rpc getStateStream (GetJobStateRequest) returns (stream GetJobStateResponse);
+  rpc GetStateStream (GetJobStateRequest) returns (stream GetJobStateResponse);
 
   // Subscribe to a stream of state changes and messages from the job
-  rpc getMessageStream (JobMessagesRequest) returns (stream 
JobMessagesResponse);
+  rpc GetMessageStream (JobMessagesRequest) returns (stream 
JobMessagesResponse);
 }
 
 
@@ -61,14 +61,14 @@ service JobService {
 // Throws error UNKNOWN for all other issues
 message PrepareJobRequest {
   org.apache.beam.runner_api.v1.Pipeline pipeline = 1; // (required)
-  google.protobuf.Struct pipelineOptions = 2; // (required)
-  string jobName = 3;  // (required)
+  google.protobuf.Struct pipeline_options = 2; // (required)
+  string job_name = 3;  // (required)
 }
 
 message PrepareJobResponse {
   // (required) The ID used to associate calls made while preparing the job. 
preparationId is used
   // to run the job, as well as in other pre-execution APIs such as Artifact 
staging.
-  string preparationId = 1;
+  string preparation_id = 1;
 }
 
 
@@ -79,15 +79,15 @@ message PrepareJobResponse {
 message RunJobRequest {
   // (required) The ID provided by an earlier call to prepare. Runs the job. 
All prerequisite tasks
   // must have been completed.
-  string preparationId = 1;
+  string preparation_id = 1;
   // (optional) If any artifacts have been staged for this job, contains the 
staging_token returned
   // from the CommitManifestResponse.
-  string stagingToken = 2;
+  string staging_token = 2;
 }
 
 
 message RunJobResponse {
-  string jobId = 1; // (required) The ID for the executing job
+  string job_id = 1; // (required) The ID for the executing job
 }
 
 
@@ -95,7 +95,7 @@ message RunJobResponse {
 // Throws error GRPC_STATUS_UNAVAILABLE if server is down
 // Throws error NOT_FOUND if the jobId is not found
 message CancelJobRequest {
-  string jobId = 1; // (required)
+  string job_id = 1; // (required)
 
 }
 
@@ -109,7 +109,7 @@ message CancelJobResponse {
 // Throws error GRPC_STATUS_UNAVAILABLE if server is down
 // Throws error NOT_FOUND if the jobId is not found
 message GetJobStateRequest {
-  string jobId = 1; // (required)
+  string job_id = 1; // (required)
 
 }
 
@@ -123,15 +123,15 @@ message GetJobStateResponse {
 // and job messages back; one is used for logging and the other for detecting
 // the job ended.
 message JobMessagesRequest {
-  string jobId = 1; // (required)
+  string job_id = 1; // (required)
 
 }
 
 message JobMessage {
-  string messageId = 1;
+  string message_id = 1;
   string time = 2;
   MessageImportance importance = 3;
-  string 

[jira] [Resolved] (BEAM-2813) error: option --test-pipeline-options not recognized

2017-08-28 Thread Ahmet Altay (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Altay resolved BEAM-2813.
---
   Resolution: Won't Fix
Fix Version/s: Not applicable

> error: option --test-pipeline-options not recognized
> 
>
> Key: BEAM-2813
> URL: https://issues.apache.org/jira/browse/BEAM-2813
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Mark Liu
> Fix For: Not applicable
>
>
> Python post commits 3004 to 3008 (all 5) failed with this error, but it was 
> somehow fixed in 3009. Mark, do you know what might be causing this?
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3004/
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3009/
> The error is:
> # Run ValidatesRunner tests on Google Cloud Dataflow service
> echo ">>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS"
> >>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS
> python setup.py nosetests \
>   --attr ValidatesRunner \
>   --nocapture \
>   --processes=4 \
>   --process-timeout=900 \
>   --test-pipeline-options=" \
> --runner=TestDataflowRunner \
> --project=$PROJECT \
> --staging_location=$GCS_LOCATION/staging-validatesrunner-test \
> --temp_location=$GCS_LOCATION/temp-validatesrunner-test \
> --sdk_location=$SDK_LOCATION \
> --requirements_file=postcommit_requirements.txt \
> --num_workers=1"
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/local/lib/python2.7/site-packages/setuptools/dist.py:341:
>  UserWarning: Normalizing '2.2.0.dev' to '2.2.0.dev0'
>   normalized_version,
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/.eggs/nose-1.3.7-py2.7.egg/nose/plugins/manager.py:395:
>  RuntimeWarning: Unable to load plugin beam_test_plugin = 
> test_config:BeamTestPlugin: (pyasn1 0.3.3 
> (/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
>  Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
>   RuntimeWarning)
> usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>or: setup.py --help [cmd1 cmd2 ...]
>or: setup.py --help-commands
>or: setup.py cmd --help
> error: option --test-pipeline-options not recognized





[jira] [Commented] (BEAM-2813) error: option --test-pipeline-options not recognized

2017-08-28 Thread Ahmet Altay (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144545#comment-16144545
 ] 

Ahmet Altay commented on BEAM-2813:
---

I think printing potential conflicts is a good idea. That could be a separate 
JIRA issue for improvement.

Based on your comment, it sounds like there is nothing we can do to address 
this specific problem. I will close this issue.

> error: option --test-pipeline-options not recognized
> 
>
> Key: BEAM-2813
> URL: https://issues.apache.org/jira/browse/BEAM-2813
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Mark Liu
>
> Python post commits 3004 to 3008 (all 5) failed with this error, but it was 
> somehow fixed in 3009. Mark, do you know what might be causing this?
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3004/
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3009/
> The error is:
> # Run ValidatesRunner tests on Google Cloud Dataflow service
> echo ">>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS"
> >>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS
> python setup.py nosetests \
>   --attr ValidatesRunner \
>   --nocapture \
>   --processes=4 \
>   --process-timeout=900 \
>   --test-pipeline-options=" \
> --runner=TestDataflowRunner \
> --project=$PROJECT \
> --staging_location=$GCS_LOCATION/staging-validatesrunner-test \
> --temp_location=$GCS_LOCATION/temp-validatesrunner-test \
> --sdk_location=$SDK_LOCATION \
> --requirements_file=postcommit_requirements.txt \
> --num_workers=1"
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/local/lib/python2.7/site-packages/setuptools/dist.py:341:
>  UserWarning: Normalizing '2.2.0.dev' to '2.2.0.dev0'
>   normalized_version,
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/.eggs/nose-1.3.7-py2.7.egg/nose/plugins/manager.py:395:
>  RuntimeWarning: Unable to load plugin beam_test_plugin = 
> test_config:BeamTestPlugin: (pyasn1 0.3.3 
> (/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
>  Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
>   RuntimeWarning)
> usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>or: setup.py --help [cmd1 cmd2 ...]
>or: setup.py --help-commands
>or: setup.py cmd --help
> error: option --test-pipeline-options not recognized





[jira] [Commented] (BEAM-2813) error: option --test-pipeline-options not recognized

2017-08-28 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144523#comment-16144523
 ] 

Mark Liu commented on BEAM-2813:


Not sure if pinning a dependency will work, since neither pyasn1 nor 
pyasn1-modules is directly required by Beam. It also looks like nose checks 
each installed package and its dependencies' versions before loading a 
customized plugin.

However, we can use a tool like 
[pipdeptree|https://pypi.python.org/pypi/pipdeptree] to show and verify the 
dependency tree separately; it can print potential conflicts to the console. 

> error: option --test-pipeline-options not recognized
> 
>
> Key: BEAM-2813
> URL: https://issues.apache.org/jira/browse/BEAM-2813
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Mark Liu
>
> Python post commits 3004 to 3008 (all 5) failed with this error, but it was 
> somehow fixed in 3009. Mark, do you know what might be causing this?
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3004/
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3009/
> The error is:
> # Run ValidatesRunner tests on Google Cloud Dataflow service
> echo ">>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS"
> >>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS
> python setup.py nosetests \
>   --attr ValidatesRunner \
>   --nocapture \
>   --processes=4 \
>   --process-timeout=900 \
>   --test-pipeline-options=" \
> --runner=TestDataflowRunner \
> --project=$PROJECT \
> --staging_location=$GCS_LOCATION/staging-validatesrunner-test \
> --temp_location=$GCS_LOCATION/temp-validatesrunner-test \
> --sdk_location=$SDK_LOCATION \
> --requirements_file=postcommit_requirements.txt \
> --num_workers=1"
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/local/lib/python2.7/site-packages/setuptools/dist.py:341:
>  UserWarning: Normalizing '2.2.0.dev' to '2.2.0.dev0'
>   normalized_version,
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/.eggs/nose-1.3.7-py2.7.egg/nose/plugins/manager.py:395:
>  RuntimeWarning: Unable to load plugin beam_test_plugin = 
> test_config:BeamTestPlugin: (pyasn1 0.3.3 
> (/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
>  Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
>   RuntimeWarning)
> usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>or: setup.py --help [cmd1 cmd2 ...]
>or: setup.py --help-commands
>or: setup.py cmd --help
> error: option --test-pipeline-options not recognized





[jira] [Commented] (BEAM-2813) error: option --test-pipeline-options not recognized

2017-08-28 Thread Ahmet Altay (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144502#comment-16144502
 ] 

Ahmet Altay commented on BEAM-2813:
---

Sounds good. Can we prevent this problem in the future? (e.g. by pinning a 
dependency?)

> error: option --test-pipeline-options not recognized
> 
>
> Key: BEAM-2813
> URL: https://issues.apache.org/jira/browse/BEAM-2813
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Mark Liu
>
> Python post commits 3004 to 3008 (all 5) failed with this error, but somehow 
> fixed in 3009. Mark do you know what might be causing this?
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3004/
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3009/
> The error is:
> # Run ValidatesRunner tests on Google Cloud Dataflow service
> echo ">>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS"
> >>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS
> python setup.py nosetests \
>   --attr ValidatesRunner \
>   --nocapture \
>   --processes=4 \
>   --process-timeout=900 \
>   --test-pipeline-options=" \
> --runner=TestDataflowRunner \
> --project=$PROJECT \
> --staging_location=$GCS_LOCATION/staging-validatesrunner-test \
> --temp_location=$GCS_LOCATION/temp-validatesrunner-test \
> --sdk_location=$SDK_LOCATION \
> --requirements_file=postcommit_requirements.txt \
> --num_workers=1"
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/local/lib/python2.7/site-packages/setuptools/dist.py:341:
>  UserWarning: Normalizing '2.2.0.dev' to '2.2.0.dev0'
>   normalized_version,
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/.eggs/nose-1.3.7-py2.7.egg/nose/plugins/manager.py:395:
>  RuntimeWarning: Unable to load plugin beam_test_plugin = 
> test_config:BeamTestPlugin: (pyasn1 0.3.3 
> (/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
>  Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
>   RuntimeWarning)
> usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>or: setup.py --help [cmd1 cmd2 ...]
>or: setup.py --help-commands
>or: setup.py cmd --help
> error: option --test-pipeline-options not recognized





[jira] [Comment Edited] (BEAM-2813) error: option --test-pipeline-options not recognized

2017-08-28 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144500#comment-16144500
 ] 

Mark Liu edited comment on BEAM-2813 at 8/28/17 10:58 PM:
--

I think this is a version mismatch in dependencies. This line is likely related 
to the actual error:
{code}
/nose/plugins/manager.py:395: RuntimeWarning: Unable to load plugin 
beam_test_plugin = test_config:BeamTestPlugin: (pyasn1 0.3.3 
(/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
 Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
RuntimeWarning)
{code}

Both pyasn1 and pyasn1-modules are indirect dependencies used in many GCP 
packages. Two days ago, pyasn1 was updated to 0.3.3; however, pyasn1-modules 
(0.0.11) pins pyasn1 to exactly 0.3.2. I guess that when nose loaded the 
customized plugin, it also examined the existing packages and found the 
version conflict.

Yesterday, pyasn1-modules was updated to 
[0.1.1|https://pypi.python.org/pypi/pyasn1-modules], which fixes the problem by 
pointing to the latest pyasn1 (0.3.3), so Jenkins started passing.


was (Author: markflyhigh):
I think this is a version mismatch in dependencies. This line is likely related 
to the actual error:
{code}
/nose/plugins/manager.py:395: RuntimeWarning: Unable to load plugin 
beam_test_plugin = test_config:BeamTestPlugin: (pyasn1 0.3.3 
(/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
 Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
RuntimeWarning)
{code}

Both pyasn1 and pyasn1-modules are indirect dependencies of many GCP packages. 
Two days ago, pyasn1 was updated to 0.3.3; however, pyasn1-modules (0.0.11) 
pins pyasn1 to exactly 0.3.2. I guess that when nose loaded the customized 
plugin, it also examined the existing packages and found the version conflict.

Yesterday, pyasn1-modules was updated to 
[0.1.1|https://pypi.python.org/pypi/pyasn1-modules], which fixes the problem by 
pointing to the latest pyasn1 (0.3.3), so Jenkins started passing.

> error: option --test-pipeline-options not recognized
> 
>
> Key: BEAM-2813
> URL: https://issues.apache.org/jira/browse/BEAM-2813
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Mark Liu
>
> Python post commits 3004 to 3008 (all 5) failed with this error, but it was 
> somehow fixed in 3009. Mark, do you know what might be causing this?
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3004/
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3009/
> The error is:
> # Run ValidatesRunner tests on Google Cloud Dataflow service
> echo ">>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS"
> >>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS
> python setup.py nosetests \
>   --attr ValidatesRunner \
>   --nocapture \
>   --processes=4 \
>   --process-timeout=900 \
>   --test-pipeline-options=" \
> --runner=TestDataflowRunner \
> --project=$PROJECT \
> --staging_location=$GCS_LOCATION/staging-validatesrunner-test \
> --temp_location=$GCS_LOCATION/temp-validatesrunner-test \
> --sdk_location=$SDK_LOCATION \
> --requirements_file=postcommit_requirements.txt \
> --num_workers=1"
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/local/lib/python2.7/site-packages/setuptools/dist.py:341:
>  UserWarning: Normalizing '2.2.0.dev' to '2.2.0.dev0'
>   normalized_version,
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/.eggs/nose-1.3.7-py2.7.egg/nose/plugins/manager.py:395:
>  RuntimeWarning: Unable to load plugin beam_test_plugin = 
> test_config:BeamTestPlugin: (pyasn1 0.3.3 
> (/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
>  Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
>   RuntimeWarning)
> usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>or: setup.py --help [cmd1 cmd2 ...]
>or: setup.py --help-commands
>or: setup.py cmd --help
> error: option --test-pipeline-options not recognized





[jira] [Commented] (BEAM-2813) error: option --test-pipeline-options not recognized

2017-08-28 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144500#comment-16144500
 ] 

Mark Liu commented on BEAM-2813:


I think this is a version mismatch in dependencies. This line is likely related 
to the actual error:
{code}
/nose/plugins/manager.py:395: RuntimeWarning: Unable to load plugin 
beam_test_plugin = test_config:BeamTestPlugin: (pyasn1 0.3.3 
(/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
 Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
RuntimeWarning)
{code}

Both pyasn1 and pyasn1-modules are indirect dependencies of many GCP packages. 
Two days ago, pyasn1 was updated to 0.3.3; however, pyasn1-modules (0.0.11) 
pins pyasn1 to exactly 0.3.2. I guess that when nose loaded the customized 
plugin, it also examined the existing packages and found the version conflict.

Yesterday, pyasn1-modules was updated to 
[0.1.1|https://pypi.python.org/pypi/pyasn1-modules], which fixes the problem by 
pointing to the latest pyasn1 (0.3.3), so Jenkins started passing.
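The same conflict can be surfaced directly with setuptools' pkg_resources, the 
machinery nose uses to load entry-point plugins; a minimal sketch (the 
{{==0.3.2}} pin mirrors the requirement from the warning above):
{code}
# Sketch: reproducing the version-conflict check that breaks plugin loading.
# The '==0.3.2' pin mirrors Requirement.parse('pyasn1==0.3.2') in the warning.
import pkg_resources

try:
    pkg_resources.require('pyasn1==0.3.2')
except pkg_resources.VersionConflict as err:
    # With pyasn1 0.3.3 installed, this reports the conflicting distribution
    # and the unmet requirement.
    print(err)
{code}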

> error: option --test-pipeline-options not recognized
> 
>
> Key: BEAM-2813
> URL: https://issues.apache.org/jira/browse/BEAM-2813
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Mark Liu
>
> Python post commits 3004 to 3008 (all 5) failed with this error, but it was 
> somehow fixed in 3009. Mark, do you know what might be causing this?
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3004/
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3009/
> The error is:
> # Run ValidatesRunner tests on Google Cloud Dataflow service
> echo ">>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS"
> >>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS
> python setup.py nosetests \
>   --attr ValidatesRunner \
>   --nocapture \
>   --processes=4 \
>   --process-timeout=900 \
>   --test-pipeline-options=" \
> --runner=TestDataflowRunner \
> --project=$PROJECT \
> --staging_location=$GCS_LOCATION/staging-validatesrunner-test \
> --temp_location=$GCS_LOCATION/temp-validatesrunner-test \
> --sdk_location=$SDK_LOCATION \
> --requirements_file=postcommit_requirements.txt \
> --num_workers=1"
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/local/lib/python2.7/site-packages/setuptools/dist.py:341:
>  UserWarning: Normalizing '2.2.0.dev' to '2.2.0.dev0'
>   normalized_version,
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/.eggs/nose-1.3.7-py2.7.egg/nose/plugins/manager.py:395:
>  RuntimeWarning: Unable to load plugin beam_test_plugin = 
> test_config:BeamTestPlugin: (pyasn1 0.3.3 
> (/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
>  Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
>   RuntimeWarning)
> usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>or: setup.py --help [cmd1 cmd2 ...]
>or: setup.py --help-commands
>or: setup.py cmd --help
> error: option --test-pipeline-options not recognized





[GitHub] beam pull request #3776: Added snippet tags for documentation

2017-08-28 Thread davidcavazos
GitHub user davidcavazos opened a pull request:

https://github.com/apache/beam/pull/3776

Added snippet tags for documentation

Hi @aaltay,

Adding tags to extract code snippets for the documentation. These follow the 
same sections as in the Java version.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davidcavazos/beam examples_tags

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3776.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3776


commit 2a5004344718add0bd2c9b84972cf69909e18f96
Author: David Cavazos 
Date:   2017-08-28T22:24:20Z

Added snippet tags for documentation






[beam-site] branch asf-site updated: Add content for new post, missed by automatic deployment

2017-08-28 Thread kenn
This is an automated email from the ASF dual-hosted git repository.

kenn pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new f8e5099  Add content for new post, missed by automatic deployment
 new bf98022  This closes #306: Add content for new post, missed by 
automatic deployment
f8e5099 is described below

commit f8e509944a7cc0a67eebb66e81b94fcda7cff283
Author: Kenneth Knowles 
AuthorDate: Mon Aug 28 14:49:34 2017 -0700

Add content for new post, missed by automatic deployment
---
 content/blog/2017/08/28/timely-processing.html | 714 +
 .../blog/timely-processing/BatchedRpcExpiry.png| Bin 0 -> 43015 bytes
 .../blog/timely-processing/BatchedRpcStale.png | Bin 0 -> 51523 bytes
 .../blog/timely-processing/BatchedRpcState.png | Bin 0 -> 32633 bytes
 .../blog/timely-processing/CombinePerKey.png   | Bin 0 -> 31517 bytes
 content/images/blog/timely-processing/ParDo.png| Bin 0 -> 28247 bytes
 .../blog/timely-processing/StateAndTimers.png  | Bin 0 -> 21355 bytes
 .../images/blog/timely-processing/UnifiedModel.png | Bin 0 -> 39982 bytes
 .../blog/timely-processing/WindowingChoices.png| Bin 0 -> 20877 bytes
 9 files changed, 714 insertions(+)

diff --git a/content/blog/2017/08/28/timely-processing.html 
b/content/blog/2017/08/28/timely-processing.html
new file mode 100644
index 000..f793098
--- /dev/null
+++ b/content/blog/2017/08/28/timely-processing.html

[jira] [Resolved] (BEAM-541) Add more documentation on DoFn Annotations

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-541.
--
   Resolution: Duplicate
Fix Version/s: Not applicable

> Add more documentation on DoFn Annotations
> --
>
> Key: BEAM-541
> URL: https://issues.apache.org/jira/browse/BEAM-541
> Project: Beam
>  Issue Type: Wish
>  Components: website
>Reporter: Frances Perry
>Assignee: Kenneth Knowles
>Priority: Minor
> Fix For: Not applicable
>
>
> https://github.com/apache/incubator-beam-site/pull/36 made the basic 
> documentation changes that correspond to BEAM-498, but we should add more 
> details on how to use the advanced configurations for window access, etc.





[jira] [Commented] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

2017-08-28 Thread Peter Hausel (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144399#comment-16144399
 ] 

Peter Hausel commented on BEAM-2815:


[~altay] yes, if there is anything we can do from our end, we would be happy to
help. I looked at the experimental RPC runner just to get a sense of where the
`Runner` abstraction is going, but I would likely need more context.

> Python DirectRunner is unusable with input files in the 100-250MB range
> ---
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct, sdk-py
>Affects Versions: 2.1.0
> Environment: python 2.7.10, beam 2.1, os x 
>Reporter: Peter Hausel
>Assignee: Ahmet Altay
> Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 
> 2017-08-27 at 9.06.00 AM.png
>
>
> The current python DirectRunner implementation seems to be unusable with 
> training data sets that are bigger than tiny samples - making serious local 
> development impossible or very cumbersome. I am aware of some of the 
> limitations of the current DirectRunner implementation[1][2][3], however I 
> was not sure if this odd behavior is expected.
> [1][2][3]
> https://stackoverflow.com/a/44765621
> https://issues.apache.org/jira/browse/BEAM-1442
> https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015), and I had to terminate
> the process after 10 minutes or so (screenshots of the high memory and CPU
> utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
>  """Main entry point; defines and runs the pipeline."""
>  parser = argparse.ArgumentParser()
>  parser.add_argument('--input',
>   dest='input',
>   
> default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>   help='Input file to process.')
>  known_args, pipeline_args = parser.parse_known_args(argv)
>  pipeline_options = PipelineOptions(pipeline_args)
>  pipeline_options.view_as(SetupOptions).save_main_session = True
>  pipeline = beam.Pipeline(options=pipeline_options)
>  raw_data = (
>pipeline
>| 'ReadTrainData' >> textio.ReadFromText(known_args.input, 
> skip_header_lines=1)
>| 'Map' >> beam.Map(lambda line: line.lower())
>  )
>  result = pipeline.run()
>  result.wait_until_finish()
>  print(raw_data)
> if __name__ == '__main__':
>   run()
> {code}
> Example dataset:  
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> for comparison: 
> {code}
> lines = [line.lower() for line in 
> open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> this vanilla python script runs on the same hardware and dataset in 0m4.909s. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (BEAM-2333) Rehydrate Pipeline from Runner API proto

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-2333.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

I think, since the DirectRunner now does this, we can consider the first draft 
complete.

> Rehydrate Pipeline from Runner API proto
> 
>
> Key: BEAM-2333
> URL: https://issues.apache.org/jira/browse/BEAM-2333
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: beam-python-everywhere
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2787) IllegalArgumentException with MongoDbIO with empty PCollection

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-2787:
-

Assignee: (was: Kenneth Knowles)

> IllegalArgumentException with MongoDbIO with empty PCollection
> --
>
> Key: BEAM-2787
> URL: https://issues.apache.org/jira/browse/BEAM-2787
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Affects Versions: 2.0.0
>Reporter: Siddharth Mittal
>
> We read a file and created a PCollection of Strings where each record
> represents one line of the file. Thereafter we apply multiple PTransforms to
> validate the records. In the end the PCollection is filtered into two
> PCollections, one with all valid records and one with all invalid records.
> Both PCollections are then stored to their respective MongoDB collections. If
> either of these PCollections is empty, we see the exception trace below:
> 17/08/11 05:30:20 INFO SparkContext: Successfully stopped SparkContext
> Exception in thread "main" 
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.IllegalArgumentException: state should be: writes is not an empty 
> list
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
> at 
> com.hsbc.rsl.applications.DataTransformation.StoreRiskToL3(DataTransformation.java:106)
> at 
> com.hsbc.rsl.applications.DataTransformation.lambda$executeAndWaitUntilFinish$5(DataTransformation.java:55)
> at 
> com.hsbc.rsl.applications.DataTransformation$$Lambda$10/737852016.accept(Unknown
>  Source)
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
> at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
> at 
> java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
> at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
> at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
> at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
> at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
> at 
> com.hsbc.rsl.applications.DataTransformation.executeAndWaitUntilFinish(DataTransformation.java:51)
> at 
> com.hsbc.rsl.applications.TransformationProcessor.main(TransformationProcessor.java:23)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.IllegalArgumentException: state should be: writes is not 
> an empty list
> at 
> com.mongodb.assertions.Assertions.isTrueArgument(Assertions.java:99)
> at 
> com.mongodb.operation.MixedBulkWriteOperation.(MixedBulkWriteOperation.java:95)
> at 
> com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:323)
> at 
> com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:311)
> at 
> org.apache.beam.sdk.io.mongodb.MongoDbIO$Write$WriteFn.flush(MongoDbIO.java:513)
> at 
> org.apache.beam.sdk.io.mongodb.MongoDbIO$Write$WriteFn.finishBundle(MongoDbIO.java:506)
> 17/08/11 05:30:20 INFO RemoteActorRefProvider$RemotingTerminator: Remoting 
> shut down.
> There is no exception when we have at least one record in the PCollection.
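
For illustration, a minimal sketch of the kind of guard that avoids this
failure: skip the bulk write when the accumulated batch is empty, since
insertMany() rejects an empty list. The class and method names below are
hypothetical, not the actual MongoDbIO internals:

{code}
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import java.util.ArrayList;
import java.util.List;
import org.bson.Document;

// Hypothetical buffered writer modeled on MongoDbIO.Write.WriteFn.flush():
// an empty-batch check sidesteps the IllegalArgumentException above.
class GuardedMongoWriter {
  private final MongoClient client;
  private final String database;
  private final String collection;
  private final List<Document> batch = new ArrayList<>();

  GuardedMongoWriter(MongoClient client, String database, String collection) {
    this.client = client;
    this.database = database;
    this.collection = collection;
  }

  void buffer(Document doc) {
    batch.add(doc);
  }

  void flush() {
    if (batch.isEmpty()) {
      return; // nothing buffered; insertMany(empty list) would throw
    }
    MongoCollection<Document> coll =
        client.getDatabase(database).getCollection(collection);
    coll.insertMany(new ArrayList<>(batch));
    batch.clear();
  }
}
{code}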



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2787) IllegalArgumentException with MongoDbIO with empty PCollection

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-2787:
--
Component/s: (was: beam-model)
 sdk-java-extensions

> IllegalArgumentException with MongoDbIO with empty PCollection
> --
>
> Key: BEAM-2787
> URL: https://issues.apache.org/jira/browse/BEAM-2787
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Affects Versions: 2.0.0
>Reporter: Siddharth Mittal
>
> We read a file and created a PCollection of Strings where each record
> represents one line of the file. Thereafter we apply multiple PTransforms to
> validate the records. In the end the PCollection is filtered into two
> PCollections, one with all valid records and one with all invalid records.
> Both PCollections are then stored to their respective MongoDB collections. If
> either of these PCollections is empty, we see the exception trace below:
> 17/08/11 05:30:20 INFO SparkContext: Successfully stopped SparkContext
> Exception in thread "main" 
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.IllegalArgumentException: state should be: writes is not an empty 
> list
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
> at 
> com.hsbc.rsl.applications.DataTransformation.StoreRiskToL3(DataTransformation.java:106)
> at 
> com.hsbc.rsl.applications.DataTransformation.lambda$executeAndWaitUntilFinish$5(DataTransformation.java:55)
> at 
> com.hsbc.rsl.applications.DataTransformation$$Lambda$10/737852016.accept(Unknown
>  Source)
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
> at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
> at 
> java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
> at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
> at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
> at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
> at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
> at 
> com.hsbc.rsl.applications.DataTransformation.executeAndWaitUntilFinish(DataTransformation.java:51)
> at 
> com.hsbc.rsl.applications.TransformationProcessor.main(TransformationProcessor.java:23)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.IllegalArgumentException: state should be: writes is not 
> an empty list
> at 
> com.mongodb.assertions.Assertions.isTrueArgument(Assertions.java:99)
> at 
> com.mongodb.operation.MixedBulkWriteOperation.(MixedBulkWriteOperation.java:95)
> at 
> com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:323)
> at 
> com.mongodb.MongoCollectionImpl.insertMany(MongoCollectionImpl.java:311)
> at 
> org.apache.beam.sdk.io.mongodb.MongoDbIO$Write$WriteFn.flush(MongoDbIO.java:513)
> at 
> org.apache.beam.sdk.io.mongodb.MongoDbIO$Write$WriteFn.finishBundle(MongoDbIO.java:506)
> 17/08/11 05:30:20 INFO RemoteActorRefProvider$RemotingTerminator: Remoting 
> shut down.
> There is no exception when we have at least one record in the PCollection.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2705) DoFnTester supports for StateParameter

2017-08-28 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144391#comment-16144391
 ] 

Kenneth Knowles commented on BEAM-2705:
---

This is a pretty good idea - just unassigning it so someone else can work on it.

> DoFnTester supports for StateParameter
> --
>
> Key: BEAM-2705
> URL: https://issues.apache.org/jira/browse/BEAM-2705
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Affects Versions: 2.0.0
>Reporter: Yihua Eric Fang
>
> Today DoFnTester does not support StateParameters such as ValueState. I 
> didn't see an issue being created on JIRA, so filing this one.
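
For concreteness, this is the kind of stateful DoFn one would want to drive
through DoFnTester - a minimal sketch against the Beam 2.x state API, with
illustrative names:

{code}
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Counts elements per key in ValueState. DoFnTester currently has no way to
// supply the @StateId parameter, which is what this issue asks for.
class CountPerKeyFn extends DoFn<KV<String, Integer>, KV<String, Long>> {
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec =
      StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      ProcessContext c, @StateId("count") ValueState<Long> count) {
    Long current = count.read();
    long next = (current == null ? 0L : current) + 1;
    count.write(next);
    c.output(KV.of(c.element().getKey(), next));
  }
}
{code}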



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-1957) Missing DoFn annotations documentation

2017-08-28 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144390#comment-16144390
 ] 

Kenneth Knowles commented on BEAM-1957:
---

Yes, we also need to cover all the state and timers APIs. I think updates to 
the guide should be coordinated with [~melap] who might know what is already in 
the works.

> Missing DoFn annotations documentation
> --
>
> Key: BEAM-1957
> URL: https://issues.apache.org/jira/browse/BEAM-1957
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Aviem Zur
>
> Not all {{DoFn}} annotations are covered by the programming guide.
> Only {{@ProcessElement}} is currently covered.
> We should have documentation for the other (non-experimental, at least) 
> annotations:
> {code}
> public @interface Setup
> public @interface StartBundle
> public @interface FinishBundle
> public @interface Teardown
> {code}
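
As a starting point for that documentation, a minimal sketch of where each
lifecycle annotation fires; the class and method bodies are illustrative, not
guide text:

{code}
import org.apache.beam.sdk.transforms.DoFn;

class LifecycleFn extends DoFn<String, String> {
  private transient StringBuilder bundleBuffer;

  @Setup
  public void setup() {
    // Once per DoFn instance: acquire expensive resources (clients, caches).
  }

  @StartBundle
  public void startBundle() {
    bundleBuffer = new StringBuilder(); // reset per-bundle scratch state
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    bundleBuffer.append(c.element());
    c.output(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // Flush per-bundle work here, e.g. batched writes.
  }

  @Teardown
  public void teardown() {
    // Release resources acquired in @Setup; best effort, may be skipped on crash.
  }
}
{code}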



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-1957) Missing DoFn annotations documentation

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-1957:
-

Assignee: (was: Kenneth Knowles)

> Missing DoFn annotations documentation
> --
>
> Key: BEAM-1957
> URL: https://issues.apache.org/jira/browse/BEAM-1957
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Aviem Zur
>
> Not all {{DoFn}} annotations are covered by the programming guide.
> Only {{@ProcessElement}} is currently covered.
> We should have documentation for the other (non-experimental, at least) 
> annotations:
> {code}
> public @interface Setup
> public @interface StartBundle
> public @interface FinishBundle
> public @interface Teardown
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-1062) Shade SDK based on a whitelist instead of a blacklist

2017-08-28 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144387#comment-16144387
 ] 

Kenneth Knowles commented on BEAM-1062:
---

The PR is still out there, but I haven't had time to look at it in a while. 
Someone else could definitely pick this up.

> Shade SDK based on a whitelist instead of a blacklist
> -
>
> Key: BEAM-1062
> URL: https://issues.apache.org/jira/browse/BEAM-1062
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>
> This is a more robust way to manage the surface of dependencies we introduce.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2674) Runner API translators should own their rehydration

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-2674:
-

Assignee: (was: Kenneth Knowles)

> Runner API translators should own their rehydration
> ---
>
> Key: BEAM-2674
> URL: https://issues.apache.org/jira/browse/BEAM-2674
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Kenneth Knowles
>
> This allows, for example, a `ParDo` or `Combine` to have its additional
> inputs set up correctly when rehydrated, and `Combine` and others with
> auxiliary coders to register those coders appropriately. Currently these
> happen via somewhat hackish one-off methods for each purpose.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[beam-site] branch asf-site updated (6386ac2 -> 5bd80e6)

2017-08-28 Thread mergebot-role
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git.


from 6386ac2  Prepare repository for deployment.
 add 155c697  Add blog post with timely processing
 add 7a775c1  This closes #296
 new 5bd80e6  Prepare repository for deployment.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/blog/index.html|  37 ++
 content/feed.xml   | 565 +++--
 content/index.html |  10 +-
 src/_posts/2017-08-28-timely-processing.md | 517 +++
 .../blog/timely-processing/BatchedRpcExpiry.png| Bin 0 -> 43015 bytes
 .../blog/timely-processing/BatchedRpcStale.png | Bin 0 -> 51523 bytes
 .../blog/timely-processing/BatchedRpcState.png | Bin 0 -> 32633 bytes
 .../blog/timely-processing/CombinePerKey.png   | Bin 0 -> 31517 bytes
 src/images/blog/timely-processing/ParDo.png| Bin 0 -> 28247 bytes
 .../blog/timely-processing/StateAndTimers.png  | Bin 0 -> 21355 bytes
 src/images/blog/timely-processing/UnifiedModel.png | Bin 0 -> 39982 bytes
 .../blog/timely-processing/WindowingChoices.png| Bin 0 -> 20877 bytes
 12 files changed, 1073 insertions(+), 56 deletions(-)
 create mode 100644 src/_posts/2017-08-28-timely-processing.md
 create mode 100644 src/images/blog/timely-processing/BatchedRpcExpiry.png
 create mode 100644 src/images/blog/timely-processing/BatchedRpcStale.png
 create mode 100644 src/images/blog/timely-processing/BatchedRpcState.png
 create mode 100644 src/images/blog/timely-processing/CombinePerKey.png
 create mode 100644 src/images/blog/timely-processing/ParDo.png
 create mode 100644 src/images/blog/timely-processing/StateAndTimers.png
 create mode 100644 src/images/blog/timely-processing/UnifiedModel.png
 create mode 100644 src/images/blog/timely-processing/WindowingChoices.png

-- 
To stop receiving notification emails like this one, please contact
['"commits@beam.apache.org" '].


[beam-site] 01/01: Prepare repository for deployment.

2017-08-28 Thread mergebot-role
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 5bd80e6ef06e2f0c9920b9477481307c27499361
Author: Mergebot 
AuthorDate: Mon Aug 28 21:38:09 2017 +

Prepare repository for deployment.
---
 content/blog/index.html |  37 
 content/feed.xml| 565 +++-
 content/index.html  |  10 +-
 3 files changed, 556 insertions(+), 56 deletions(-)

diff --git a/content/blog/index.html b/content/blog/index.html
index ab0127d..8d9fc46 100644
--- a/content/blog/index.html
+++ b/content/blog/index.html
@@ -146,6 +146,43 @@
 This is the blog for the Apache Beam project. This blog contains news and 
updates
 for the project.
 
+Timely (and 
Stateful) Processing with Apache Beam
+Aug 28, 2017 •  Kenneth Knowles [<a href="https://twitter.com/KennKnowles">@KennKnowles</a>]
+
+
+In a prior blog
+post, I
+introduced the basics of stateful processing in Apache Beam, focusing on the
+addition of state to per-element processing. So-called timely 
processing
+complements stateful processing in Beam by letting you set timers to request a
+(stateful) callback at some point in the future.
+
+What can you do with timers in Beam? Here are some examples:
+
+
+  You can output data buffered in state after some amount of processing 
time.
+  You can take special action when the watermark estimates that you have
+received all data up to a specified point in event time.
+  You can author workflows with timeouts that alter state and emit output 
in
+response to the absence of additional input for some period of time.
+
+
+These are just a few possibilities. State and timers together form a 
powerful
+programming paradigm for fine-grained control to express a huge variety of
+workflows.  Stateful and timely processing in Beam is portable across data
+processing engines and integrated with Beam’s unified model of event time
+windowing in both streaming and batch processing.
+
+
+
+
+
+Read more
+
+
+
+
+
 Powerful and 
modular IO connectors with Splittable DoFn in Apache Beam
 Aug 16, 2017 •  Eugene Kirpichov 
 
diff --git a/content/feed.xml b/content/feed.xml
index 5e3ee2a..eb789b8 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -9,6 +9,520 @@
 Jekyll v3.2.0
 
   
+<title>Timely (and Stateful) Processing with Apache Beam</title>
+<p>In a <a href="/blog/2017/02/13/stateful-processing.html">prior blog
+post</a>, I
+introduced the basics of stateful processing in Apache Beam, focusing on the
+addition of state to per-element processing. So-called <em>timely</em> processing
+complements stateful processing in Beam by letting you set timers to request a
+(stateful) callback at some point in the future.</p>
+
+<p>What can you do with timers in Beam? Here are some examples:</p>
+
+<ul>
+  <li>You can output data buffered in state after some amount of processing time.</li>
+  <li>You can take special action when the watermark estimates that you have
+received all data up to a specified point in event time.</li>
+  <li>You can author workflows with timeouts that alter state and emit output in
+response to the absence of additional input for some period of time.</li>
+</ul>
+
+<p>These are just a few possibilities. State and timers together form a powerful
+programming paradigm for fine-grained control to express a huge variety of
+workflows.  Stateful and timely processing in Beam is portable across data
+processing engines and integrated with Beam’s unified model of event time
+windowing in both streaming and batch processing.</p>
+
+<!--more-->
+
+<h2 id="what-is-stateful-and-timely-processing">What is stateful and timely processing?</h2>
+
+<p>In my prior post, I developed an understanding of stateful processing largely
+by contrast with associative, commutative combiners. In this post, I’ll
+emphasize a perspective that I had mentioned only briefly: that elementwise
+processing with access to per-key-and-window state and timers represents a
+fundamental pattern for “embarrassingly parallel” computation, distinct from
+the others in Beam.</p>
+
+<p>In fact, stateful and timely computation is the low-level computational pattern
+that underlies the others. Precisely because it is lower level, it allows you
+to really micromanage your computations to unlock new use cases and new
+efficiencies. This incurs the complexity of manually managing your state and
+timers - it isn’t magic! Let’s first look again at the two primary
+computational patterns in Beam.</p>
+
+<h3 id="element-wise-processing-pardo-map-etc">Element-wise processing (ParDo, Map, etc)</h3>
+
+<p>The most elementary embarrassingly parallel pattern is just using a bunch of
+computers to apply the same function to every input element of a massive
+collection. In Beam, per-element processing like this is expressed as a basic
+<code class="highlighter-rouge">ParDo</code> - analogous to “Map” from MapReduce - which is like 

[jira] [Commented] (BEAM-2807) NullPointerException during checkpoint on FlinkRunner

2017-08-28 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144384#comment-16144384
 ] 

Kenneth Knowles commented on BEAM-2807:
---

I'll be somewhat less available for the next week, and in any case I'm
reassigning this to a better person for FlinkRunner questions.

> NullPointerException during checkpoint on FlinkRunner
> -
>
> Key: BEAM-2807
> URL: https://issues.apache.org/jira/browse/BEAM-2807
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 2.1.0
>Reporter: Daniel Harper
>Assignee: Aljoscha Krettek
>Priority: Blocker
>
> *Beam version:* 2.1.0
> *Runner:* FlinkRunner
> We're seeing the following exception when checkpointing, which is causing our 
> job to restart
> {code}
> 2017-08-25 09:42:17,658 INFO  org.apache.flink.runtime.taskmanager.Task   
>   - Combine.perKey(Count) -> ParMultiDo(ToConcurrentStreamResult) 
> -> ParMultiDo(Anonymous) -> ParMultiDo(ApplyShardingKey) -> ToKeyedWorkItem 
> (7/32) (f00a31b722a31030f18d83ac613de21d) switched from RUNNING to FAILED.
> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 2 
> for operator Combine.perKey(Count) -> ParMultiDo(ToConcurrentStreamResult) -> 
> ParMultiDo(Anonymous) -> ParMultiDo(ApplyShardingKey) -> ToKeyedWorkItem 
> (7/32).}
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:966)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.Exception: Could not materialize checkpoint 2 for 
> operator Combine.perKey(Count) -> ParMultiDo(ToConcurrentStreamResult) -> 
> ParMultiDo(Anonymous) -> ParMultiDo(ApplyShardingKey) -> ToKeyedWorkItem 
> (7/32).
> ... 6 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.NullPointerException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:893)
> ... 5 more
> Suppressed: java.lang.Exception: Could not properly cancel managed keyed 
> state future.
> at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1018)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:957)
> ... 5 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.NullPointerException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
> at 
> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
> at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
> ... 7 more
> Caused by: java.lang.NullPointerException
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
> at 
> org.apache.beam.runners.flink.translation.types.CoderTypeSerializer$CoderTypeSerializerConfigSnapshot.write(CoderTypeSerializer.java:189)
> at 
> org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil$TypeSerializerConfigSnapshotSerializationProxy.write(TypeSerializerSerializationUtil.java:413)
> at 
> org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil.writeSerializerConfigSnapshot(TypeSerializerSerializationUtil.java:229)
> at 
> org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil.writeSerializersAndConfigsWithResilience(TypeSerializerSerializationUtil.java:151)
> at 
> org.apache.flink.runtime.state.KeyedBackendStateMetaInfoSnapshotReaderWriters$KeyedBackendStateMetaInfoWriterV3.writeStateMetaInfo(KeyedBackendStateMetaInfoSnapshotReaderWriters.java:107)
> at 
> org.apache.flink.runtime.state.KeyedBackendSerializationProxy.write(KeyedBackendSerializationProxy.java:104)
> at 
> 
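
The innermost frames are reproducible in isolation: DataOutputStream.writeUTF
dereferences its argument, so a null string field reaching the serializer
config snapshot fails exactly this way. A standalone sketch of just that step
(plain Java, no Flink involved):

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfNpe {
  public static void main(String[] args) throws IOException {
    DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
    String value = null; // stand-in for a null field in the config snapshot
    out.writeUTF(value); // throws NullPointerException, as in the trace above
  }
}
{code}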

[jira] [Assigned] (BEAM-1062) Shade SDK based on a whitelist instead of a blacklist

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-1062:
-

Assignee: (was: Kenneth Knowles)

> Shade SDK based on a whitelist instead of a blacklist
> -
>
> Key: BEAM-1062
> URL: https://issues.apache.org/jira/browse/BEAM-1062
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>
> This is a more robust way to manage the surface of dependencies we introduce.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2758) ParDo should indicate what "features" are used in DisplayData

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-2758:
-

Assignee: (was: Kenneth Knowles)

> ParDo should indicate what "features" are used in DisplayData
> -
>
> Key: BEAM-2758
> URL: https://issues.apache.org/jira/browse/BEAM-2758
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Ben Chambers
>  Labels: newbie
>
> ParDo now exposes numerous features, such as SplittableDoFn, State, Timers,
> etc. It would be good if the specific features being used were readily
> visible within the Display Data of the given ParDo.
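
A minimal sketch of the requested behavior, assuming it hangs off the existing
populateDisplayData hook; the item keys ("usesState", "usesTimers") are made up
for illustration:

{code}
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.display.DisplayData;

class AnnotatedFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element());
  }

  @Override
  public void populateDisplayData(DisplayData.Builder builder) {
    super.populateDisplayData(builder);
    // Surface which DoFn features this transform uses.
    builder.add(DisplayData.item("usesState", true));
    builder.add(DisplayData.item("usesTimers", false));
  }
}
{code}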



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2807) NullPointerException during checkpoint on FlinkRunner

2017-08-28 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-2807:
-

Assignee: Aljoscha Krettek  (was: Kenneth Knowles)

> NullPointerException during checkpoint on FlinkRunner
> -
>
> Key: BEAM-2807
> URL: https://issues.apache.org/jira/browse/BEAM-2807
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 2.1.0
>Reporter: Daniel Harper
>Assignee: Aljoscha Krettek
>Priority: Blocker
>
> *Beam version:* 2.1.0
> *Runner:* FlinkRunner
> We're seeing the following exception when checkpointing, which is causing our 
> job to restart
> {code}
> 2017-08-25 09:42:17,658 INFO  org.apache.flink.runtime.taskmanager.Task   
>   - Combine.perKey(Count) -> ParMultiDo(ToConcurrentStreamResult) 
> -> ParMultiDo(Anonymous) -> ParMultiDo(ApplyShardingKey) -> ToKeyedWorkItem 
> (7/32) (f00a31b722a31030f18d83ac613de21d) switched from RUNNING to FAILED.
> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 2 
> for operator Combine.perKey(Count) -> ParMultiDo(ToConcurrentStreamResult) -> 
> ParMultiDo(Anonymous) -> ParMultiDo(ApplyShardingKey) -> ToKeyedWorkItem 
> (7/32).}
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:966)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.Exception: Could not materialize checkpoint 2 for 
> operator Combine.perKey(Count) -> ParMultiDo(ToConcurrentStreamResult) -> 
> ParMultiDo(Anonymous) -> ParMultiDo(ApplyShardingKey) -> ToKeyedWorkItem 
> (7/32).
> ... 6 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.NullPointerException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:893)
> ... 5 more
> Suppressed: java.lang.Exception: Could not properly cancel managed keyed 
> state future.
> at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1018)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:957)
> ... 5 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.NullPointerException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at 
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
> at 
> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
> at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
> ... 7 more
> Caused by: java.lang.NullPointerException
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
> at 
> org.apache.beam.runners.flink.translation.types.CoderTypeSerializer$CoderTypeSerializerConfigSnapshot.write(CoderTypeSerializer.java:189)
> at 
> org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil$TypeSerializerConfigSnapshotSerializationProxy.write(TypeSerializerSerializationUtil.java:413)
> at 
> org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil.writeSerializerConfigSnapshot(TypeSerializerSerializationUtil.java:229)
> at 
> org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil.writeSerializersAndConfigsWithResilience(TypeSerializerSerializationUtil.java:151)
> at 
> org.apache.flink.runtime.state.KeyedBackendStateMetaInfoSnapshotReaderWriters$KeyedBackendStateMetaInfoWriterV3.writeStateMetaInfo(KeyedBackendStateMetaInfoSnapshotReaderWriters.java:107)
> at 
> org.apache.flink.runtime.state.KeyedBackendSerializationProxy.write(KeyedBackendSerializationProxy.java:104)
> at 
> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:293)
> at 
> 

[beam-site] 01/02: Add blog post with timely processing

2017-08-28 Thread mergebot-role
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 155c6979f5ab2609ab74f9bc1ac61a8c60ddd9e7
Author: Kenneth Knowles 
AuthorDate: Tue Aug 15 20:42:15 2017 -0700

Add blog post with timely processing
---
 src/_posts/2017-08-28-timely-processing.md | 517 +
 .../blog/timely-processing/BatchedRpcExpiry.png| Bin 0 -> 43015 bytes
 .../blog/timely-processing/BatchedRpcStale.png | Bin 0 -> 51523 bytes
 .../blog/timely-processing/BatchedRpcState.png | Bin 0 -> 32633 bytes
 .../blog/timely-processing/CombinePerKey.png   | Bin 0 -> 31517 bytes
 src/images/blog/timely-processing/ParDo.png| Bin 0 -> 28247 bytes
 .../blog/timely-processing/StateAndTimers.png  | Bin 0 -> 21355 bytes
 src/images/blog/timely-processing/UnifiedModel.png | Bin 0 -> 39982 bytes
 .../blog/timely-processing/WindowingChoices.png| Bin 0 -> 20877 bytes
 9 files changed, 517 insertions(+)

diff --git a/src/_posts/2017-08-28-timely-processing.md 
b/src/_posts/2017-08-28-timely-processing.md
new file mode 100644
index 000..d02693a
--- /dev/null
+++ b/src/_posts/2017-08-28-timely-processing.md
@@ -0,0 +1,517 @@
+---
+layout: post
+title:  "Timely (and Stateful) Processing with Apache Beam"
+date:   2017-08-28 00:00:01 -0800
+excerpt_separator: <!--more-->
+categories: blog
+authors:
+  - klk
+---
+
+In a [prior blog
+post]({{ site.baseurl }}/blog/2017/02/13/stateful-processing.html), I
+introduced the basics of stateful processing in Apache Beam, focusing on the
+addition of state to per-element processing. So-called _timely_ processing
+complements stateful processing in Beam by letting you set timers to request a
+(stateful) callback at some point in the future.
+
+What can you do with timers in Beam? Here are some examples:
+
+ - You can output data buffered in state after some amount of processing time.
+ - You can take special action when the watermark estimates that you have
+   received all data up to a specified point in event time.
+ - You can author workflows with timeouts that alter state and emit output in
+   response to the absence of additional input for some period of time.
+
+These are just a few possibilities. State and timers together form a powerful
+programming paradigm for fine-grained control to express a huge variety of
+workflows.  Stateful and timely processing in Beam is portable across data
+processing engines and integrated with Beam's unified model of event time
+windowing in both streaming and batch processing.
+
+
+<!--more-->
+
+## What is stateful and timely processing?
+
+In my prior post, I developed an understanding of stateful processing largely
+by contrast with associative, commutative combiners. In this post, I'll
+emphasize a perspective that I had mentioned only briefly: that elementwise
+processing with access to per-key-and-window state and timers represents a
+fundamental pattern for "embarrassingly parallel" computation, distinct from
+the others in Beam.
+
+In fact, stateful and timely computation is the low-level computational pattern
+that underlies the others. Precisely because it is lower level, it allows you
+to really micromanage your computations to unlock new use cases and new
+efficiencies. This incurs the complexity of manually managing your state and
+timers - it isn't magic! Let's first look again at the two primary
+computational patterns in Beam.
+
+### Element-wise processing (ParDo, Map, etc)
+
+The most elementary embarrassingly parallel pattern is just using a bunch of
+computers to apply the same function to every input element of a massive
+collection. In Beam, per-element processing like this is expressed as a basic
+`ParDo` - analogous to "Map" from MapReduce - which is like an enhanced "map",
+"flatMap", etc, from functional programming.
+
+The following diagram illustrates per-element processing. Input elements are
+squares, output elements are triangles. The colors of the elements represent
+their key, which will matter later. Each input element maps to the
+corresponding output element(s) completely independently. Processing may be
+distributed across computers in any way, yielding essentially limitless
+parallelism.
+
+
+
+This pattern is obvious, exists in all data-parallel paradigms, and has
+a simple stateless implementation. Every input element can be processed
+independently or in arbitrary bundles. Balancing the work between computers is
+actually the hard part, and can be addressed by splitting, progress estimation,
+work-stealing, etc.
+
+### Per-key (and window) aggregation (Combine, Reduce, GroupByKey, etc.)
+
+The other embarrassingly parallel design pattern at the heart of Beam is per-key
+(and window) aggregation. Elements sharing a key are colocated and then
+combined using some associative and commutative operator. In Beam this is
+expressed as a `GroupByKey` or 
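
The timer examples listed at the top of the post reduce to surprisingly little
code. As a taste of the API the post goes on to develop, a minimal sketch
(names illustrative) that buffers elements per key and flushes them one minute
of processing time after the first arrival:

{code}
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class BufferThenFlushFn extends DoFn<KV<String, String>, String> {
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec =
      TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("buffer") BagState<String> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(c.element().getValue());
    // (Re)arm the timer one minute of processing time from now.
    flush.offset(Duration.standardMinutes(1)).setRelative();
  }

  @OnTimer("flush")
  public void onFlush(
      OnTimerContext c, @StateId("buffer") BagState<String> buffer) {
    for (String s : buffer.read()) {
      c.output(s); // emit everything buffered for this key
    }
    buffer.clear();
  }
}
{code}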

[beam-site] 02/02: This closes #296

2017-08-28 Thread mergebot-role
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 7a775c109865eed06d86b536750fa8f4896bd6b3
Merge: 6386ac2 155c697
Author: Mergebot 
AuthorDate: Mon Aug 28 21:35:10 2017 +

This closes #296

 src/_posts/2017-08-28-timely-processing.md | 517 +
 .../blog/timely-processing/BatchedRpcExpiry.png| Bin 0 -> 43015 bytes
 .../blog/timely-processing/BatchedRpcStale.png | Bin 0 -> 51523 bytes
 .../blog/timely-processing/BatchedRpcState.png | Bin 0 -> 32633 bytes
 .../blog/timely-processing/CombinePerKey.png   | Bin 0 -> 31517 bytes
 src/images/blog/timely-processing/ParDo.png| Bin 0 -> 28247 bytes
 .../blog/timely-processing/StateAndTimers.png  | Bin 0 -> 21355 bytes
 src/images/blog/timely-processing/UnifiedModel.png | Bin 0 -> 39982 bytes
 .../blog/timely-processing/WindowingChoices.png| Bin 0 -> 20877 bytes
 9 files changed, 517 insertions(+)

-- 
To stop receiving notification emails like this one, please contact
"commits@beam.apache.org" .


[beam-site] branch mergebot updated (ec83da1 -> 7a775c1)

2017-08-28 Thread mergebot-role
This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a change to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git.


from ec83da1  This closes #300
 add 6386ac2  Prepare repository for deployment.
 new 155c697  Add blog post with timely processing
 new 7a775c1  This closes #296

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/documentation/dsls/sql/index.html  |  73 ++-
 src/_posts/2017-08-28-timely-processing.md | 517 +
 .../blog/timely-processing/BatchedRpcExpiry.png| Bin 0 -> 43015 bytes
 .../blog/timely-processing/BatchedRpcStale.png | Bin 0 -> 51523 bytes
 .../blog/timely-processing/BatchedRpcState.png | Bin 0 -> 32633 bytes
 .../blog/timely-processing/CombinePerKey.png   | Bin 0 -> 31517 bytes
 src/images/blog/timely-processing/ParDo.png| Bin 0 -> 28247 bytes
 .../blog/timely-processing/StateAndTimers.png  | Bin 0 -> 21355 bytes
 src/images/blog/timely-processing/UnifiedModel.png | Bin 0 -> 39982 bytes
 .../blog/timely-processing/WindowingChoices.png| Bin 0 -> 20877 bytes
 10 files changed, 576 insertions(+), 14 deletions(-)
 create mode 100644 src/_posts/2017-08-28-timely-processing.md
 create mode 100644 src/images/blog/timely-processing/BatchedRpcExpiry.png
 create mode 100644 src/images/blog/timely-processing/BatchedRpcStale.png
 create mode 100644 src/images/blog/timely-processing/BatchedRpcState.png
 create mode 100644 src/images/blog/timely-processing/CombinePerKey.png
 create mode 100644 src/images/blog/timely-processing/ParDo.png
 create mode 100644 src/images/blog/timely-processing/StateAndTimers.png
 create mode 100644 src/images/blog/timely-processing/UnifiedModel.png
 create mode 100644 src/images/blog/timely-processing/WindowingChoices.png

-- 
To stop receiving notification emails like this one, please contact
['"commits@beam.apache.org" '].


[jira] [Updated] (BEAM-2814) test_as_singleton_with_different_defaults test is flaky

2017-08-28 Thread Ahmet Altay (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Altay updated BEAM-2814:
--
Description: 
{{test_as_singleton_with_different_defaults}} is flaky and failed in
post-commit test run 3013, but there is no related change to trigger this.

https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3013/consoleFull
(https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-08-28_11_08_56-17324181904913254210?project=apache-beam-testing)

Dataflow error from the console:
  (b4d390f9f9e033b4): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", 
line 582, in do_work
work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
line 166, in execute
op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10607)
def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10501)
with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 323, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10322)
self.dofn_runner = common.DoFnRunner(
  File "apache_beam/runners/common.py", line 378, in 
apache_beam.runners.common.DoFnRunner.__init__ 
(apache_beam/runners/common.c:10018)
self.do_fn_invoker = DoFnInvoker.create_invoker(
  File "apache_beam/runners/common.py", line 154, in 
apache_beam.runners.common.DoFnInvoker.create_invoker 
(apache_beam/runners/common.c:5212)
return PerWindowInvoker(
  File "apache_beam/runners/common.py", line 219, in 
apache_beam.runners.common.PerWindowInvoker.__init__ 
(apache_beam/runners/common.c:7109)
input_args, input_kwargs, [si[global_window] for si in side_inputs])
  File 
"/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/sideinputs.py", 
line 63, in __getitem__
_FilteringIterable(self._iterable, target_window), self._view_options)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.py", line 
332, in _from_runtime_iterable
'PCollection with more than one element accessed as '
ValueError: PCollection with more than one element accessed as a singleton view.

  was:
{{test_as_singleton_with_different_defaults}} is flaky and failed in
post-commit test run 3013, but there is no related change to trigger this.

https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3013/consoleFull
(https://pantheon.corp.google.com/dataflow/job/2017-08-28_11_08_56-17324181904913254210?project=apache-beam-testing)

Dataflow error from the console:
  (b4d390f9f9e033b4): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", 
line 582, in do_work
work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
line 166, in execute
op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10607)
def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10501)
with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 323, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10322)
self.dofn_runner = common.DoFnRunner(
  File "apache_beam/runners/common.py", line 378, in 
apache_beam.runners.common.DoFnRunner.__init__ 
(apache_beam/runners/common.c:10018)
self.do_fn_invoker = DoFnInvoker.create_invoker(
  File "apache_beam/runners/common.py", line 154, in 
apache_beam.runners.common.DoFnInvoker.create_invoker 
(apache_beam/runners/common.c:5212)
return PerWindowInvoker(
  File "apache_beam/runners/common.py", line 219, in 
apache_beam.runners.common.PerWindowInvoker.__init__ 
(apache_beam/runners/common.c:7109)
input_args, input_kwargs, [si[global_window] for si in side_inputs])
  File 
"/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/sideinputs.py", 
line 63, in __getitem__
_FilteringIterable(self._iterable, target_window), self._view_options)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.py", line 
332, in _from_runtime_iterable
'PCollection with more than one element accessed as '
ValueError: PCollection with more than one element accessed as a singleton view.


> test_as_singleton_with_different_defaults test is flaky
> 
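
For anyone hitting this without context: AsSingleton materializes a side input
as exactly one value (or the declared default when the PCollection is empty),
so a two-element PCollection raises exactly this ValueError at access time. A
standalone sketch of the failure mode, not the test itself:

{code}
import apache_beam as beam
from apache_beam.pvalue import AsSingleton

p = beam.Pipeline()
main = p | 'main' >> beam.Create(['a', 'b'])
side = p | 'side' >> beam.Create([1, 2])  # two elements: not a singleton

# The side input is only materialized at access time, which is when the
# "more than one element accessed as a singleton view" error is raised.
scaled = (main
          | beam.Map(lambda x, n: x * n,
                     n=AsSingleton(side, default_value=0)))

result = p.run()
result.wait_until_finish()
{code}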

[jira] [Commented] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

2017-08-28 Thread Ahmet Altay (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144358#comment-16144358
 ] 

Ahmet Altay commented on BEAM-2815:
---

This is a known issue and is tracked in the issue you mentioned
(https://issues.apache.org/jira/browse/BEAM-1442).

There is a series of improvements that could be applied. Would you be
interested in helping in this area?

cc: [~charleschen]

> Python DirectRunner is unusable with input files in the 100-250MB range
> ---
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct, sdk-py
>Affects Versions: 2.1.0
> Environment: python 2.7.10, beam 2.1, os x 
>Reporter: Peter Hausel
>Assignee: Ahmet Altay
> Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 
> 2017-08-27 at 9.06.00 AM.png
>
>
> The current python DirectRunner implementation seems to be unusable with 
> training data sets that are bigger than tiny samples - making serious local 
> development impossible or very cumbersome. I am aware of some of the 
> limitations of the current DirectRunner implementation[1][2][3], however I 
> was not sure if this odd behavior is expected.
> [1][2][3]
> https://stackoverflow.com/a/44765621
> https://issues.apache.org/jira/browse/BEAM-1442
> https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015), and I had to terminate
> the process after 10 minutes or so (screenshots of the high memory and CPU
> utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
>  """Main entry point; defines and runs the pipeline."""
>  parser = argparse.ArgumentParser()
>  parser.add_argument('--input',
>   dest='input',
>   
> default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>   help='Input file to process.')
>  known_args, pipeline_args = parser.parse_known_args(argv)
>  pipeline_options = PipelineOptions(pipeline_args)
>  pipeline_options.view_as(SetupOptions).save_main_session = True
>  pipeline = beam.Pipeline(options=pipeline_options)
>  raw_data = (
>pipeline
>| 'ReadTrainData' >> textio.ReadFromText(known_args.input, 
> skip_header_lines=1)
>| 'Map' >> beam.Map(lambda line: line.lower())
>  )
>  result = pipeline.run()
>  result.wait_until_finish()
>  print(raw_data)
> if __name__ == '__main__':
>   run()
> {code}
> Example dataset:  
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> for comparison: 
> {code}
> lines = [line.lower() for line in 
> open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> this vanilla python script runs on the same hardware and dataset in 0m4.909s. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

2017-08-28 Thread Ahmet Altay (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Altay reassigned BEAM-2815:
-

Assignee: Ahmet Altay  (was: Thomas Groh)

> Python DirectRunner is unusable with input files in the 100-250MB range
> ---
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct, sdk-py
>Affects Versions: 2.1.0
> Environment: python 2.7.10, beam 2.1, os x 
>Reporter: Peter Hausel
>Assignee: Ahmet Altay
> Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 
> 2017-08-27 at 9.06.00 AM.png
>
>
> The current python DirectRunner implementation seems to be unusable with 
> training data sets that are bigger than tiny samples - making serious local 
> development impossible or very cumbersome. I am aware of some of the 
> limitations of the current DirectRunner implementation[1][2][3], however I 
> was not sure if this odd behavior is expected.
> [1][2][3]
> https://stackoverflow.com/a/44765621
> https://issues.apache.org/jira/browse/BEAM-1442
> https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015), and I had to terminate
> the process after 10 minutes or so (screenshots of the high memory and CPU
> utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
>  """Main entry point; defines and runs the pipeline."""
>  parser = argparse.ArgumentParser()
>  parser.add_argument('--input',
>   dest='input',
>   
> default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>   help='Input file to process.')
>  known_args, pipeline_args = parser.parse_known_args(argv)
>  pipeline_options = PipelineOptions(pipeline_args)
>  pipeline_options.view_as(SetupOptions).save_main_session = True
>  pipeline = beam.Pipeline(options=pipeline_options)
>  raw_data = (
>pipeline
>| 'ReadTrainData' >> textio.ReadFromText(known_args.input, 
> skip_header_lines=1)
>| 'Map' >> beam.Map(lambda line: line.lower())
>  )
>  result = pipeline.run()
>  result.wait_until_finish()
>  print(raw_data)
> if __name__ == '__main__':
>   run()
> {code}
> Example dataset:  
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> for comparison: 
> {code}
> lines = [line.lower() for line in 
> open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> this vanilla python script runs on the same hardware and dataset in 0m4.909s. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

2017-08-28 Thread Ahmet Altay (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Altay updated BEAM-2815:
--
Component/s: sdk-py

> Python DirectRunner is unusable with input files in the 100-250MB range
> ---
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct, sdk-py
>Affects Versions: 2.1.0
> Environment: python 2.7.10, beam 2.1, os x 
>Reporter: Peter Hausel
>Assignee: Thomas Groh
> Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 
> 2017-08-27 at 9.06.00 AM.png
>
>
> The current python DirectRunner implementation seems to be unusable with 
> training data sets that are bigger than tiny samples - making serious local 
> development impossible or very cumbersome. I am aware of some of the 
> limitations of the current DirectRunner implementation[1][2][3], however I 
> was not sure if this odd behavior is expected.
> [1][2][3]
> https://stackoverflow.com/a/44765621
> https://issues.apache.org/jira/browse/BEAM-1442
> https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015), and I had to terminate
> the process after 10 minutes or so (screenshots of the high memory and CPU
> utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
>  """Main entry point; defines and runs the pipeline."""
>  parser = argparse.ArgumentParser()
>  parser.add_argument('--input',
>   dest='input',
>   
> default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>   help='Input file to process.')
>  known_args, pipeline_args = parser.parse_known_args(argv)
>  pipeline_options = PipelineOptions(pipeline_args)
>  pipeline_options.view_as(SetupOptions).save_main_session = True
>  pipeline = beam.Pipeline(options=pipeline_options)
>  raw_data = (
>pipeline
>| 'ReadTrainData' >> textio.ReadFromText(known_args.input, 
> skip_header_lines=1)
>| 'Map' >> beam.Map(lambda line: line.lower())
>  )
>  result = pipeline.run()
>  result.wait_until_finish()
>  print(raw_data)
> if __name__ == '__main__':
>   run()
> {code}
> Example dataset:  
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> for comparison: 
> {code}
> lines = [line.lower() for line in 
> open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> this vanilla python script runs on the same hardware and dataset in 0m4.909s. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

2017-08-28 Thread Peter Hausel (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Hausel updated BEAM-2815:
---
Attachment: Screen Shot 2017-08-27 at 9.00.29 AM.png
Screen Shot 2017-08-27 at 9.06.00 AM.png

> Python DirectRunner is unusable with input files in the 100-250MB range
> ---
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.1.0
> Environment: python 2.7.10, beam 2.1, os x 
>Reporter: Peter Hausel
>Assignee: Thomas Groh
> Attachments: Screen Shot 2017-08-27 at 9.00.29 AM.png, Screen Shot 
> 2017-08-27 at 9.06.00 AM.png
>
>
> The current python DirectRunner implementation seems to be unusable with 
> training data sets that are bigger than tiny samples - making serious local 
> development impossible or very cumbersome. I am aware of some of the 
> limitations of the current DirectRunner implementation[1][2][3], however I 
> was not sure if this odd behavior is expected.
> [1][2][3]
> https://stackoverflow.com/a/44765621
> https://issues.apache.org/jira/browse/BEAM-1442
> https://beam.apache.org/documentation/runners/direct/
> Repro:
> The simple script below blew up my laptop (MBP 2015) and had to terminate the 
> process after 10 minutes or so (screenshots about high memory and CPU 
> utilization are also attached).
> {code}
> from apache_beam.io import textio
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
> import argparse
> def run(argv=None):
>  """Main entry point; defines and runs the pipeline."""
>  parser = argparse.ArgumentParser()
>  parser.add_argument('--input',
>   dest='input',
>   
> default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
>   help='Input file to process.')
>  known_args, pipeline_args = parser.parse_known_args(argv)
>  pipeline_options = PipelineOptions(pipeline_args)
>  pipeline_options.view_as(SetupOptions).save_main_session = True
>  pipeline = beam.Pipeline(options=pipeline_options)
>  raw_data = (
>pipeline
>| 'ReadTrainData' >> textio.ReadFromText(known_args.input, 
> skip_header_lines=1)
>| 'Map' >> beam.Map(lambda line: line.lower())
>  )
>  result = pipeline.run()
>  result.wait_until_finish()
>  print(raw_data)
> if __name__ == '__main__':
>   run()
> {code}
> Example dataset:  
> https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
> for comparison: 
> {code}
> lines = [line.lower() for line in 
> open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
> print(len(lines))
> {code}
> this vanilla python script runs on the same hardware and dataset in 0m4.909s. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2815) Python DirectRunner is unusable with input files in the 100-250MB range

2017-08-28 Thread Peter Hausel (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Hausel updated BEAM-2815:
---
Description: 
The current Python DirectRunner implementation seems to be unusable with 
training data sets that are bigger than tiny samples, making serious local 
development impossible or very cumbersome. I am aware of some of the 
limitations of the current DirectRunner implementation[1][2][3]; however, I was 
not sure whether this odd behavior is expected.


[1][2][3]
https://stackoverflow.com/a/44765621
https://issues.apache.org/jira/browse/BEAM-1442
https://beam.apache.org/documentation/runners/direct/

Repro:
The simple script below blew up my laptop (MBP 2015); I had to terminate the 
process after 10 minutes or so (screenshots of the high memory and CPU 
utilization are also attached).

{code}
from apache_beam.io import textio
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
import argparse

def run(argv=None):
 """Main entry point; defines and runs the pipeline."""
 parser = argparse.ArgumentParser()
 parser.add_argument('--input',
  dest='input',
  
default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
  help='Input file to process.')
 known_args, pipeline_args = parser.parse_known_args(argv)
 pipeline_options = PipelineOptions(pipeline_args)
 pipeline_options.view_as(SetupOptions).save_main_session = True
 pipeline = beam.Pipeline(options=pipeline_options)
 raw_data = (
   pipeline
   | 'ReadTrainData' >> textio.ReadFromText(known_args.input, 
skip_header_lines=1)
   | 'Map' >> beam.Map(lambda line: line.lower())
 )
 result = pipeline.run()
 result.wait_until_finish()
 print(raw_data)

if __name__ == '__main__':
  run()
{code}

Example dataset:  
https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009

for comparison: 

{code}
lines = [line.lower() for line in 
open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
print(len(lines))
{code}

This vanilla Python script runs on the same hardware and dataset in 0m4.909s. 

  was:
The current python DirectRunner implementation seems to be unusable with 
training data sets that are bigger than tiny samples - making serious local 
development impossible or very cumbersome. I am aware of some of the 
limitations of the current DirectRunner implementation[1][2][3], however I was 
not sure if this odd behavior is expected.


[1][2][3]
https://stackoverflow.com/a/44765621
https://issues.apache.org/jira/browse/BEAM-1442
https://beam.apache.org/documentation/runners/direct/

Repro:
The simple script below blew up my laptop (MBP 2015) and had to terminate the 
process after 10 minutes or so (screenshots about high memory and CPU 
utilization are also attached).

{code}
from apache_beam.io import textio
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
import argparse

def run(argv=None):
 """Main entry point; defines and runs the wordcount pipeline."""
 parser = argparse.ArgumentParser()
 parser.add_argument('--input',
  dest='input',
  
default='/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv',
  help='Input file to process.')
 known_args, pipeline_args = parser.parse_known_args(argv)
 pipeline_options = PipelineOptions(pipeline_args)
 pipeline_options.view_as(SetupOptions).save_main_session = True
 pipeline = beam.Pipeline(options=pipeline_options)
 raw_data = (
   pipeline
   | 'ReadTrainData' >> textio.ReadFromText(known_args.input, 
skip_header_lines=1)
   | 'Map' >> beam.Map(lambda line: line.lower())
 )
 result = pipeline.run()
 result.wait_until_finish()
 print(raw_data)

if __name__ == '__main__':
  run()
{code}

Example dataset:  
https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009

for comparison: 

{code}
lines = [line.lower() for line in 
open('/Users/pk11/Desktop/Motor_Vehicle_Crashes_-_Vehicle_Information__Three_Year_Window.csv')]
print(len(lines))
{code}

this vanilla python script runs on the same hardware and dataset in 0m4.909s. 


> Python DirectRunner is unusable with input files in the 100-250MB range
> ---
>
> Key: BEAM-2815
> URL: https://issues.apache.org/jira/browse/BEAM-2815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 

[jira] [Created] (BEAM-2814) test_as_singleton_with_different_defaults test is flaky

2017-08-28 Thread Ahmet Altay (JIRA)
Ahmet Altay created BEAM-2814:
-

 Summary: test_as_singleton_with_different_defaults test is flaky
 Key: BEAM-2814
 URL: https://issues.apache.org/jira/browse/BEAM-2814
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Ahmet Altay
Priority: Critical


{{test_as_singleton_with_different_defaults}} is flaky: it failed in post-commit 
test 3013, but there is no related change that would trigger this.

https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3013/consoleFull
(https://pantheon.corp.google.com/dataflow/job/2017-08-28_11_08_56-17324181904913254210?project=apache-beam-testing)

Dataflow error from the console:
  (b4d390f9f9e033b4): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", 
line 582, in do_work
work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
line 166, in execute
op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10607)
def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10501)
with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 323, in 
apache_beam.runners.worker.operations.DoOperation.start 
(apache_beam/runners/worker/operations.c:10322)
self.dofn_runner = common.DoFnRunner(
  File "apache_beam/runners/common.py", line 378, in 
apache_beam.runners.common.DoFnRunner.__init__ 
(apache_beam/runners/common.c:10018)
self.do_fn_invoker = DoFnInvoker.create_invoker(
  File "apache_beam/runners/common.py", line 154, in 
apache_beam.runners.common.DoFnInvoker.create_invoker 
(apache_beam/runners/common.c:5212)
return PerWindowInvoker(
  File "apache_beam/runners/common.py", line 219, in 
apache_beam.runners.common.PerWindowInvoker.__init__ 
(apache_beam/runners/common.c:7109)
input_args, input_kwargs, [si[global_window] for si in side_inputs])
  File 
"/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/sideinputs.py", 
line 63, in __getitem__
_FilteringIterable(self._iterable, target_window), self._view_options)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.py", line 
332, in _from_runtime_iterable
'PCollection with more than one element accessed as '
ValueError: PCollection with more than one element accessed as a singleton view.
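
For context, a minimal sketch (a hypothetical pipeline, not the failing test 
itself) that triggers the same error on the DirectRunner:

{code}
import apache_beam as beam

p = beam.Pipeline()
main = p | 'Main' >> beam.Create([1])
side = p | 'Side' >> beam.Create(['a', 'b'])  # more than one element

# AsSingleton requires exactly one element per window; with two elements the
# runner raises "PCollection with more than one element accessed as a
# singleton view." when the side input is read.
main | 'UseSide' >> beam.Map(lambda x, s: (x, s), beam.pvalue.AsSingleton(side))

p.run().wait_until_finish()
{code}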



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2813) error: option --test-pipeline-options not recognized

2017-08-28 Thread Ahmet Altay (JIRA)
Ahmet Altay created BEAM-2813:
-

 Summary: error: option --test-pipeline-options not recognized
 Key: BEAM-2813
 URL: https://issues.apache.org/jira/browse/BEAM-2813
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Ahmet Altay
Assignee: Mark Liu


Python post-commit builds 3004 to 3008 (all 5) failed with this error, but it 
was somehow fixed in 3009. Mark, do you know what might be causing this?

https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3004/
https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3009/

The error is:

# Run ValidatesRunner tests on Google Cloud Dataflow service
echo ">>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS"
>>> RUNNING DATAFLOW RUNNER VALIDATESRUNNER TESTS
python setup.py nosetests \
  --attr ValidatesRunner \
  --nocapture \
  --processes=4 \
  --process-timeout=900 \
  --test-pipeline-options=" \
--runner=TestDataflowRunner \
--project=$PROJECT \
--staging_location=$GCS_LOCATION/staging-validatesrunner-test \
--temp_location=$GCS_LOCATION/temp-validatesrunner-test \
--sdk_location=$SDK_LOCATION \
--requirements_file=postcommit_requirements.txt \
--num_workers=1"
/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/local/lib/python2.7/site-packages/setuptools/dist.py:341:
 UserWarning: Normalizing '2.2.0.dev' to '2.2.0.dev0'
  normalized_version,
/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/.eggs/nose-1.3.7-py2.7.egg/nose/plugins/manager.py:395:
 RuntimeWarning: Unable to load plugin beam_test_plugin = 
test_config:BeamTestPlugin: (pyasn1 0.3.3 
(/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/lib/python2.7/site-packages),
 Requirement.parse('pyasn1==0.3.2'), set(['pyasn1-modules']))
  RuntimeWarning)
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: option --test-pipeline-options not recognized

(Note: the RuntimeWarning above shows the beam_test_plugin nose plugin failing 
to load because of a pyasn1 version conflict; if that plugin does not load, 
setup.py will not recognize the option it registers, which likely explains 
this error.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3775: Use the same termination logic in different places ...

2017-08-28 Thread aaltay
GitHub user aaltay opened a pull request:

https://github.com/apache/beam/pull/3775

Use the same termination logic in different places in DataflowRunner

Use the same termination check for `wait_until_finish` and 
`_is_in_terminal_state`. Note that this only affects an `assert` with a TODO to 
improve it.
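
For context, a rough sketch of the kind of shared check involved (the Dataflow 
job state names are real; the helper name and its wiring are assumptions, not 
the actual patch):

    _TERMINAL_STATES = frozenset([
        'JOB_STATE_DONE', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED',
        'JOB_STATE_UPDATED', 'JOB_STATE_DRAINED'])

    def _is_terminal(state):
        # Single source of truth, consulted by both wait_until_finish() and
        # _is_in_terminal_state(), so the two checks cannot drift apart.
        return state in _TERMINAL_STATES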

The change is required to produce clean failure messages instead of this 
assert for cases similar to: 
https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Python_Verify/3013/consoleFull

(https://pantheon.corp.google.com/dataflow/job/2017-08-28_11_08_56-17324181904913254210?project=apache-beam-testing)

R: @charlesccychen 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aaltay/beam cancel

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3775


commit 4460261f5ea7e5bf6bec682b8c0977d3623193b5
Author: Ahmet Altay 
Date:   2017-08-28T20:32:32Z

Use the same termination logic in different places




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] beam pull request #3774: Test show stall

2017-08-28 Thread mariapython
GitHub user mariapython opened a pull request:

https://github.com/apache/beam/pull/3774

Test show stall




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mariapython/incubator-beam retry_gbk2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3774


commit 2da2df82940ea35fd35dd68ed08b583e4c7a330e
Author: Maria Garcia Herrero 
Date:   2017-08-28T19:26:49Z

Test show stall




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Dataflow #3857

2017-08-28 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2484) Add streaming examples for Python

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144221#comment-16144221
 ] 

ASF GitHub Bot commented on BEAM-2484:
--

Github user davidcavazos closed the pull request at:

https://github.com/apache/beam/pull/3756


> Add streaming examples for Python
> -
>
> Key: BEAM-2484
> URL: https://issues.apache.org/jira/browse/BEAM-2484
> Project: Beam
>  Issue Type: Improvement
>  Components: examples-python
>Reporter: David Cavazos
>Priority: Trivial
>
> Port the mobile gaming example to Python



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3756: Tests for python gaming examples

2017-08-28 Thread davidcavazos
Github user davidcavazos closed the pull request at:

https://github.com/apache/beam/pull/3756


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Jenkins build became unstable: beam_PostCommit_Java_MavenInstall #4665

2017-08-28 Thread Apache Jenkins Server
See 




Build failed in Jenkins: beam_PostCommit_Python_Verify #3013

2017-08-28 Thread Apache Jenkins Server
See 


Changes:

[lcwik] [BEAM-1347] Add a BagUserState implementation over the BeamFnStateClient

[lcwik] Initialize the Coder in DecodeAndEmitDoFn

[robertwb] Wrap unknown coders in LengthPrefixCoder.

--
[...truncated 685.95 KB...]
}
  ], 
  "is_pair_like": true
}
  ], 
  "is_stream_like": true
}
  ], 
  "is_pair_like": true
}, 
{
  "@type": "kind:global_window"
}
  ], 
  "is_wrapper": true
}, 
"output_name": "out", 
"user_name": "assert_that/Group/GroupByKey.out"
  }
], 
"parallel_input": {
  "@type": "OutputReference", 
  "output_name": "out", 
  "step_name": "s11"
}, 
"serialized_fn": 
"%0AZ%22X%0A%1Dref_Coder_GlobalWindowCoder_1%127%0A5%0A3%0A1urn%3Abeam%3Acoders%3Aurn%3Abeam%3Acoders%3Aglobal_window%3A0.1jJ%0A%25%0A%23%0A%21beam%3Awindowfn%3Aglobal_windows%3Av0.1%1A%1Dref_Coder_GlobalWindowCoder_1%22%02%3A%00",
 
"user_name": "assert_that/Group/GroupByKey"
  }
}, 
{
  "kind": "ParallelDo", 
  "name": "s13", 
  "properties": {
"display_data": [
  {
"key": "fn", 
"label": "Transform Function", 
"namespace": "apache_beam.transforms.core.CallableWrapperDoFn", 
"type": "STRING", 
"value": "_merge_tagged_vals_under_key"
  }, 
  {
"key": "fn", 
"label": "Transform Function", 
"namespace": "apache_beam.transforms.core.ParDo", 
"shortValue": "CallableWrapperDoFn", 
"type": "STRING", 
"value": "apache_beam.transforms.core.CallableWrapperDoFn"
  }
], 
"non_parallel_inputs": {}, 
"output_info": [
  {
"encoding": {
  "@type": "kind:windowed_value", 
  "component_encodings": [
{
  "@type": 
"FastPrimitivesCoder$eNprYEpOLEhMzkiNT0pNzNVLzk9JLSqGUlxuicUlAUWZuZklmWWpxc4gQa5CBs3GQsbaQqZQ/vi0xJycpMTk7Hiw+kJmPEYFZCZn56RCjWABGsFaW8iWVJykBwDlGS3/",
 
  "component_encodings": [
{
  "@type": 
"FastPrimitivesCoder$eNprYEpOLEhMzkiNT0pNzNVLzk9JLSqGUlxuicUlAUWZuZklmWWpxc4gQa5CBs3GQsbaQqZQ/vi0xJycpMTk7Hiw+kJmPEYFZCZn56RCjWABGsFaW8iWVJykBwDlGS3/",
 
  "component_encodings": []
}, 
{
  "@type": 
"FastPrimitivesCoder$eNprYEpOLEhMzkiNT0pNzNVLzk9JLSqGUlxuicUlAUWZuZklmWWpxc4gQa5CBs3GQsbaQqZQ/vi0xJycpMTk7Hiw+kJmPEYFZCZn56RCjWABGsFaW8iWVJykBwDlGS3/",
 
  "component_encodings": []
}
  ], 
  "is_pair_like": true
}, 
{
  "@type": "kind:global_window"
}
  ], 
  "is_wrapper": true
}, 
"output_name": "out", 
"user_name": 
"assert_that/Group/Map(_merge_tagged_vals_under_key).out"
  }
], 
"parallel_input": {
  "@type": "OutputReference", 
  "output_name": "out", 
  "step_name": "s12"
}, 
"serialized_fn": "", 
"user_name": "assert_that/Group/Map(_merge_tagged_vals_under_key)"
  }
}, 
{
  "kind": "ParallelDo", 
  "name": "s14", 
  "properties": {
"display_data": [
  {
"key": "fn", 
"label": "Transform Function", 
"namespace": "apache_beam.transforms.core.CallableWrapperDoFn", 
"type": "STRING", 
"value": ""
  }, 
  {
"key": "fn", 
"label": "Transform Function", 
"namespace": "apache_beam.transforms.core.ParDo", 
"shortValue": "CallableWrapperDoFn", 
"type": "STRING", 
"value": "apache_beam.transforms.core.CallableWrapperDoFn"
  }
], 
"non_parallel_inputs": {}, 
"output_info": [
  {
"encoding": {
  "@type": "kind:windowed_value", 
  "component_encodings": [
{
  "@type": 
"FastPrimitivesCoder$eNprYEpOLEhMzkiNT0pNzNVLzk9JLSqGUlxuicUlAUWZuZklmWWpxc4gQa5CBs3GQsbaQqZQ/vi0xJycpMTk7Hiw+kJmPEYFZCZn56RCjWABGsFaW8iWVJykBwDlGS3/",
 
  "component_encodings": [
{
  "@type": 

Jenkins build is back to stable : beam_PostCommit_Java_MavenInstall #4663

2017-08-28 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Dataflow #3856

2017-08-28 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2806) support View.CreatePCollectionView in FlinkRunner

2017-08-28 Thread Xu Mingmin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144121#comment-16144121
 ] 

Xu Mingmin commented on BEAM-2806:
--

I'm running on the DSL_SQL branch, synced up from master on July 12.

It works with {{PCollection streamingPCollection = p.apply("src1", 
Create.of("1", "2"));}}, but not {{PCollection streamingPCollection 
= sojEventTable.buildIOReader(p);}}.

The first is an in-memory bounded source, and the latter is an unbounded 
source. Could that be the problem?


> support View.CreatePCollectionView in FlinkRunner
> -
>
> Key: BEAM-2806
> URL: https://issues.apache.org/jira/browse/BEAM-2806
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Xu Mingmin
>Assignee: Aljoscha Krettek
>
> Beam version: 2.2.0-SNAPSHOT
> Here's the code
> {code}
> PCollectionView> rowsView = rightRows
> .apply(View.asMultimap());
> {code}
> And exception when running with {{FlinkRunner}}:
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: The 
> transform View.CreatePCollectionView is currently not supported.
>   at 
> org.apache.beam.runners.flink.FlinkStreamingPipelineTranslator.visitPrimitiveTransform(FlinkStreamingPipelineTranslator.java:113)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:594)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:586)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:586)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:268)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:202)
>   at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:440)
>   at 
> org.apache.beam.runners.flink.FlinkPipelineTranslator.translate(FlinkPipelineTranslator.java:38)
>   at 
> org.apache.beam.runners.flink.FlinkStreamingPipelineTranslator.translate(FlinkStreamingPipelineTranslator.java:69)
>   at 
> org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.translate(FlinkPipelineExecutionEnvironment.java:104)
>   at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:113)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2457) Error: "Unable to find registrar for hdfs" - need to prevent/improve error message

2017-08-28 Thread Flavio Fiszman (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144110#comment-16144110
 ] 

Flavio Fiszman commented on BEAM-2457:
--

I'm not sure about the error in 2.2.0-SNAPSHOT,
but for the "Unable to find registrar for hdfs" error, there are these related 
questions that have solved other people's issues:

https://stackoverflow.com/questions/26958865/no-filesystem-for-scheme-hdfs/28135140#28135140
https://stackoverflow.com/questions/44365545/apache-beam-unable-to-find-registrar-for-gs
https://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file
https://issues.apache.org/jira/projects/BEAM/issues/BEAM-2429?filter=allissues

Let me know if that helps you, thanks!

> Error: "Unable to find registrar for hdfs" - need to prevent/improve error 
> message
> --
>
> Key: BEAM-2457
> URL: https://issues.apache.org/jira/browse/BEAM-2457
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Affects Versions: 2.0.0
>Reporter: Stephen Sisk
>Assignee: Flavio Fiszman
>
> I've noticed a number of user reports where jobs are failing with the error 
> message "Unable to find registrar for hdfs": 
> * 
> https://stackoverflow.com/questions/44497662/apache-beamunable-to-find-registrar-for-hdfs/44508533?noredirect=1#comment76026835_44508533
> * 
> https://lists.apache.org/thread.html/144c384e54a141646fcbe854226bb3668da091c5dc7fa2d471626e9b@%3Cuser.beam.apache.org%3E
> * 
> https://lists.apache.org/thread.html/e4d5ac744367f9d036a1f776bba31b9c4fe377d8f11a4b530be9f829@%3Cuser.beam.apache.org%3E
>  
> This isn't too many reports, but it is the only time I can recall so many 
> users reporting the same error message in such a short amount of time. 
> We believe the problem is one of two things: 
> 1) bad uber jar creation
> 2) incorrect HDFS configuration
> However, it's highly possible this could have some other root cause. 
> It seems like it'd be useful to:
> 1) Follow up with the above reports to see if they've resolved the issue, and 
> if so what fixed it. There may be another root cause out there.
> 2) Improve the error message to include more information about how to resolve 
> it
> 3) See if we can improve detection of the error cases to give more specific 
> information (specifically, if HDFS is misconfigured, can we detect that 
> somehow and tell the user exactly that?)
> 4) update documentation



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_MavenInstall #4662

2017-08-28 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : beam_PostCommit_Python_Verify #3012

2017-08-28 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2792) Populate All Runner API Components from the Python SDK

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144057#comment-16144057
 ] 

ASF GitHub Bot commented on BEAM-2792:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3764


> Populate All Runner API Components from the Python SDK
> --
>
> Key: BEAM-2792
> URL: https://issues.apache.org/jira/browse/BEAM-2792
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Robert Bradshaw
>Assignee: Robert Bradshaw
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3764: [BEAM-2792] Wrap unknown coders in LengthPrefixCode...

2017-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3764


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] beam git commit: Wrap unknown coders in LengthPrefixCoder.

2017-08-28 Thread robertwb
Wrap unknown coders in LengthPrefixCoder.
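
For context, a length-prefixed encoding writes the size of each encoded element 
before its bytes, so a runner that cannot interpret the inner coder can still 
delimit elements and pass them through opaquely. A minimal sketch of the idea, 
not Beam's implementation (Beam uses a varint length; a fixed 4-byte prefix is 
used here only for brevity):

    import struct

    def length_prefix_encode(payload):
        # Prepend the payload size so a reader can delimit (or skip) the
        # element without understanding its encoding.
        return struct.pack('>I', len(payload)) + payload

    def length_prefix_decode(buf):
        # Returns (first element, remaining bytes).
        (size,) = struct.unpack('>I', buf[:4])
        return buf[4:4 + size], buf[4 + size:]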


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/08a44874
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/08a44874
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/08a44874

Branch: refs/heads/master
Commit: 08a448743e3b53e055d0ccf1983b5d128c5c0692
Parents: e6d5e08
Author: Robert Bradshaw 
Authored: Thu Aug 24 11:01:20 2017 -0700
Committer: Robert Bradshaw 
Committed: Mon Aug 28 10:03:52 2017 -0700

--
 sdks/python/apache_beam/coders/coders.py| 10 ++
 .../runners/portability/fn_api_runner.py| 99 ++--
 2 files changed, 100 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/08a44874/sdks/python/apache_beam/coders/coders.py
--
diff --git a/sdks/python/apache_beam/coders/coders.py 
b/sdks/python/apache_beam/coders/coders.py
index e204369..10fb07b 100644
--- a/sdks/python/apache_beam/coders/coders.py
+++ b/sdks/python/apache_beam/coders/coders.py
@@ -707,6 +707,16 @@ class TupleCoder(FastCoder):
   def __hash__(self):
 return hash(self._coders)
 
+  def to_runner_api_parameter(self, context):
+if self.is_kv_coder():
+  return urns.KV_CODER, None, self.coders()
+else:
+  return super(TupleCoder, self).to_runner_api_parameter(context)
+
+  @Coder.register_urn(urns.KV_CODER, None)
+  def from_runner_api_parameter(unused_payload, components, unused_context):
+return TupleCoder(components)
+
 
 class TupleSequenceCoder(FastCoder):
   """Coder of homogeneous tuple objects."""

http://git-wip-us.apache.org/repos/asf/beam/blob/08a44874/sdks/python/apache_beam/runners/portability/fn_api_runner.py
--
diff --git a/sdks/python/apache_beam/runners/portability/fn_api_runner.py 
b/sdks/python/apache_beam/runners/portability/fn_api_runner.py
index 7c0c06f..c9b3d9a 100644
--- a/sdks/python/apache_beam/runners/portability/fn_api_runner.py
+++ b/sdks/python/apache_beam/runners/portability/fn_api_runner.py
@@ -122,7 +122,7 @@ OLDE_SOURCE_SPLITTABLE_DOFN_DATA = pickler.dumps(
 class _GroupingBuffer(object):
   """Used to accumulate groupded (shuffled) results."""
   def __init__(self, pre_grouped_coder, post_grouped_coder):
-self._key_coder = pre_grouped_coder.value_coder().key_coder()
+self._key_coder = pre_grouped_coder.key_coder()
 self._pre_grouped_coder = pre_grouped_coder
 self._post_grouped_coder = post_grouped_coder
 self._table = collections.defaultdict(list)
@@ -249,13 +249,80 @@ class 
FnApiRunner(maptask_executor_runner.MapTaskExecutorRunner):
 
 # Now define the "optimization" phases.
 
+safe_coders = {}
+
 def expand_gbk(stages):
   """Transforms each GBK into a write followed by a read.
   """
+  good_coder_urns = set(beam.coders.Coder._known_urns.keys()) - set([
+  urns.PICKLED_CODER])
+  coders = pipeline_components.coders
+
+  for coder_id, coder_proto in coders.items():
+if coder_proto.spec.spec.urn == urns.BYTES_CODER:
+  bytes_coder_id = coder_id
+  break
+  else:
+bytes_coder_id = unique_name(coders, 'bytes_coder')
+pipeline_components.coders[bytes_coder_id].CopyFrom(
+beam.coders.BytesCoder().to_runner_api(None))
+
+  coder_substitutions = {}
+
+  def wrap_unknown_coders(coder_id, with_bytes):
+if (coder_id, with_bytes) not in coder_substitutions:
+  wrapped_coder_id = None
+  coder_proto = coders[coder_id]
+  if coder_proto.spec.spec.urn == urns.LENGTH_PREFIX_CODER:
+coder_substitutions[coder_id, with_bytes] = (
+bytes_coder_id if with_bytes else coder_id)
+  elif coder_proto.spec.spec.urn in good_coder_urns:
+wrapped_components = [wrap_unknown_coders(c, with_bytes)
+  for c in coder_proto.component_coder_ids]
+if wrapped_components == list(coder_proto.component_coder_ids):
+  # Use as is.
+  coder_substitutions[coder_id, with_bytes] = coder_id
+else:
+  wrapped_coder_id = unique_name(
+  coders,
+  coder_id + ("_bytes" if with_bytes else "_len_prefix"))
+  coders[wrapped_coder_id].CopyFrom(coder_proto)
+  coders[wrapped_coder_id].component_coder_ids[:] = [
+  wrap_unknown_coders(c, with_bytes)
+  for c in coder_proto.component_coder_ids]
+  coder_substitutions[coder_id, with_bytes] = wrapped_coder_id
+  else:
+# Not a known coder.
+if with_bytes:
+  

[GitHub] beam pull request #3769: Initialize the Coder in DecodeAndEmitDoFn

2017-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3769


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] beam git commit: Initialize the Coder in DecodeAndEmitDoFn

2017-08-28 Thread lcwik
Initialize the Coder in DecodeAndEmitDoFn

This closes #3769


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/e6d5e088
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/e6d5e088
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/e6d5e088

Branch: refs/heads/master
Commit: e6d5e0887643f341ce59d316d381889eb975dd39
Parents: ba5c407 0cf4543
Author: Luke Cwik 
Authored: Mon Aug 28 10:01:38 2017 -0700
Committer: Luke Cwik 
Committed: Mon Aug 28 10:01:38 2017 -0700

--
 .../main/java/org/apache/beam/runners/dataflow/DataflowRunner.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--




[1/2] beam git commit: Initialize the Coder in DecodeAndEmitDoFn

2017-08-28 Thread lcwik
Repository: beam
Updated Branches:
  refs/heads/master ba5c4071d -> e6d5e0887


Initialize the Coder in DecodeAndEmitDoFn

Ensure that the coder is available before it is used


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/0cf45438
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/0cf45438
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/0cf45438

Branch: refs/heads/master
Commit: 0cf454389129fbbe43d03ac3b26368e6d477d126
Parents: ba5c407
Author: Thomas Groh 
Authored: Fri Aug 25 16:58:31 2017 -0700
Committer: Luke Cwik 
Committed: Mon Aug 28 09:59:03 2017 -0700

--
 .../main/java/org/apache/beam/runners/dataflow/DataflowRunner.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/0cf45438/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
--
diff --git 
a/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
 
b/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
index 496681e..afccfca 100644
--- 
a/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
+++ 
b/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
@@ -1102,7 +1102,7 @@ public class DataflowRunner extends 
PipelineRunner {
   @ProcessElement
   public void processElement(ProcessContext context) throws IOException {
 for (byte[] element : elements) {
-  context.output(CoderUtils.decodeFromByteArray(coder, element));
+  context.output(CoderUtils.decodeFromByteArray(getCoder(), element));
 }
   }
 }



[jira] [Commented] (BEAM-1347) Basic Java harness capable of understanding process bundle tasks and sending data over the Fn Api

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144023#comment-16144023
 ] 

ASF GitHub Bot commented on BEAM-1347:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3760


> Basic Java harness capable of understanding process bundle tasks and sending 
> data over the Fn Api
> -
>
> Key: BEAM-1347
> URL: https://issues.apache.org/jira/browse/BEAM-1347
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>
> Create a basic Java harness capable of understanding process bundle requests 
> and able to stream data over the Fn Api.
> Overview: https://s.apache.org/beam-fn-api



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[2/2] beam git commit: [BEAM-1347] Add a BagUserState implementation over the BeamFnStateClient

2017-08-28 Thread lcwik
[BEAM-1347] Add a BagUserState implementation over the BeamFnStateClient

This closes #3760


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/ba5c4071
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/ba5c4071
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/ba5c4071

Branch: refs/heads/master
Commit: ba5c4071dfa1831262bc01aa33f06ac0e5561484
Parents: f634aec 8d36a26
Author: Luke Cwik 
Authored: Mon Aug 28 09:54:38 2017 -0700
Committer: Luke Cwik 
Committed: Mon Aug 28 09:54:38 2017 -0700

--
 .../beam/fn/harness/state/BagUserState.java | 121 +++
 .../state/LazyCachingIteratorToIterable.java|  72 +++
 .../beam/fn/harness/state/BagUserStateTest.java | 106 
 .../fn/harness/state/FakeBeamFnStateClient.java | 110 +
 .../LazyCachingIteratorToIterableTest.java  |  76 
 5 files changed, 485 insertions(+)
--




[1/2] beam git commit: [BEAM-1347] Add a BagUserState implementation over the BeamFnStateClient

2017-08-28 Thread lcwik
Repository: beam
Updated Branches:
  refs/heads/master f634aecbb -> ba5c4071d


[BEAM-1347] Add a BagUserState implementation over the BeamFnStateClient


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/8d36a261
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/8d36a261
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/8d36a261

Branch: refs/heads/master
Commit: 8d36a261d4e8c6569e9036a27d45c00daccd908b
Parents: 20d88db
Author: Luke Cwik 
Authored: Thu Aug 24 18:34:47 2017 -0700
Committer: Luke Cwik 
Committed: Fri Aug 25 08:53:27 2017 -0700

--
 .../beam/fn/harness/state/BagUserState.java | 121 +++
 .../state/LazyCachingIteratorToIterable.java|  72 +++
 .../beam/fn/harness/state/BagUserStateTest.java | 106 
 .../fn/harness/state/FakeBeamFnStateClient.java | 110 +
 .../LazyCachingIteratorToIterableTest.java  |  76 
 5 files changed, 485 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/8d36a261/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java
--
diff --git 
a/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java
 
b/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java
new file mode 100644
index 000..2d7f0c8
--- /dev/null
+++ 
b/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.fn.harness.state;
+
+import static com.google.common.base.Preconditions.checkState;
+
+import com.google.common.collect.Iterables;
+import com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.CompletableFuture;
+import java.util.function.Supplier;
+import org.apache.beam.fn.harness.stream.DataStreams;
+import org.apache.beam.fn.v1.BeamFnApi.StateAppendRequest;
+import org.apache.beam.fn.v1.BeamFnApi.StateClearRequest;
+import org.apache.beam.fn.v1.BeamFnApi.StateRequest.Builder;
+import org.apache.beam.sdk.coders.Coder;
+
+/**
+ * An implementation of a bag user state that utilizes the Beam Fn State API 
to fetch, clear
+ * and persist values.
+ *
+ * Calling {@link #asyncClose()} schedules any required persistence 
changes. This object should
+ * no longer be used after it is closed.
+ *
+ * TODO: Move to an async persist model where persistence is signalled 
based upon cache
+ * memory pressure and its need to flush.
+ *
+ * TODO: Support block level caching and prefetch.
+ */
+public class BagUserState {
+  private final BeamFnStateClient beamFnStateClient;
+  private final String stateId;
+  private final Coder coder;
+  private final Supplier partialRequestSupplier;
+  private Iterable oldValues;
+  private ArrayList newValues;
+  private List unmodifiableNewValues;
+  private boolean isClosed;
+
+  public BagUserState(
+  BeamFnStateClient beamFnStateClient,
+  String stateId,
+  Coder coder,
+  Supplier partialRequestSupplier) {
+this.beamFnStateClient = beamFnStateClient;
+this.stateId = stateId;
+this.coder = coder;
+this.partialRequestSupplier = partialRequestSupplier;
+this.oldValues = new LazyCachingIteratorToIterable<>(
+new DataStreams.DataStreamDecoder(coder,
+DataStreams.inbound(
+StateFetchingIterators.usingPartialRequestWithStateKey(
+beamFnStateClient,
+partialRequestSupplier;
+this.newValues = new ArrayList<>();
+this.unmodifiableNewValues = Collections.unmodifiableList(newValues);
+  }
+
+  public Iterable get() {
+checkState(!isClosed,
+"Bag user state is no longer usable because it is closed for %s", 
stateId);
+// If we were cleared we should disregard old values.
+if (oldValues == null) {
+   

Jenkins build is back to stable : beam_PostCommit_Java_ValidatesRunner_Spark #2933

2017-08-28 Thread Apache Jenkins Server
See 




[GitHub] beam pull request #3773: Small tweak to View.asList javadoc

2017-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3773


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] beam git commit: Small tweak to View.asList javadoc

2017-08-28 Thread lcwik
Small tweak to View.asList javadoc

This closes #3773


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/f634aecb
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/f634aecb
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/f634aecb

Branch: refs/heads/master
Commit: f634aecbb1271b5a07557ab879c3858dec1345ed
Parents: d50c964 dba5e5c
Author: Luke Cwik 
Authored: Mon Aug 28 09:18:27 2017 -0700
Committer: Luke Cwik 
Committed: Mon Aug 28 09:18:27 2017 -0700

--
 .../core/src/main/java/org/apache/beam/sdk/transforms/View.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--




[1/2] beam git commit: Small tweak to View.asList javadoc

2017-08-28 Thread lcwik
Repository: beam
Updated Branches:
  refs/heads/master d50c9644e -> f634aecbb


Small tweak to View.asList javadoc

This may help clarify

https://lists.apache.org/thread.html/cd9bd1ae4b6945cd78e04b3baa7628bd43071c443a752acbb83d388d@%3Cdev.beam.apache.org%3E


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/dba5e5ca
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/dba5e5ca
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/dba5e5ca

Branch: refs/heads/master
Commit: dba5e5ca3779f31e407e18a7d22915491b071fe9
Parents: d50c964
Author: wtanaka.com 
Authored: Sun Aug 27 10:23:21 2017 -1000
Committer: Luke Cwik 
Committed: Mon Aug 28 09:18:05 2017 -0700

--
 .../core/src/main/java/org/apache/beam/sdk/transforms/View.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/dba5e5ca/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/View.java
--
diff --git 
a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/View.java 
b/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/View.java
index f6f3af5..e463d46 100644
--- a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/View.java
+++ b/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/View.java
@@ -174,7 +174,7 @@ public class View {
* {@link PCollectionView} mapping each window to a {@link List} containing
* all of the elements in the window.
*
-   * The resulting list is required to fit in memory.
+   * Unlike with {@link #asIterable}, the resulting list is required to fit 
in memory.
*/
   public static  AsList asList() {
 return new AsList<>();



Build failed in Jenkins: beam_PostCommit_Python_Verify #3011

2017-08-28 Thread Apache Jenkins Server
See 


--
[...truncated 619.06 KB...]
  File was already downloaded /tmp/dataflow-requirements-cache/mock-2.0.0.tar.gz
Collecting setuptools (from pyhamcrest->-r postcommit_requirements.txt (line 1))
  File was already downloaded 
/tmp/dataflow-requirements-cache/setuptools-36.3.0.zip
test_as_list_and_as_dict_side_inputs 
(apache_beam.transforms.sideinputs_test.SideInputsTest) ... ok
:135:
 UserWarning: Using fallback coder for typehint: Union[Tuple[NoneType, 
Tuple[Any, NoneType]], Tuple[NoneType, Tuple[Any, Tuple[int, str.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
Collecting six (from pyhamcrest->-r postcommit_requirements.txt (line 1))
:135:
 UserWarning: Using fallback coder for typehint: Union[Tuple[Any, NoneType], 
Tuple[Any, Tuple[int, str]]].
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
  File was already downloaded /tmp/dataflow-requirements-cache/six-1.10.0.tar.gz
Collecting funcsigs>=1 (from mock->-r postcommit_requirements.txt (line 2))
  File was already downloaded 
/tmp/dataflow-requirements-cache/funcsigs-1.0.2.tar.gz
Collecting pbr>=0.11 (from mock->-r postcommit_requirements.txt (line 2))
  File was already downloaded /tmp/dataflow-requirements-cache/pbr-3.1.1.tar.gz
DEPRECATION: pip install --download has been deprecated and will be removed in 
the future. Pip now has a download command that should be used instead.
Collecting pyhamcrest (from -r postcommit_requirements.txt (line 1))
  File was already downloaded 
/tmp/dataflow-requirements-cache/PyHamcrest-1.9.0.tar.gz
Successfully downloaded pyhamcrest mock setuptools six funcsigs pbr
Collecting mock (from -r postcommit_requirements.txt (line 2))
  File was already downloaded /tmp/dataflow-requirements-cache/mock-2.0.0.tar.gz
Collecting setuptools (from pyhamcrest->-r postcommit_requirements.txt (line 1))
  File was already downloaded 
/tmp/dataflow-requirements-cache/setuptools-36.3.0.zip
Collecting six (from pyhamcrest->-r postcommit_requirements.txt (line 1))
  File was already downloaded /tmp/dataflow-requirements-cache/six-1.10.0.tar.gz
Collecting funcsigs>=1 (from mock->-r postcommit_requirements.txt (line 2))
  File was already downloaded 
/tmp/dataflow-requirements-cache/funcsigs-1.0.2.tar.gz
Collecting pbr>=0.11 (from mock->-r postcommit_requirements.txt (line 2))
  File was already downloaded /tmp/dataflow-requirements-cache/pbr-3.1.1.tar.gz
Successfully downloaded pyhamcrest mock setuptools six funcsigs pbr
:135:
 UserWarning: Using fallback coder for typehint: Union[Tuple[NoneType, 
Tuple[Any, Any]], Tuple[NoneType, Tuple[Any, NoneType]]].
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
test_as_singleton_without_unique_labels 
(apache_beam.transforms.sideinputs_test.SideInputsTest) ... ok
:135:
 UserWarning: Using fallback coder for typehint: Union[Tuple[Any, Any], 
Tuple[Any, NoneType]].
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
DEPRECATION: pip install --download has been deprecated and will be removed in 
the future. Pip now has a download command that should be used instead.
Collecting pyhamcrest (from -r postcommit_requirements.txt (line 1))
  File was already downloaded 
/tmp/dataflow-requirements-cache/PyHamcrest-1.9.0.tar.gz
Collecting mock (from -r postcommit_requirements.txt (line 2))
  File was already downloaded /tmp/dataflow-requirements-cache/mock-2.0.0.tar.gz
Collecting setuptools (from pyhamcrest->-r postcommit_requirements.txt (line 1))
  File was already downloaded 
/tmp/dataflow-requirements-cache/setuptools-36.3.0.zip
Collecting six (from pyhamcrest->-r postcommit_requirements.txt (line 1))
  File was already downloaded /tmp/dataflow-requirements-cache/six-1.10.0.tar.gz
Collecting funcsigs>=1 (from mock->-r postcommit_requirements.txt (line 2))
  File was already downloaded 
/tmp/dataflow-requirements-cache/funcsigs-1.0.2.tar.gz
Collecting pbr>=0.11 (from mock->-r postcommit_requirements.txt (line 2))
  File was already downloaded /tmp/dataflow-requirements-cache/pbr-3.1.1.tar.gz
Successfully downloaded pyhamcrest mock setuptools six funcsigs pbr
test_as_singleton_with_different_defaults 
(apache_beam.transforms.sideinputs_test.SideInputsTest) ... ok
:135:
 UserWarning: Using fallback coder for typehint: Union[Tuple[NoneType, 
Tuple[Any, Any]], 

[jira] [Commented] (BEAM-2802) TextIO should allow specifying a custom delimiter

2017-08-28 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143916#comment-16143916
 ] 

Eugene Kirpichov commented on BEAM-2802:


Hmm, I have a hard time seeing why somebody would think it's a good idea to 
store their data in a file format like this, which is not human-readable and 
has no library support, especially the 1-line text file. But oh well.

This feature feels very exotic to me (I've never heard of file formats like 
this), and I'm still hesitant to add it to TextIO directly, because this is no 
longer a text file in the conventional sense. Would you consider creating a 
separate IO for this? Or, are such file formats sufficiently common to merit 
being included in the Beam SDK at all? (The other option would be for you to 
develop it and ship it to your clients separately from the Beam SDK.)
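
For reference, the requested splitting behaviour, sketched in plain Python (the 
file name and the delimiter string are assumptions for illustration):

{code}
# A record delimiter other than \r or \n; records themselves may span lines.
delimiter = '|'
with open('records.txt') as f:
    records = f.read().split(delimiter)
print(len(records))
{code}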

> TextIO should allow specifying a custom delimiter
> -
>
> Key: BEAM-2802
> URL: https://issues.apache.org/jira/browse/BEAM-2802
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Reporter: Etienne Chauchot
>Assignee: Etienne Chauchot
>Priority: Minor
>
> Currently TextIO uses {{\r}}, {{\n}}, or {{\r\n}} (or a mix of these) to split a 
> text file into PCollection elements. It might happen that a record is spread 
> across more than one line. In that case we should be able to specify a custom 
> record delimiter to be used in place of the default ones.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2806) support View.CreatePCollectionView in FlinkRunner

2017-08-28 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143897#comment-16143897
 ] 

Aljoscha Krettek commented on BEAM-2806:


I tried this on the Flink Runner (both batch and streaming):
{code}
Pipeline p = Pipeline.create(options);

PCollection<String> streamingPCollection = p.apply("src1", Create.of("1", "2"));
PCollection<String> lkpPCollection = p.apply("src2", Create.of("1", "2", "3"));

final PCollectionView<Map<Integer, Iterable<String>>> lkpAsView =
    lkpPCollection
        .apply(WithKeys.of(new SerializableFunction<String, Integer>() {
          @Override
          public Integer apply(String input) {
            return 0;
          }
        }))
        .apply(View.asMultimap());

PCollection<String> ret = streamingPCollection.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext context) {
        String drvRow = context.element();
        Map<Integer, Iterable<String>> key2Rows = context.sideInput(lkpAsView);
        int pageId = Integer.parseInt(drvRow);
        if (key2Rows.get(pageId) != null) {
          System.out.println("Record Pass: " + drvRow);
        }
      }
    }).withSideInputs(lkpAsView)
);

p.run().waitUntilFinish();
{code}

Note that I only replaced {{BeamRecord}} by {{String}}. This seems to work. Are 
you running this on the master branch or some other version? I checked with 
Beam 2.1.0.

> support View.CreatePCollectionView in FlinkRunner
> -
>
> Key: BEAM-2806
> URL: https://issues.apache.org/jira/browse/BEAM-2806
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Xu Mingmin
>Assignee: Aljoscha Krettek
>
> Beam version: 2.2.0-SNAPSHOT
> Here's the code
> {code}
> PCollectionView> rowsView = rightRows
> .apply(View.asMultimap());
> {code}
> And exception when running with {{FlinkRunner}}:
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: The 
> transform View.CreatePCollectionView is currently not supported.
>   at 
> org.apache.beam.runners.flink.FlinkStreamingPipelineTranslator.visitPrimitiveTransform(FlinkStreamingPipelineTranslator.java:113)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:594)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:586)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:586)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:268)
>   at 
> org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:202)
>   at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:440)
>   at 
> org.apache.beam.runners.flink.FlinkPipelineTranslator.translate(FlinkPipelineTranslator.java:38)
>   at 
> org.apache.beam.runners.flink.FlinkStreamingPipelineTranslator.translate(FlinkStreamingPipelineTranslator.java:69)
>   at 
> org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.translate(FlinkPipelineExecutionEnvironment.java:104)
>   at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:113)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2457) Error: "Unable to find registrar for hdfs" - need to prevent/improve error message

2017-08-28 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143846#comment-16143846
 ] 

Aljoscha Krettek commented on BEAM-2457:


Is there any update on this? I have a jar file, built from the quickstart, that 
exhibits this problem. This is the output I get:
{code}
$ java -cp word-count-beam-0.1-DIRECT.jar org.apache.beam.examples.WordCount 
--runner=DirectRunner  --inputFile=hdfs:///tmp/wc-in  
--output=hdfs:///tmp/wc-out
Exception in thread "main" java.lang.IllegalStateException: Unable to find 
registrar for hdfs
at 
org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
at 
org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:517)
at 
org.apache.beam.sdk.io.FileBasedSink.convertToFileResourceIfPossible(FileBasedSink.java:204)
at org.apache.beam.sdk.io.TextIO$Write.to(TextIO.java:296)
at org.apache.beam.examples.WordCount.main(WordCount.java:182)
{code}

This is with Beam 2.1.0, the project was created from the Beam 2.1.0 examples 
archetype. I also tried this with the Flink Runner before and get the same 
results.

On Beam 2.2.0-SNAPSHOT I get this instead:
{code}
$ java -cp word-count-beam-22-0.1-DIRECT.jar org.apache.beam.examples.WordCount 
--runner=DirectRunner  --inputFile=hdfs:///tmp/wc-in  
--output=hdfs:///tmp/wc-out
Aug 28, 2017 2:46:34 PM org.apache.beam.sdk.io.FileBasedSource 
getEstimatedSizeBytes
INFO: Filepattern hdfs:///tmp/wc-in matched 0 files with total size 0
Aug 28, 2017 2:46:34 PM org.apache.beam.sdk.io.FileBasedSource split
INFO: Splitting filepattern hdfs:///tmp/wc-in into bundles of size 0 took 0 ms 
and produced 0 files and 0 bundles
Aug 28, 2017 2:46:35 PM org.apache.beam.sdk.io.WriteFiles 
finalizeForDestinationFillEmptyShards
INFO: Finalizing write operation 
TextWriteOperation{tempDirectory=/home/hadoop/hdfs:/tmp/.temp-beam-2017-08-240_14-46-34-1/,
 windowedWrites=false} for destination null num shards 0.
Aug 28, 2017 2:46:35 PM org.apache.beam.sdk.io.WriteFiles 
finalizeForDestinationFillEmptyShards
INFO: Creating 1 empty output shards in addition to 0 written for a total of 1 
for destination null.
{code}

i.e. it's writing to my local filesystem under the path {{hdfs:}}.
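As a data point, here is a minimal sketch of what wiring HDFS in explicitly looks 
like (assuming {{beam-sdks-java-io-hadoop-file-system}} is on the classpath; the 
namenode address and class name below are illustrative, not from the quickstart):
{code:java}
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class HdfsConfigSketch {
  public static void main(String[] args) {
    // Parse args into options that carry a Hadoop configuration.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

    // fs.defaultFS must use the same scheme ("hdfs") as the file patterns
    // below, otherwise the scheme cannot be resolved to a filesystem.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("hdfs:///tmp/wc-in"))
        .apply(TextIO.write().to("hdfs:///tmp/wc-out"));
    p.run().waitUntilFinish();
  }
}
{code}
If this still fails with "Unable to find registrar for hdfs", the uber jar is the 
usual suspect: shading can overwrite the {{META-INF/services}} entries that the 
{{FileSystems}} ServiceLoader lookup relies on.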

> Error: "Unable to find registrar for hdfs" - need to prevent/improve error 
> message
> --
>
> Key: BEAM-2457
> URL: https://issues.apache.org/jira/browse/BEAM-2457
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Affects Versions: 2.0.0
>Reporter: Stephen Sisk
>Assignee: Flavio Fiszman
>
> I've noticed a number of user reports where jobs are failing with the error 
> message "Unable to find registrar for hdfs": 
> * 
> https://stackoverflow.com/questions/44497662/apache-beamunable-to-find-registrar-for-hdfs/44508533?noredirect=1#comment76026835_44508533
> * 
> https://lists.apache.org/thread.html/144c384e54a141646fcbe854226bb3668da091c5dc7fa2d471626e9b@%3Cuser.beam.apache.org%3E
> * 
> https://lists.apache.org/thread.html/e4d5ac744367f9d036a1f776bba31b9c4fe377d8f11a4b530be9f829@%3Cuser.beam.apache.org%3E
>  
> This isn't too many reports, but it is the only time I can recall so many 
> users reporting the same error message in such a short amount of time. 
> We believe the problem is one of two things: 
> 1) bad uber jar creation
> 2) incorrect HDFS configuration
> However, it's highly possible this could have some other root cause. 
> It seems like it'd be useful to:
> 1) Follow up with the above reports to see if they've resolved the issue, and 
> if so what fixed it. There may be another root cause out there.
> 2) Improve the error message to include more information about how to resolve 
> it
> 3) See if we can improve detection of the error cases to give more specific 
> information (specifically, if HDFS is misconfigured, can we detect that 
> somehow and tell the user exactly that?)
> 4) update documentation



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_MavenInstall #4661

2017-08-28 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Dataflow #3855

2017-08-28 Thread Apache Jenkins Server
See 




Jenkins build became unstable: beam_PostCommit_Java_ValidatesRunner_Spark #2932

2017-08-28 Thread Apache Jenkins Server
See 




[jira] [Updated] (BEAM-2812) Dropped windows counters / log prints no longer working

2017-08-28 Thread Aviem Zur (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aviem Zur updated BEAM-2812:

Description: 
In https://github.com/apache/beam/pull/2838 aggregators were removed from Spark 
runner, this caused regression around dropped windows counters and logs.

{{CounterCell}} instances are created ad hoc instead of using the {{Metrics}} 
class static factory methods: 
[SparkGroupAlsoByWindowViaWindowSet.java#L213-L219|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L213-L219]
The context in which the metrics are reported isn't taken into account, and since 
these counters are passed to a lazily evaluated iterator 
[SparkGroupAlsoByWindowViaWindowSet.java#L221-L223|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L221-L223]
 the subsequent code that inspects the counters always sees them immediately 
after initialization, before they are populated. As a result these prints will 
never happen: the conditional statements do not check the right counters 
[SparkGroupAlsoByWindowViaWindowSet.java#L323-L333|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L323-L333].
What we want is these counts exposed as metrics as well as logs.
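For illustration, a hedged sketch (counter name and namespace are illustrative) of 
reporting such a count through the {{Metrics}} factory methods instead:
{code:java}
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;

// A counter obtained from the Metrics factory is resolved against the
// current step's metrics container when inc() is called, so the reporting
// context is picked up automatically, even inside a lazily evaluated
// iterator.
private static final Counter droppedDueToLateness =
    Metrics.counter("SparkGroupAlsoByWindowViaWindowSet", "droppedDueToLateness");

// ...at the point where an expired window is dropped:
droppedDueToLateness.inc();
{code}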

Additionally, {{org.apache.beam.runners.core.LateDataUtils#dropExpiredWindows}} 
now takes a {{CounterCell}} as a parameter, even though that class belongs to the 
metrics implementation and should generally not be used elsewhere (this is also 
mentioned in its Javadoc). We should look into changing this method to use 
something else, and perhaps make {{CounterCell}} and similar classes package 
private (and change runner code which uses these to be in the same package).

  was:
In https://github.com/apache/beam/pull/2838 aggregators were removed from Spark 
runner, this caused regression around dropped windows counters and logs.

{{CounterCell}} instances are created ad hoc instead of using the {{Metrics}} 
class static factory methods: 
[SparkGroupAlsoByWindowViaWindowSet.java#L213-L219|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L213-L219]
Context of where the metrics are reported isn't taken into account, and since 
these counters are being passed to a lazily evaluated iterator 
[SparkGroupAlsoByWindowViaWindowSet.java#L221-L223|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L221-L223]
 the subsequent code which looks at the counters is always looking at these 
counters immediately after initialization, before they are populated, so these 
prints will never happen since the conditional statements do not check on the 
right counters 
[SparkGroupAlsoByWindowViaWindowSet.java#L323-L333|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L323-L333].

Additionally, {{org.apache.beam.runners.core.LateDataUtils#dropExpiredWindows}} 
now takes a {{CounterCell}} as a parameter, which is a class for metrics 
implementation and should generally not be used elsewhere (this is also 
mentioned in its Javadoc), we should look into changing this method to use 
something else and perhaps make {{CounterCell}} and similar classes package 
private (And change runner code which uses these to be in the same package).


> Dropped windows counters / log prints no longer working
> ---
>
> Key: BEAM-2812
> URL: https://issues.apache.org/jira/browse/BEAM-2812
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Aviem Zur
>Assignee: Amit Sela
>
> In https://github.com/apache/beam/pull/2838 aggregators were removed from 
> Spark runner, this caused regression around dropped windows counters and logs.
> {{CounterCell}} instances are created ad hoc instead of using the {{Metrics}} 
> class static factory methods: 
> [SparkGroupAlsoByWindowViaWindowSet.java#L213-L219|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L213-L219]
> Context of where the metrics are reported isn't taken into account, and since 
> these counters are being passed to a lazily evaluated iterator 
> 

[jira] [Created] (BEAM-2812) Dropped windows counters / log prints no longer working

2017-08-28 Thread Aviem Zur (JIRA)
Aviem Zur created BEAM-2812:
---

 Summary: Dropped windows counters / log prints no longer working
 Key: BEAM-2812
 URL: https://issues.apache.org/jira/browse/BEAM-2812
 Project: Beam
  Issue Type: Bug
  Components: runner-spark
Reporter: Aviem Zur
Assignee: Amit Sela


In https://github.com/apache/beam/pull/2838 aggregators were removed from Spark 
runner, this caused regression around dropped windows counters and logs.

{{CounterCell}} instances are created ad hoc instead of using the {{Metrics}} 
class static factory methods: 
[SparkGroupAlsoByWindowViaWindowSet.java#L213-L219|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L213-L219]
The context in which the metrics are reported isn't taken into account, and since 
these counters are passed to a lazily evaluated iterator 
[SparkGroupAlsoByWindowViaWindowSet.java#L221-L223|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L221-L223]
 the subsequent code that inspects the counters always sees them immediately 
after initialization, before they are populated. As a result these prints will 
never happen: the conditional statements do not check the right counters 
[SparkGroupAlsoByWindowViaWindowSet.java#L323-L333|https://github.com/apache/beam/blob/v2.1.0/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L323-L333].

Additionally, {{org.apache.beam.runners.core.LateDataUtils#dropExpiredWindows}} 
now takes a {{CounterCell}} as a parameter, even though that class belongs to the 
metrics implementation and should generally not be used elsewhere (this is also 
mentioned in its Javadoc). We should look into changing this method to use 
something else, and perhaps make {{CounterCell}} and similar classes package 
private (and change runner code which uses these to be in the same package).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2803) JdbcIO read is very slow when query return a lot of rows

2017-08-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/BEAM-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143489#comment-16143489
 ] 

Jérémie Vexiau commented on BEAM-2803:
--

I think side input is also a good solution, especially for the JDBC read, because 
we don't know the size of the bounded collection.
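
For the GroupByKey hotspot described below, a hedged sketch of another option 
(NUM_SHARDS is an illustrative parameter, not from the report): bound the random 
key space so the shuffle groups rows into a few hundred buckets instead of one 
key per row. This is a drop-in variant of the reshuffle fragment quoted below:
{code:java}
// Variant of the reshuffle fragment: random.nextInt(NUM_SHARDS) instead of
// random.nextInt(), so GroupByKey sees at most NUM_SHARDS distinct keys.
.apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
  private static final int NUM_SHARDS = 512; // tune to the worker count
  private transient Random random;

  @Setup
  public void setup() {
    random = new Random();
  }

  @ProcessElement
  public void processElement(ProcessContext context) {
    context.output(KV.of(random.nextInt(NUM_SHARDS), context.element()));
  }
}))
.apply(GroupByKey.<Integer, T>create())
{code}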

> JdbcIO read is very slow when query return a lot of rows
> 
>
> Key: BEAM-2803
> URL: https://issues.apache.org/jira/browse/BEAM-2803
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Affects Versions: Not applicable
>Reporter: Jérémie Vexiau
>Assignee: Jean-Baptiste Onofré
>  Labels: performance
> Fix For: Not applicable
>
> Attachments: test1500K.png, test1M.png, test2M.jpg, test500k.png
>
>
> Hi,
> I'm using JdbcIO reader in batch mode with the postgresql driver.
> my select query return more than 5 Millions rows
> using cursors with Statement.setFetchSize().
> these ParDo are OK :
> {code:java}
>   .apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
>   .apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
> private Random random;
> @Setup
> public void setup() {
>   random = new Random();
> }
> @ProcessElement
> public void processElement(ProcessContext context) {
>   context.output(KV.of(random.nextInt(), context.element()));
> }
>   }))
> {code}
> but the reshuffle is very, very slow. 
> It must be the GroupByKey with more than 5 million keys.
> {code:java}
> .apply(GroupByKey.create())
> {code}
> Is there a way to optimize the reshuffle, or use another method to prevent 
> fusion?
> thanks in advance,
> edit: 
> I added some tests. 
> I use google dataflow as runner, with 1 worker, 2 max, and workerMachineType 
> n1-standard-2
> and  autoscalingAlgorithm THROUGHPUT_BASED
> First one: the query returns 500 000 results: 
> !test500k.png!
> as we can see,
>  parDo(Read) is about 1300 r/s
> groupByKey is about 1080 r/s
> 2nd: the query returns 1 000 000 results 
> !test1M.png!
> parDo(read) => 1480 r/s
> groupByKey => 634 r/s
> 3rd: the query returns 1 500 000 results
> !test1500K.png!
> parDo(read) => 1700 r/s
> groupByKey => 565 r/s
> 4th: the query returns 2 000 000 results
> !test2M.jpg!
> parDo(read) => 1485 r/s
> groupByKey => 537 r/s
> As we can see, the groupByKey rate decreases as the number of records grows.
> ps: the 2nd worker starts just after ParDo(read) succeeds



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is unstable: beam_PostCommit_Java_MavenInstall #4660

2017-08-28 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Dataflow #3854

2017-08-28 Thread Apache Jenkins Server
See 




[jira] [Created] (BEAM-2811) Complete the removal of aggregators from Spark runner.

2017-08-28 Thread Aviem Zur (JIRA)
Aviem Zur created BEAM-2811:
---

 Summary: Complete the removal of aggregators from Spark runner.
 Key: BEAM-2811
 URL: https://issues.apache.org/jira/browse/BEAM-2811
 Project: Beam
  Issue Type: Task
  Components: runner-spark
Reporter: Aviem Zur
Assignee: Amit Sela


In https://issues.apache.org/jira/browse/BEAM-1765 aggregator usage was removed 
from Spark runner in order to comply with the removal of aggregators from the 
SDK.

However, not all classes / usages of aggregators within the Spark runner were 
removed.

This task is to complete the removal of all aggregator code from Spark runner, 
including, but not limited to:

{code}
org.apache.beam.runners.spark.aggregators.AggAccumParam
org.apache.beam.runners.spark.aggregators.AggregatorsAccumulator
org.apache.beam.runners.spark.aggregators.NamedAggregators
org.apache.beam.runners.spark.metrics.AggregatorMetricSource
org.apache.beam.runners.spark.metrics.AggregatorMetric
{code}

In addition, remove packages named {{aggregators}} and move any remaining 
relevant classes to the {{org.apache.beam.runners.spark.metrics}} package.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2760) Disable testMergingCustomWindows* validatesRunner tests in Gearpump runner

2017-08-28 Thread Etienne Chauchot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143459#comment-16143459
 ] 

Etienne Chauchot commented on BEAM-2760:


Hi [~vectorijk]. This is true. That is why the custom window merging test is 
disabled in the Spark runner (see 
https://github.com/apache/beam/blob/d50c9644e95fdd99fcf12c1afdf2822fe4b2b3e2/runners/spark/pom.xml#L80)
 and why the https://issues.apache.org/jira/browse/BEAM-2499 ticket was created.
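
For reference, a minimal sketch of how such a test is tagged (class name and body 
are illustrative); a runner that cannot merge custom windows then lists 
{{UsesCustomWindowMerging}} under its surefire {{excludedGroups}}, as the Spark 
runner does at the link above:
{code:java}
import org.apache.beam.sdk.testing.UsesCustomWindowMerging;
import org.apache.beam.sdk.testing.ValidatesRunner;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class CustomWindowMergingSketch {
  @Test
  @Category({ValidatesRunner.class, UsesCustomWindowMerging.class})
  public void testMergingCustomWindows() {
    // Pipeline assertions elided; the category annotation is the point.
    // Runners exclude this JUnit category until custom window merging
    // is supported.
  }
}
{code}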

> Disable testMergingCustomWindows* validatesRunner tests in Gearpump runner
> --
>
> Key: BEAM-2760
> URL: https://issues.apache.org/jira/browse/BEAM-2760
> Project: Beam
>  Issue Type: Test
>  Components: runner-gearpump
>Reporter: Etienne Chauchot
>Assignee: Etienne Chauchot
> Fix For: Not applicable
>
>
> Disable these tests until it is supported by the runner



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

