[jira] [Commented] (BEAM-636) Add Char, Byte to TypeDescriptors

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507144#comment-15507144
 ] 

ASF GitHub Bot commented on BEAM-636:
-

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/966


> Add Char, Byte to TypeDescriptors
> -
>
> Key: BEAM-636
> URL: https://issues.apache.org/jira/browse/BEAM-636
> Project: Beam
>  Issue Type: Bug
>Reporter: Jesse Anderson
>Assignee: Jesse Anderson
>
> The TypeDescriptors class is missing the Char and Byte TypeDescriptor methods.
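To illustrate, a minimal sketch of what the missing factory methods could look like, mirroring existing ones such as strings(). This is a sketch against the 2016-era TypeDescriptor API, not the exact patch from #966, and TypeDescriptorsSketch is an illustrative class name:

import org.apache.beam.sdk.values.TypeDescriptor;

public class TypeDescriptorsSketch {
  // The anonymous subclass captures Character as a reified type token.
  public static TypeDescriptor<Character> characters() {
    return new TypeDescriptor<Character>() {};
  }

  // Likewise for Byte.
  public static TypeDescriptor<Byte> bytes() {
    return new TypeDescriptor<Byte>() {};
  }
}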





[GitHub] incubator-beam pull request #966: [BEAM-636] Add Char, Byte to TypeDescripto...

2016-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/966




[2/2] incubator-beam git commit: Updates lint configurations to ignore generated files.

2016-09-20 Thread robertwb
Updates lint configurations to ignore generated files.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/49c03593
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/49c03593
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/49c03593

Branch: refs/heads/python-sdk
Commit: 49c0359327ba418cfe62ef2291560d1b1867f4e5
Parents: adda163
Author: Chamikara Jayalath 
Authored: Mon Sep 19 22:27:47 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Sep 20 09:10:01 2016 -0700

--
 sdks/python/run_pylint.sh | 28 
 sdks/python/tox.ini   |  3 ---
 2 files changed, 24 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/49c03593/sdks/python/run_pylint.sh
--
diff --git a/sdks/python/run_pylint.sh b/sdks/python/run_pylint.sh
index 6114034..b69ea72 100755
--- a/sdks/python/run_pylint.sh
+++ b/sdks/python/run_pylint.sh
@@ -33,6 +33,17 @@ set -o pipefail
 git remote set-branches --add origin $BASE_BRANCH
 git fetch
 
+# Following generated files are excluded from lint checks.
+EXCLUDED_GENERATED_FILES=(
+"apache_beam/internal/windmill_pb2.py"
+"apache_beam/internal/windmill_service_pb2.py"
+"apache_beam/internal/clients/bigquery/bigquery_v2_client.py"
+"apache_beam/internal/clients/bigquery/bigquery_v2_messages.py"
+"apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py"
+"apache_beam/internal/clients/dataflow/dataflow_v1b3_messages.py"
+"apache_beam/internal/clients/storage/storage_v1_client.py"
+"apache_beam/internal/clients/storage/storage_v1_messages.py")
+
 # Get the name of the files that changed compared to the HEAD of the branch.
 # Use diff-filter to exclude deleted files. (i.e. Do not try to lint files that
 # does not exist any more.) Filter the output to .py files only. Rewrite the
@@ -41,12 +52,21 @@ CHANGED_FILES=$(git diff --name-only --diff-filter=ACMRTUXB 
origin/$BASE_BRANCH
 | { grep ".py$" || true; }  \
 | sed 's/sdks\/python\///g')
 
-if test "$CHANGED_FILES"; then
+FILES_TO_CHECK=""
+for file in $CHANGED_FILES;
+do
+if [[ " ${EXCLUDED_GENERATED_FILES[@]} " =~ " ${file} " ]]; then
+  echo "Excluded file " $file " from lint checks"
+else
+  FILES_TO_CHECK="$FILES_TO_CHECK $file"
+fi
+done
+
+if test "$FILES_TO_CHECK"; then
   echo "Running pylint on changed files:"
-  echo "$CHANGED_FILES"
-  pylint $CHANGED_FILES
+  pylint $FILES_TO_CHECK
   echo "Running pep8 on changed files:"
-  pep8 $CHANGED_FILES
+  pep8 $FILES_TO_CHECK
 else
   echo "Not running pylint. No eligible files."
 fi

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/49c03593/sdks/python/tox.ini
--
diff --git a/sdks/python/tox.ini b/sdks/python/tox.ini
index 5a2572e..20d1961 100644
--- a/sdks/python/tox.ini
+++ b/sdks/python/tox.ini
@@ -23,9 +23,6 @@ envlist = py27
 # pylint does not check the number of blank lines.
 select = E3
 
-# Skip auto generated files (windmill_pb2.py, windmill_service_pb2.py)
-exclude = windmill_pb2.py, windmill_service_pb2.py
-
 [testenv:py27]
 deps=
   pep8



[1/2] incubator-beam git commit: Closes #979

2016-09-20 Thread robertwb
Repository: incubator-beam
Updated Branches:
  refs/heads/python-sdk adda16320 -> b6c7478ff


Closes #979


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/b6c7478f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/b6c7478f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/b6c7478f

Branch: refs/heads/python-sdk
Commit: b6c7478ff0a17b23b3aa603310b2f5254f350392
Parents: adda163 49c0359
Author: Robert Bradshaw 
Authored: Tue Sep 20 09:10:01 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Sep 20 09:10:01 2016 -0700

--
 sdks/python/run_pylint.sh | 28 
 sdks/python/tox.ini   |  3 ---
 2 files changed, 24 insertions(+), 7 deletions(-)
--




[jira] [Commented] (BEAM-625) Make Dataflow Python Materialized PCollection representation more efficient

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507020#comment-15507020
 ] 

ASF GitHub Bot commented on BEAM-625:
-

Github user katsiapis closed the pull request at:

https://github.com/apache/incubator-beam/pull/976


> Make Dataflow Python Materialized PCollection representation more efficient
> ---
>
> Key: BEAM-625
> URL: https://issues.apache.org/jira/browse/BEAM-625
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Konstantinos Katsiapis
>Assignee: Frances Perry
>
> This will be a several-step process which will involve adding better support 
> for compression as well as Avro.





[GitHub] incubator-beam pull request #976: [BEAM-625] Making Dataflow Python Material...

2016-09-20 Thread katsiapis
Github user katsiapis closed the pull request at:

https://github.com/apache/incubator-beam/pull/976


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1175

2016-09-20 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-643) Allow users to specify a custom service account

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507203#comment-15507203
 ] 

ASF GitHub Bot commented on BEAM-643:
-

Github user chamikaramj closed the pull request at:

https://github.com/apache/incubator-beam/pull/979


> Allow users to specify a custom service account
> ---
>
> Key: BEAM-643
> URL: https://issues.apache.org/jira/browse/BEAM-643
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Chamikara Jayalath
>
> Users should be able to specify a custom service account which can be used 
> when creating VMs. This feature is specific to the DataflowRunner, and a 
> corresponding user option will be added to GoogleCloudOptions.





[jira] [Created] (BEAM-646) Get runners out of the apply()

2016-09-20 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-646:


 Summary: Get runners out of the apply()
 Key: BEAM-646
 URL: https://issues.apache.org/jira/browse/BEAM-646
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core
Reporter: Kenneth Knowles
Assignee: Thomas Groh


Right now, the runner intercepts calls to apply() and replaces transforms as we 
go. This means that there is no "original" user graph. For portability and misc 
architectural benefits, we would like to build the original graph first, and 
have the runner override later.

Some runners already work in this manner, but we could integrate it more 
smoothly, with more validation, via some handy APIs on e.g. the Pipeline object.





[jira] [Commented] (BEAM-644) Primitive to shift the watermark while assigning timestamps

2016-09-20 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507379#comment-15507379
 ] 

Ben Chambers commented on BEAM-644:
---

Minor note on "A function from TimestampedElement to new timestamp that 
always falls within D of the original timestamp."

Rather than "within D", I think the requirement is that for an input with 
timestamp t, the output timestamp is >= t+D. This ensures that the output 
timestamp's relation to the output watermark is no later than the input 
timestamp's relation to the input watermark.
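To make that concrete, a worked example under this reading: with D = -1 hour, 
an element with input timestamp t = 12:00 may be re-stamped anywhere at or 
after 11:00; since the output watermark is the input watermark shifted back an 
hour, the re-stamped element remains on time relative to it.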

> Primitive to shift the watermark while assigning timestamps
> ---
>
> Key: BEAM-644
> URL: https://issues.apache.org/jira/browse/BEAM-644
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> There is a general need, especially important in the presence of 
> SplittableDoFn, to be able to assign new timestamps to elements without 
> making them late or droppable.
>  - DoFn.withAllowedTimestampSkew is inadequate, because it simply allows one 
> to produce late data, but does not allow one to shift the watermark so the 
> new data is on-time.
>  - For a SplittableDoFn, one may receive an element such as the name of a log 
> file that contains elements for the day preceding the log file. The timestamp 
> on the filename must currently be the beginning of the log. If such elements 
> are constantly flowing, it may be OK, but since we don't know that the 
> element is coming, the watermark may advance in the absence of data. We need 
> a way to keep it far enough back even in the absence of data holding it back.
> One idea is a new primitive ShiftWatermark / AdjustTimestamps with the 
> following pieces:
>  - A constant duration (positive or negative) D by which to shift the 
> watermark.
>  - A function from TimestampedElement to new timestamp that always falls 
> within D of the original timestamp.
> With this primitive added, outputWithTimestamp and withAllowedTimestampSkew 
> could be removed, simplifying DoFn.
> Alternatively, all of this functionality could be bolted on to DoFn.
> This ticket is not a proposal, but a record of the issue and ideas that were 
> mentioned.





[jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-20 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507410#comment-15507410
 ] 

Amit Sela commented on BEAM-645:


Does this happen with the SDK's example or with the WordCount example in the 
runner (org.apache.beam.runners.spark.examples.WordCount)?
There are issues with validation for HDFS in TextIO; I think they are related 
to IOChannelUtils.

This is actually the SDK not supporting higher-level translation of TextIO, 
meaning you can't simply pass TextIO's properties to the appropriate Spark 
implementation, "sc.textFile()". The file you created "fooled" the validation, 
and then Spark could kick in. The runner's example simply applies 
TextIO.withoutValidation().

Currently, the SDK requires runners to support "Read.Bounded", which is a WIP 
and covered under BEAM-17.

The IOChannelUtils issue is covered in BEAM-59.

I'm not sure this issue isn't already covered from both the runner and the 
SDK. [~eljefe6aa] WDYT? 
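For concreteness, a minimal sketch of the difference being described, assuming 
the 2016-era TextIO API; the HDFS path is hypothetical and "p" is a Pipeline 
created with Spark runner options, assumed in scope:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Skipping the client-side filepattern check defers resolution to execution
// time, where the Spark runner translates the read to sc.textFile().
PCollection<String> lines = p.apply(
    TextIO.Read.from("hdfs:///user/demo/Macbeth.txt").withoutValidation());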

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To work around this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}





[2/2] incubator-beam git commit: This closes #966

2016-09-20 Thread kenn
This closes #966


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/5c23f495
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/5c23f495
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/5c23f495

Branch: refs/heads/master
Commit: 5c23f4954ed04b7e271b84f0f639410e76a6285e
Parents: 9e7ed29 a596835
Author: Kenneth Knowles 
Authored: Tue Sep 20 10:07:21 2016 -0700
Committer: Kenneth Knowles 
Committed: Tue Sep 20 10:07:21 2016 -0700

--
 .../apache/beam/sdk/values/TypeDescriptors.java | 24 
 1 file changed, 24 insertions(+)
--




[jira] [Created] (BEAM-647) Fault-tolerant sideInputs via Broadcast variables.

2016-09-20 Thread Amit Sela (JIRA)
Amit Sela created BEAM-647:
--

 Summary: Fault-tolerant sideInputs via Broadcast variables.
 Key: BEAM-647
 URL: https://issues.apache.org/jira/browse/BEAM-647
 Project: Beam
  Issue Type: Bug
  Components: runner-spark
Reporter: Amit Sela
Assignee: Amit Sela


Following https://github.com/apache/incubator-beam/pull/909 which enables 
checkpointing to recover from failures, sideInputs (being implemented by 
broadcast variables) should be handled in a specific manner as described here: 
http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#accumulators-and-broadcast-variables.

This is a bit more complicated than Aggregators (via Accumulators), as they are 
implemented using a single "aggregating" Accumulator, while a pipeline may 
contain multiple sideInputs.
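A sketch of the pattern the linked guide recommends (the names here are 
illustrative, not Beam runner code): broadcast variables are not restored from 
a checkpoint, so the driver re-creates them lazily through a singleton after 
recovery:

import java.util.Map;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

class SideInputBroadcast {
  private static volatile Broadcast<Map<String, Double>> instance;

  // Lazily (re-)broadcast the side input; after recovery from a checkpoint the
  // static field is null again, so a fresh broadcast is created on the driver.
  static Broadcast<Map<String, Double>> of(JavaSparkContext sc,
                                           Map<String, Double> sideInput) {
    if (instance == null) {
      synchronized (SideInputBroadcast.class) {
        if (instance == null) {
          instance = sc.broadcast(sideInput);
        }
      }
    }
    return instance;
  }
}

One such lazily re-created holder per sideInput is what makes this more 
involved than the single aggregating Accumulator.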





[jira] [Created] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-20 Thread Jesse Anderson (JIRA)
Jesse Anderson created BEAM-645:
---

 Summary: Running Wordcount in Spark Checks Locally and Outputs in 
HDFS
 Key: BEAM-645
 URL: https://issues.apache.org/jira/browse/BEAM-645
 Project: Beam
  Issue Type: Bug
  Components: runner-spark
Affects Versions: 0.3.0-incubating
Reporter: Jesse Anderson
Assignee: Amit Sela


When running the Wordcount example with the Spark runner, the Spark runner uses 
the input file in HDFS. When the program performs its startup checks, it looks 
for the file in the local filesystem.

To work around this issue, you have to create a file in the local filesystem and 
put the actual file in HDFS.

Here is the stack trace when the file doesn't exist in the local filesystem:
{quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
find any files matching Macbeth.txt
at 
org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
at 
org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
at org.apache.beam.examples.WordCount.main(WordCount.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{quote}





[GitHub] incubator-beam pull request #965: Removed unnecessary throttling of rename p...

2016-09-20 Thread mdvorsky
Github user mdvorsky closed the pull request at:

https://github.com/apache/incubator-beam/pull/965




[jira] [Created] (BEAM-654) When and how can merging windows "shrink" or "grow"?

2016-09-20 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-654:


 Summary: When and how can merging windows "shrink" or "grow"?
 Key: BEAM-654
 URL: https://issues.apache.org/jira/browse/BEAM-654
 Project: Beam
  Issue Type: New Feature
  Components: beam-model
Reporter: Kenneth Knowles
Assignee: Kenneth Knowles


The primary example of a merging window today is {{Sessions}} by gap duration, 
in which the merged window is the interval enclosure / span of the windows 
being merged.

However, another reasonable abstract use case is a session identified by id 
with an explicit end event. We might consider that the session ends with no gap 
duration after the end event. In this case, the merged window may be smaller 
than the enclosure of the sub-windows. Sometimes this has been referred to as 
"merging shrinks the window".

Perhaps the only requirement is that the merged window contains the timestamps 
of the data therein, but we should document this clearly. The current spec is 
["Does whatever merging is 
necessary"|https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/WindowFn.java#L106]

There are repercussions for triggers, some documented in the [trigger design 
doc|https://s.apache.org/beam-triggers]: With nonzero allowed lateness, 
{{Sessions}} by gap duration can switch a trigger from ON_TIME or LATE behavior 
back to speculative behavior, or cause another ON_TIME firing. Conversely, 
sessions with abrupt termination/shrinking may have that behavior _as well as_ 
an ON_TIME and subsequent LATE firings due only to the merging (this already 
works properly).
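To make the enclosure/span behavior concrete, a small sketch of the 
"grow-only" merge that {{Sessions}}-style merging performs (IntervalWindow 
exposes a similar span() helper; this just spells out the arithmetic, and 
MergeSketch is an illustrative name):

import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.joda.time.Instant;

class MergeSketch {
  // Enclosure-style merge: the result covers both inputs, so merging can only
  // grow. A session terminated by an explicit end event might instead produce
  // a merged window smaller than this span, which is the "shrink" case above.
  static IntervalWindow enclosure(IntervalWindow a, IntervalWindow b) {
    Instant start = a.start().isBefore(b.start()) ? a.start() : b.start();
    Instant end = a.end().isAfter(b.end()) ? a.end() : b.end();
    return new IntervalWindow(start, end);
  }
}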





Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1177

2016-09-20 Thread Apache Jenkins Server
See 




[jira] [Created] (BEAM-644) Primitive to shift the watermark while assigning timestamps

2016-09-20 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-644:


 Summary: Primitive to shift the watermark while assigning 
timestamps
 Key: BEAM-644
 URL: https://issues.apache.org/jira/browse/BEAM-644
 Project: Beam
  Issue Type: New Feature
  Components: beam-model
Reporter: Kenneth Knowles
Assignee: Kenneth Knowles


There is a general need, especially important in the presence of 
SplittableDoFn, to be able to assign new timestamps to elements without making 
them late or droppable.

 - DoFn.withAllowedTimestampSkew is inadequate, because it simply allows one to 
produce late data, but does not allow one to shift the watermark so the new 
data is on-time.
 - For a SplittableDoFn, one may receive an element such as the name of a log 
file that contains elements for the day preceding the log file. The timestamp 
on the filename must currently be the beginning of the log. If such elements 
are constantly flowing, it may be OK, but since we don't know that the element 
is coming, the watermark may advance in the absence of data. We need a way to 
keep it far enough back even in the absence of data holding it back.

One idea is a new primitive ShiftWatermark / AdjustTimestamps with the 
following pieces:

 - A constant duration (positive or negative) D by which to shift the watermark.
 - A function from TimestampedElement to new timestamp that always falls 
within D of the original timestamp.

With this primitive added, outputWithTimestamp and withAllowedTimestampSkew 
could be removed, simplifying DoFn.

Alternatively, all of this functionality could be bolted on to DoFn.

This ticket is not a proposal, but a record of the issue and ideas that were 
mentioned.
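Purely to illustrate the shape of the idea (nothing below exists in the SDK; 
ShiftWatermark, withNewTimestamps, and LogRecord are all hypothetical):

// Hold the watermark back by one day (D = -1 day) so that elements extracted
// from a day-long log file can be re-stamped into the past without being late.
PCollection<LogRecord> restamped = logFileContents.apply(
    ShiftWatermark.<LogRecord>of(Duration.standardDays(-1))
        .withNewTimestamps(e -> e.getValue().eventTime()));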





[jira] [Created] (BEAM-652) StateNamespaces.window() potentially unstable / stable encodings for windows

2016-09-20 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-652:


 Summary: StateNamespaces.window() potentially unstable / stable 
encodings for windows
 Key: BEAM-652
 URL: https://issues.apache.org/jira/browse/BEAM-652
 Project: Beam
  Issue Type: New Feature
  Components: sdk-java-core
Reporter: Kenneth Knowles


WindowNamespace#stringKey includes a base64 encoding of the window as encoded 
by the windowCoder. This makes the key unpredictable if we support features 
such as reloading a pipeline with a new SDK or a bugfixed pipeline. The 
encoding should be stable and documented.

And even if the SDK's window types are stable and documented, users can define 
their own windows with coders, and these may be less stable or require 
bugfixes. This ticket is to track the issue and any new ideas.

[1] 
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/state/StateNamespaces.java#L113
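A sketch of the construction at issue in [1] (illustrative, not the exact 
StateNamespaces code; StringKeySketch is a made-up name): the state key embeds 
the coder's byte encoding of the window, so any change to the coder's output 
changes every key:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import org.apache.beam.sdk.coders.Coder;

class StringKeySketch {
  // A new SDK or a bugfixed user coder that alters the byte format yields a
  // different base64 string, and therefore a different key for the same window.
  static <W> String windowStringKey(Coder<W> windowCoder, W window)
      throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    windowCoder.encode(window, bytes, Coder.Context.OUTER);
    return "/" + Base64.getEncoder().encodeToString(bytes.toByteArray()) + "/";
  }
}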





[jira] [Created] (BEAM-653) Refine specification for WindowFn.isCompatible()

2016-09-20 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-653:


 Summary: Refine specification for WindowFn.isCompatible() 
 Key: BEAM-653
 URL: https://issues.apache.org/jira/browse/BEAM-653
 Project: Beam
  Issue Type: New Feature
  Components: beam-model
Reporter: Kenneth Knowles


{{WindowFn#isCompatible}} doesn't really have a spec. In practice, it is used 
primarily when flattening together multiple PCollections. All of the WindowFns 
must be compatible, and then just a single WindowFn is selected arbitrarily for 
the output PCollection.

In consequence, downstream of the Flatten, the merging behavior will be taken 
from this WindowFn.

Currently, there are some mismatches:

 - Sessions with different gap durations _are_ compatible today, but probably 
shouldn't be since merging makes little sense. (The use of tiny proto-windows 
is an implementation detail anyhow)
 - SlidingWindows and FixedWindows _could_ reasonably be compatible if they had 
the same duration, though it might be odd.

Either way, we should just nail down what we actually mean so we can arrive at 
a verdict in these cases.
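For reference, the kind of check behind the first mismatch (paraphrased, so 
treat it as a sketch rather than the exact source): Sessions compares only the 
WindowFn class, never the gap duration:

// Any two Sessions instances compare as compatible, whatever their gaps, so a
// Flatten of two differently-gapped session streams passes today's check.
@Override
public boolean isCompatible(WindowFn<?, ?> other) {
  return other instanceof Sessions;
}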





[2/3] incubator-beam git commit: Updates Dataflow API client.

2016-09-20 Thread robertwb
Updates Dataflow API client.

These files were generated using the following command, but I updated the
license and package name manually to match the current version.

gen_client --discovery_url=dataflow.v1b3 --overwrite --outdir=out pip_package


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/2e3384e6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/2e3384e6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/2e3384e6

Branch: refs/heads/python-sdk
Commit: 2e3384e62ec14f41469c45c3701c1236242dc74c
Parents: b6c7478
Author: Chamikara Jayalath 
Authored: Mon Sep 19 22:16:04 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Sep 20 14:31:48 2016 -0700

--
 .../clients/dataflow/dataflow_v1b3_client.py|  106 +-
 .../clients/dataflow/dataflow_v1b3_messages.py  | 1435 +-
 2 files changed, 1175 insertions(+), 366 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/2e3384e6/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py
--
diff --git 
a/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py 
b/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py
index 1416638..840b887 100644
--- a/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py
+++ b/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_client.py
@@ -25,6 +25,7 @@ class DataflowV1b3(base_api.BaseApiClient):
   """Generated client library for service dataflow version v1b3."""
 
   MESSAGES_MODULE = messages
+  BASE_URL = u'https://dataflow.googleapis.com/'
 
   _PACKAGE = u'dataflow'
   _SCOPES = [u'https://www.googleapis.com/auth/cloud-platform', 
u'https://www.googleapis.com/auth/userinfo.email']
@@ -42,7 +43,7 @@ class DataflowV1b3(base_api.BaseApiClient):
credentials_args=None, default_global_params=None,
additional_http_headers=None):
 """Create a new dataflow handle."""
-url = url or u'https://dataflow.googleapis.com/'
+url = url or self.BASE_URL
 super(DataflowV1b3, self).__init__(
 url, credentials=credentials,
 get_credentials=get_credentials, http=http, model=model,
@@ -50,11 +51,76 @@ class DataflowV1b3(base_api.BaseApiClient):
 credentials_args=credentials_args,
 default_global_params=default_global_params,
 additional_http_headers=additional_http_headers)
+self.projects_jobs_debug = self.ProjectsJobsDebugService(self)
 self.projects_jobs_messages = self.ProjectsJobsMessagesService(self)
 self.projects_jobs_workItems = self.ProjectsJobsWorkItemsService(self)
 self.projects_jobs = self.ProjectsJobsService(self)
+self.projects_templates = self.ProjectsTemplatesService(self)
 self.projects = self.ProjectsService(self)
 
+  class ProjectsJobsDebugService(base_api.BaseApiService):
+"""Service class for the projects_jobs_debug resource."""
+
+_NAME = u'projects_jobs_debug'
+
+def __init__(self, client):
+  super(DataflowV1b3.ProjectsJobsDebugService, self).__init__(client)
+  self._method_configs = {
+  'GetConfig': base_api.ApiMethodInfo(
+  http_method=u'POST',
+  method_id=u'dataflow.projects.jobs.debug.getConfig',
+  ordered_params=[u'projectId', u'jobId'],
+  path_params=[u'jobId', u'projectId'],
+  query_params=[],
+  
relative_path=u'v1b3/projects/{projectId}/jobs/{jobId}/debug/getConfig',
+  request_field=u'getDebugConfigRequest',
+  request_type_name=u'DataflowProjectsJobsDebugGetConfigRequest',
+  response_type_name=u'GetDebugConfigResponse',
+  supports_download=False,
+  ),
+  'SendCapture': base_api.ApiMethodInfo(
+  http_method=u'POST',
+  method_id=u'dataflow.projects.jobs.debug.sendCapture',
+  ordered_params=[u'projectId', u'jobId'],
+  path_params=[u'jobId', u'projectId'],
+  query_params=[],
+  
relative_path=u'v1b3/projects/{projectId}/jobs/{jobId}/debug/sendCapture',
+  request_field=u'sendDebugCaptureRequest',
+  request_type_name=u'DataflowProjectsJobsDebugSendCaptureRequest',
+  response_type_name=u'SendDebugCaptureResponse',
+  supports_download=False,
+  ),
+  }
+
+  self._upload_configs = {
+  }
+
+def GetConfig(self, request, global_params=None):
+  """Get encoded debug configuration for component. Not cacheable.
+
+  Args:
+request: (DataflowProjectsJobsDebugGetConfigRequest) input message
+   

[GitHub] incubator-beam pull request #978: [BEAM-643] Updates Dataflow API client.

2016-09-20 Thread chamikaramj
Github user chamikaramj closed the pull request at:

https://github.com/apache/incubator-beam/pull/978




[jira] [Commented] (BEAM-643) Allow users to specify a custom service account

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507878#comment-15507878
 ] 

ASF GitHub Bot commented on BEAM-643:
-

Github user chamikaramj closed the pull request at:

https://github.com/apache/incubator-beam/pull/975


> Allow users to specify a custom service account
> ---
>
> Key: BEAM-643
> URL: https://issues.apache.org/jira/browse/BEAM-643
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Chamikara Jayalath
>
> Users should be able to specify a custom service account which can be used 
> when creating VMs. This feature is specific to the DataflowRunner, and a 
> corresponding user option will be added to GoogleCloudOptions.





[GitHub] incubator-beam pull request #975: [BEAM-643] Adds support for specifying a c...

2016-09-20 Thread chamikaramj
Github user chamikaramj closed the pull request at:

https://github.com/apache/incubator-beam/pull/975




[jira] [Updated] (BEAM-652) StateNamespaces.window() potentially unstable / stable encodings for windows

2016-09-20 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-652:
-
Issue Type: Improvement  (was: New Feature)

> StateNamespaces.window() potentially unstable / stable encodings for windows
> 
>
> Key: BEAM-652
> URL: https://issues.apache.org/jira/browse/BEAM-652
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>
> WindowNamespace#stringKey includes a base64 encoding of the window as encoded 
> by the windowCoder. This makes the key unpredictable if we support features 
> such as reloading a pipeline with a new SDK or a bugfixed pipeline. The 
> encoding should be stable and documented.
> And even if the SDK's window types are stable and documented, users can 
> define their own windows with coders, and these may be less stable or require 
> bugfixes. This ticket is to track the issue and any new ideas.
> [1] 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/state/StateNamespaces.java#L113





[jira] [Commented] (BEAM-643) Allow users to specify a custom service account

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507880#comment-15507880
 ] 

ASF GitHub Bot commented on BEAM-643:
-

Github user chamikaramj closed the pull request at:

https://github.com/apache/incubator-beam/pull/978


> Allow users to specify a custom service account
> ---
>
> Key: BEAM-643
> URL: https://issues.apache.org/jira/browse/BEAM-643
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Chamikara Jayalath
>
> Users should be able to specify a custom service account which can be used 
> when creating VMs. This feature is specific to the DataflowRunner, and a 
> corresponding user option will be added to GoogleCloudOptions.





[3/3] incubator-beam git commit: Closes #978

2016-09-20 Thread robertwb
Closes #978


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/c1964bdd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/c1964bdd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/c1964bdd

Branch: refs/heads/python-sdk
Commit: c1964bdd6f96d9e488366350df0cee79278106aa
Parents: b6c7478 2e3384e
Author: Robert Bradshaw 
Authored: Tue Sep 20 14:32:14 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Sep 20 14:32:14 2016 -0700

--
 .../clients/dataflow/dataflow_v1b3_client.py|  106 +-
 .../clients/dataflow/dataflow_v1b3_messages.py  | 1435 +-
 2 files changed, 1175 insertions(+), 366 deletions(-)
--




[1/3] incubator-beam git commit: Updates Dataflow API client.

2016-09-20 Thread robertwb
Repository: incubator-beam
Updated Branches:
  refs/heads/python-sdk b6c7478ff -> c1964bdd6


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/2e3384e6/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_messages.py
--
diff --git 
a/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_messages.py 
b/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_messages.py
index 8851231..178a542 100644
--- 
a/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_messages.py
+++ 
b/sdks/python/apache_beam/internal/clients/dataflow/dataflow_v1b3_messages.py
@@ -49,32 +49,35 @@ class ApproximateReportedProgress(_messages.Message):
 
   Fields:
 consumedParallelism: Total amount of parallelism in the portion of input
-  of this work item that has already been consumed. In the first two
-  examples above (see remaining_parallelism), the value should be 30 or 3
-  respectively. The sum of remaining_parallelism and consumed_parallelism
-  should equal the total amount of parallelism in this work item. If
-  specified, must be finite.
+  of this task that has already been consumed and is no longer active. In
+  the first two examples above (see remaining_parallelism), the value
+  should be 29 or 2 respectively.  The sum of remaining_parallelism and
+  consumed_parallelism should equal the total amount of parallelism in
+  this work item.  If specified, must be finite.
 fractionConsumed: Completion as fraction of the input consumed, from 0.0
   (beginning, nothing consumed), to 1.0 (end of the input, entire input
   consumed).
 position: A Position within the work to represent a progress.
 remainingParallelism: Total amount of parallelism in the input of this
-  WorkItem that has not been consumed yet (i.e. can be delegated to a new
-  WorkItem via dynamic splitting). "Amount of parallelism" refers to how
-  many non-empty parts of the input can be read in parallel. This does not
+  task that remains, (i.e. can be delegated to this task and any new tasks
+  via dynamic splitting). Always at least 1 for non-finished work items
+  and 0 for finished.  "Amount of parallelism" refers to how many non-
+  empty parts of the input can be read in parallel. This does not
   necessarily equal number of records. An input that can be read in
   parallel down to the individual records is called "perfectly
   splittable". An example of non-perfectly parallelizable input is a
   block-compressed file format where a block of records has to be read as
-  a whole, but different blocks can be read in parallel. Examples: * If we
-  have read 30 records out of 50 in a perfectly splittable 50-record
-  input, this value should be 20. * If we are reading through block 3 in a
-  block-compressed file consisting of 5 blocks, this value should be 2
-  (since blocks 4 and 5 can be processed in parallel by new work items via
-  dynamic splitting). * If we are reading through the last block in a
-  block-compressed file, or reading or processing the last record in a
-  perfectly splittable input, this value should be 0, because the
-  remainder of the work item cannot be further split.
+  a whole, but different blocks can be read in parallel.  Examples: * If
+  we are processing record #30 (starting at 1) out of 50 in a perfectly
+  splittable 50-record input, this value should be 21 (20 remaining + 1
+  current). * If we are reading through block 3 in a block-compressed file
+  consisting   of 5 blocks, this value should be 3 (since blocks 4 and 5
+  can be   processed in parallel by new tasks via dynamic splitting and
+  the current   task remains processing block 3). * If we are reading
+  through the last block in a block-compressed file,   or reading or
+  processing the last record in a perfectly splittable   input, this value
+  should be 1, because apart from the current task, no   additional
+  remainder can be split off.
   """
 
   consumedParallelism = _messages.MessageField('ReportedParallelism', 1)
@@ -112,9 +115,10 @@ class AutoscalingSettings(_messages.Message):
 """The algorithm to use for autoscaling.
 
 Values:
-  AUTOSCALING_ALGORITHM_UNKNOWN: 
-  AUTOSCALING_ALGORITHM_NONE: 
-  AUTOSCALING_ALGORITHM_BASIC: 
+  AUTOSCALING_ALGORITHM_UNKNOWN: The algorithm is unknown, or unspecified.
+  AUTOSCALING_ALGORITHM_NONE: Disable autoscaling.
+  AUTOSCALING_ALGORITHM_BASIC: Increase worker count over time to reduce
+job execution time.
 """
 AUTOSCALING_ALGORITHM_UNKNOWN = 0
 AUTOSCALING_ALGORITHM_NONE = 1
@@ -160,6 +164,221 @@ class ConcatPosition(_messages.Message):
   position = _messages.MessageField('Position', 2)
 
 
+class CounterMetadata(_messages.Message):
+  

[jira] [Commented] (BEAM-638) Add a Window function to create a bounded PCollection from an unbounded one

2016-09-20 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507706#comment-15507706
 ] 

Aljoscha Krettek commented on BEAM-638:
---

I don't think it's possible to provide such a function.

What would the semantics of such a function be? That is, when would it consider 
that window "done"? What happens to the computation upstream of that 
function/transformation; would it be canceled? Also, which window would be the 
one window that we take, for example from {{FixedWindows}}?

> Add a Window function to create a bounded PCollection from an unbounded one
> ---
>
> Key: BEAM-638
> URL: https://issues.apache.org/jira/browse/BEAM-638
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Jean-Baptiste Onofré
>Assignee: Davor Bonaci
>
> Today, if the pipeline source is unbounded and the sink expects a bounded 
> collection, there's no way to use a single pipeline. Even though a window 
> creates a chunk of the unbounded PCollection, the "sub" PCollection is still 
> unbounded.
> It would be helpful for users to have a Window function that creates a bounded 
> PCollection (on the window) from an unbounded PCollection coming from the 
> source.





[1/2] incubator-beam git commit: Closes #975

2016-09-20 Thread robertwb
Repository: incubator-beam
Updated Branches:
  refs/heads/python-sdk c1964bdd6 -> acd8d7952


Closes #975


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/acd8d795
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/acd8d795
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/acd8d795

Branch: refs/heads/python-sdk
Commit: acd8d795257a809528baed680fb71ba0aa272ee5
Parents: c1964bd 6df99b2
Author: Robert Bradshaw 
Authored: Tue Sep 20 14:37:39 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Sep 20 14:37:39 2016 -0700

--
 sdks/python/apache_beam/internal/apiclient.py | 4 
 sdks/python/apache_beam/utils/options.py  | 3 +++
 2 files changed, 7 insertions(+)
--




[2/2] incubator-beam git commit: Adds support for specifying a custom service account.

2016-09-20 Thread robertwb
Adds support for specifying a custom service account.

Updates Dataflow API client to latest version.

Adds ability to skip generated files during lint checks.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/6df99b2d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/6df99b2d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/6df99b2d

Branch: refs/heads/python-sdk
Commit: 6df99b2dcfde47885db0116f02a0c7211f2dc4bd
Parents: c1964bd
Author: Chamikara Jayalath 
Authored: Mon Sep 19 15:52:53 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Sep 20 14:37:39 2016 -0700

--
 sdks/python/apache_beam/internal/apiclient.py | 4 
 sdks/python/apache_beam/utils/options.py  | 3 +++
 2 files changed, 7 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/6df99b2d/sdks/python/apache_beam/internal/apiclient.py
--
diff --git a/sdks/python/apache_beam/internal/apiclient.py 
b/sdks/python/apache_beam/internal/apiclient.py
index bc4a4e0..3f82f29 100644
--- a/sdks/python/apache_beam/internal/apiclient.py
+++ b/sdks/python/apache_beam/internal/apiclient.py
@@ -117,6 +117,10 @@ class Environment(object):
 self.proto.userAgent = dataflow.Environment.UserAgentValue()
 self.local = 'localhost' in self.google_cloud_options.dataflow_endpoint
 
+if self.google_cloud_options.service_account_email:
+  self.proto.serviceAccountEmail = (
+  self.google_cloud_options.service_account_email)
+
 sdk_name, version_string = get_sdk_name_and_version()
 
 self.proto.userAgent.additionalProperties.extend([

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/6df99b2d/sdks/python/apache_beam/utils/options.py
--
diff --git a/sdks/python/apache_beam/utils/options.py 
b/sdks/python/apache_beam/utils/options.py
index 794a10d..700c080 100644
--- a/sdks/python/apache_beam/utils/options.py
+++ b/sdks/python/apache_beam/utils/options.py
@@ -248,6 +248,9 @@ class GoogleCloudOptions(PipelineOptions):
 default=None,
 help='Path to a file containing the P12 service '
 'credentials.')
+parser.add_argument('--service_account_email',
+default=None,
+help='Identity to run virtual machines as.')
 parser.add_argument('--no_auth', dest='no_auth', type=bool, default=False)
 
   def validate(self, validator):



[jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-20 Thread Jesse Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507437#comment-15507437
 ] 

Jesse Anderson commented on BEAM-645:
-

No, it doesn't happen with {{org.apache.beam.runners.spark.examples.WordCount}}.

I think the issue is subtly different. With Spark, you can define a default fs. 
You could define it to be file: or hdfs:, and that knowledge won't be 
transferred from the Spark runner to TextIO.

There's more information on reproducing it here if you need it: 
https://github.com/eljefe6a/beamexample#apache-spark.

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To work around this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}





[jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-20 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507498#comment-15507498
 ] 

Amit Sela commented on BEAM-645:


I think it's the other way around: TextIO is the one passing knowledge to the 
runner, but before doing so, it validates the filepattern and fails.
The "touch local" hack satisfies this validation, and execution then moves on 
to where the read operation is translated to "sc.textFile".

What subtlety am I missing? Simply try the SDK's example but call 
"withoutValidation()" on the read; it should work, as that's the only 
difference between the examples.

And again, I'm doing my best to support Read.Bound before the 27th (right?).

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To work around this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}





[jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-20 Thread Jesse Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507582#comment-15507582
 ] 

Jesse Anderson commented on BEAM-645:
-

TextIO passing the knowledge to the runner is the issue. Unless TextIO passes 
the protocol even when one isn't specified, we'll have a problem. For example, 
Spark's default fs is hdfs:, and I pass in a file of "input.txt". Should that 
be file: or hdfs:? If TextIO doesn't add the protocol, we'll have problems.

Yes, the Strata NYC demo is on the 27th at 9 ET. Running this demo with 
WordCount on Spark is my backup plan.

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To work around this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}





[jira] [Created] (BEAM-649) Pipeline "actions" should use foreachRDD via ParDo.

2016-09-20 Thread Amit Sela (JIRA)
Amit Sela created BEAM-649:
--

 Summary: Pipeline "actions" should use foreachRDD via ParDo.
 Key: BEAM-649
 URL: https://issues.apache.org/jira/browse/BEAM-649
 Project: Beam
  Issue Type: Bug
  Components: runner-spark
Reporter: Amit Sela
Assignee: Amit Sela


Spark will execute a pipeline ONLY if it's triggered by an action (batch) / 
output operation (streaming) - 
http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#output-operations-on-dstreams.

Currently, such actions in Beam are mostly implemented via ParDo, and 
translated by the runner as a Map transformation (via mapPartitions).

The runner overcomes this by "forcing" actions on untranslated leaves.
While this is OK, it would be better in some cases, e.g., Sinks, to apply the 
same ParDo translation but with foreach/foreachRDD instead of 
foreachPartition/mapPartitions.
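A sketch of that alternative translation (Spark 1.6-era Java API; 
SinkTranslationSketch and writeToSink are hypothetical names standing in for 
the DoFn body):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;

class SinkTranslationSketch {
  static void writeToSink(String element) { /* hypothetical sink call */ }

  // foreachRDD is an output operation, so Spark schedules the job on its own;
  // the current mapPartitions translation is lazy and only runs if the runner
  // forces an action on the untranslated leaf.
  static void applySink(JavaDStream<String> dstream) {
    dstream.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) {
        rdd.foreach(element -> writeToSink(element));
        return null;
      }
    });
  }
}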





[jira] [Created] (BEAM-655) Rename @RunnableOnService to something more descriptive

2016-09-20 Thread Jason Kuster (JIRA)
Jason Kuster created BEAM-655:
-

 Summary: Rename @RunnableOnService to something more descriptive
 Key: BEAM-655
 URL: https://issues.apache.org/jira/browse/BEAM-655
 Project: Beam
  Issue Type: Bug
Reporter: Jason Kuster
Assignee: Jason Kuster








[jira] [Closed] (BEAM-493) All Runners Run WordCount in Precommit

2016-09-20 Thread Jason Kuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Kuster closed BEAM-493.
-
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

All mainline runners now run wordcount in precommit

> All Runners Run WordCount in Precommit
> --
>
> Key: BEAM-493
> URL: https://issues.apache.org/jira/browse/BEAM-493
> Project: Beam
>  Issue Type: Improvement
>Reporter: Jason Kuster
>Assignee: Jason Kuster
> Fix For: 0.3.0-incubating
>
>






[2/2] incubator-beam git commit: Closes #981

2016-09-20 Thread robertwb
Closes #981


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/701aff07
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/701aff07
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/701aff07

Branch: refs/heads/python-sdk
Commit: 701aff07430eb5c9b2395afe2c05f75473c0ceda
Parents: acd8d79 57a0b6a
Author: Robert Bradshaw 
Authored: Tue Sep 20 16:15:32 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Sep 20 16:15:32 2016 -0700

--
 sdks/python/apache_beam/io/iobase.py | 1 +
 1 file changed, 1 insertion(+)
--




[GitHub] incubator-beam pull request #981: Insert global windowing before write resul...

2016-09-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/981




[1/2] incubator-beam git commit: Insert global windowing before write results GBK

2016-09-20 Thread robertwb
Repository: incubator-beam
Updated Branches:
  refs/heads/python-sdk acd8d7952 -> 701aff074


Insert global windowing before write results GBK

Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/57a0b6af
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/57a0b6af
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/57a0b6af

Branch: refs/heads/python-sdk
Commit: 57a0b6af5ef6bde2ba4bb88fe47997f86c5d5e25
Parents: acd8d79
Author: Robert Bradshaw 
Authored: Tue Sep 20 15:24:31 2016 -0700
Committer: GitHub 
Committed: Tue Sep 20 15:24:31 2016 -0700

--
 sdks/python/apache_beam/io/iobase.py | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/57a0b6af/sdks/python/apache_beam/io/iobase.py
--
diff --git a/sdks/python/apache_beam/io/iobase.py 
b/sdks/python/apache_beam/io/iobase.py
index ecb8a70..e1f364b 100644
--- a/sdks/python/apache_beam/io/iobase.py
+++ b/sdks/python/apache_beam/io/iobase.py
@@ -1032,6 +1032,7 @@ class WriteImpl(ptransform.PTransform):
   _WriteBundleDoFn(), self.sink,
   AsSingleton(init_result_coll))
| core.Map(lambda x: (None, x))
+   | core.WindowInto(window.GlobalWindows())
| core.GroupByKey()
| core.FlatMap(lambda x: x[1]))
 return do_once | core.FlatMap(



Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1179

2016-09-20 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-655) Rename @RunnableOnService to something more descriptive

2016-09-20 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508687#comment-15508687
 ] 

Kenneth Knowles commented on BEAM-655:
--

Relevant proposal at [Capability Matrix 
Testing|https://docs.google.com/document/d/1fICxq32t9yWn9qXhmT07xpclHeHX2VlUyVtpi2WzzGM/edit?ts=575b4e86#heading=h.sxlquvlkbhl3].

> Rename @RunnableOnService to something more descriptive
> ---
>
> Key: BEAM-655
> URL: https://issues.apache.org/jira/browse/BEAM-655
> Project: Beam
>  Issue Type: Bug
>Reporter: Jason Kuster
>Assignee: Jason Kuster
>






Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1176

2016-09-20 Thread Apache Jenkins Server
See 




[jira] [Created] (BEAM-656) Add support for side inputs/outputs to Java 8 lambdas in MapElements

2016-09-20 Thread Jani Patokallio (JIRA)
Jani Patokallio created BEAM-656:


 Summary: Add support for side inputs/outputs to Java 8 lambdas in 
MapElements
 Key: BEAM-656
 URL: https://issues.apache.org/jira/browse/BEAM-656
 Project: Beam
  Issue Type: Improvement
  Components: sdk-ideas
 Environment: Java 8
Reporter: Jani Patokallio
Assignee: James Malone
Priority: Minor


Currently there's no way to use side inputs or outputs within Java 8 lambdas 
and MapElements. It would be nice if you could do something like this:

PCollection<Integer> wordLengths = words.apply(
    MapElements.via((String word) -> {
        int sideInput1 = [[[ GetSideInputHere(); ]]]
        [[[ SetSideOutputHere ]]] (sideInput1 + word.length());
        return word.length();
    }).withOutputType(new TypeDescriptor<Integer>() {}));



