[jira] [Created] (BEAM-667) Include code snippets from real examples

2016-09-21 Thread Hadar Hod (JIRA)
Hadar Hod created BEAM-667:
--

 Summary: Include code snippets from real examples
 Key: BEAM-667
 URL: https://issues.apache.org/jira/browse/BEAM-667
 Project: Beam
  Issue Type: Sub-task
Reporter: Hadar Hod
Assignee: Hadar Hod






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-666) Add accurate "How to Run" instructions for each of the WC examples

2016-09-21 Thread Hadar Hod (JIRA)
Hadar Hod created BEAM-666:
--

 Summary: Add accurate "How to Run" instructions for each of the WC 
examples
 Key: BEAM-666
 URL: https://issues.apache.org/jira/browse/BEAM-666
 Project: Beam
  Issue Type: Sub-task
Reporter: Hadar Hod
Assignee: Hadar Hod






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-665) Copy prose (translate html to md, remove Dataflow references, etc.)

2016-09-21 Thread Hadar Hod (JIRA)
Hadar Hod created BEAM-665:
--

 Summary: Copy prose (translate html to md, remove Dataflow 
references, etc.)
 Key: BEAM-665
 URL: https://issues.apache.org/jira/browse/BEAM-665
 Project: Beam
  Issue Type: Sub-task
Reporter: Hadar Hod
Assignee: Hadar Hod






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-664) Port Dataflow SDK WordCount walkthrough to Beam site

2016-09-21 Thread Hadar Hod (JIRA)
Hadar Hod created BEAM-664:
--

 Summary: Port Dataflow SDK WordCount walkthrough to Beam site
 Key: BEAM-664
 URL: https://issues.apache.org/jira/browse/BEAM-664
 Project: Beam
  Issue Type: Task
  Components: website
Reporter: Hadar Hod
Assignee: Hadar Hod


Port the WordCount walkthrough from Dataflow docs to Beam website. 

* Copy prose (translate from html to md, remove Dataflow references, etc.)
* Add accurate "How to Run" instructions for each of the WC examples
* Include code snippets from real examples





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #985: Support BigQuery DATE type

2016-09-21 Thread peihe
GitHub user peihe opened a pull request:

https://github.com/apache/incubator-beam/pull/985

Support BigQuery DATE type



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/peihe/incubator-beam bq-new-types

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/985.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #985


commit 6c998f1b9e91dc0dfbd4cdf9cc4bf46bae9ca1bd
Author: Pei He 
Date:   2016-09-22T02:56:01Z

Support BigQuery DATE type




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1185

2016-09-21 Thread Apache Jenkins Server
See 




[GitHub] incubator-beam pull request #984: Remove DataflowMatchers

2016-09-21 Thread tgroh
GitHub user tgroh opened a pull request:

https://github.com/apache/incubator-beam/pull/984

Remove DataflowMatchers

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-<Jira issue #>] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

This is a Dataflow-specific requirement and should not be present within
the core Beam SDK.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgroh/incubator-beam rm_dataflow_matchers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/984.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #984


commit 56093398274d1b00738fa190ddde5d67b727610f
Author: Thomas Groh 
Date:   2016-09-22T01:16:45Z

Remove DataflowMatchers

This is a Dataflow-specific requirement and should not be present within
the core Beam SDK.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1184

2016-09-21 Thread Apache Jenkins Server
See 




[jira] [Updated] (BEAM-663) AvroSource should update progress within blocks

2016-09-21 Thread Daniel Halperin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Halperin updated BEAM-663:
-
Issue Type: Improvement  (was: Bug)

> AvroSource should update progress within blocks
> ---
>
> Key: BEAM-663
> URL: https://issues.apache.org/jira/browse/BEAM-663
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Chamikara Jayalath
>
> Currently Python AvroSource only updates progress at block boundaries.
> Java SDK updates this within blocks as well: 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BlockBasedSource.java#L208



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-663) AvroSource should update progress within blocks

2016-09-21 Thread Daniel Halperin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Halperin updated BEAM-663:
-
Component/s: sdk-py

> AvroSource should update progress within blocks
> ---
>
> Key: BEAM-663
> URL: https://issues.apache.org/jira/browse/BEAM-663
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Chamikara Jayalath
>
> Currently Python AvroSource only updates progress at block boundaries.
> Java SDK updates this within blocks as well: 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BlockBasedSource.java#L208



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-618) Python SDK writes non-RFC-compliant JSON files for BQ Export

2016-09-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511685#comment-15511685
 ] 

ASF GitHub Bot commented on BEAM-618:
-

Github user ajamato closed the pull request at:

https://github.com/apache/incubator-beam/pull/947


> Python SDK writes non-RFC-compliant JSON files for BQ Export
> -
>
> Key: BEAM-618
> URL: https://issues.apache.org/jira/browse/BEAM-618
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Alex Amato
>Assignee: Frances Perry
>
> The Python SDK uses the built-in json.dumps to write JSON files to GCS for the 
> BQ Exporter. BigQuery can fail to parse these files when it tries to load them 
> into a BQ table, because json.dumps can emit JSON that does not conform to 
> the JSON RFC.
> The cases that are not RFC compliant are listed in the json module docs:
> https://docs.python.org/2/library/json.html#standard-compliance-and-interoperability
> The main issue we run into is the NAN, INF and -INF values.
> These fail with a confusing error (and we delete the GCS files, making it 
> hard to debug):
> JSON table encountered too many errors, giving up. Rows JSON parsing error in 
> row starting at position
> We can set the allow_nan argument of json.dumps to False to address this, so 
> that writing a row with INF, -INF or NAN fails fast.
> With that argument set, json.dumps raises the error below when called with 
> NAN/INF values. We may want to catch this error and mention explicitly that 
> INF and NAN are not allowed.
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps
> sort_keys=sort_keys, **kw).encode(obj)
>   File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
> chunks = self.iterencode(o, _one_shot=True)
>   File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
> return _iterencode(o, 0)
> ValueError: Out of range float values are not JSON compliant
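
The failure mode above is reproducible with the standard library alone; a
minimal sketch of what allow_nan=False changes:

    import json

    # By default json.dumps emits the non-RFC tokens NaN/Infinity/-Infinity:
    print(json.dumps({'f': float('nan')}))  # {"f": NaN} -- not valid JSON

    # With allow_nan=False the same call fails fast instead:
    try:
        json.dumps({'f': float('nan')}, allow_nan=False)
    except ValueError as e:
        print(e)  # Out of range float values are not JSON compliant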



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #947: [BEAM-618] Disallow NAN, INF and -INF inva...

2016-09-21 Thread ajamato
Github user ajamato closed the pull request at:

https://github.com/apache/incubator-beam/pull/947


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-663) AvroSource should update progress within blocks

2016-09-21 Thread Chamikara Jayalath (JIRA)
Chamikara Jayalath created BEAM-663:
---

 Summary: AvroSource should update progress within blocks
 Key: BEAM-663
 URL: https://issues.apache.org/jira/browse/BEAM-663
 Project: Beam
  Issue Type: Bug
Reporter: Chamikara Jayalath
Assignee: Chamikara Jayalath


Currently Python AvroSource only updates progress at block boundaries.

Java SDK updates this within blocks as well: 
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BlockBasedSource.java#L208
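
The linked Java code interpolates progress inside the current block; a rough
Python sketch of the same idea (hypothetical helper, not the actual
AvroSource API):

    def fraction_consumed(block_offset, block_size, bytes_read_in_block,
                          total_size):
        # Blend within-block progress into the overall fraction consumed,
        # so progress advances smoothly as records are read instead of
        # jumping only when a block boundary is crossed.
        consumed = block_offset + min(bytes_read_in_block, block_size)
        return consumed / float(total_size)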



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[1/2] incubator-beam git commit: Closes #977

2016-09-21 Thread dhalperi
Repository: incubator-beam
Updated Branches:
  refs/heads/master 31a35e5c9 -> f62d04e22


Closes #977


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/f62d04e2
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/f62d04e2
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/f62d04e2

Branch: refs/heads/master
Commit: f62d04e22679ee2ac19e3ae37dec487d953d51c1
Parents: 31a35e5 c7e0010
Author: Dan Halperin 
Authored: Wed Sep 21 17:29:03 2016 -0700
Committer: Dan Halperin 
Committed: Wed Sep 21 17:29:03 2016 -0700

--
 .../beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java   | 10 ++
 .../sdk/io/gcp/bigquery/BigQueryTableRowIterator.java |  1 +
 .../sdk/io/gcp/bigquery/BigQueryAvroUtilsTest.java| 14 --
 .../io/gcp/bigquery/BigQueryTableRowIteratorTest.java | 13 ++---
 4 files changed, 33 insertions(+), 5 deletions(-)
--




[GitHub] incubator-beam pull request #977: Support BigQuery BYTES type

2016-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/977


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] incubator-beam git commit: Support BigQuery BYTES type

2016-09-21 Thread dhalperi
Support BigQuery BYTES type


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/c7e0010b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/c7e0010b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/c7e0010b

Branch: refs/heads/master
Commit: c7e0010b0d4a3c45148d05f5101f5310bb84c40c
Parents: 31a35e5
Author: Pei He 
Authored: Tue Sep 13 11:08:34 2016 -0700
Committer: Dan Halperin 
Committed: Wed Sep 21 17:29:03 2016 -0700

--
 .../beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java   | 10 ++
 .../sdk/io/gcp/bigquery/BigQueryTableRowIterator.java |  1 +
 .../sdk/io/gcp/bigquery/BigQueryAvroUtilsTest.java| 14 --
 .../io/gcp/bigquery/BigQueryTableRowIteratorTest.java | 13 ++---
 4 files changed, 33 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/c7e0010b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java
--
diff --git 
a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java
 
b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java
index 7826559..d9b5423 100644
--- 
a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java
+++ 
b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java
@@ -26,6 +26,9 @@ import com.google.api.services.bigquery.model.TableRow;
 import com.google.api.services.bigquery.model.TableSchema;
 import com.google.common.collect.ImmutableList;
 import com.google.common.collect.ImmutableMap;
+import com.google.common.io.BaseEncoding;
+
+import java.nio.ByteBuffer;
 import java.util.List;
 import javax.annotation.Nullable;
 import org.apache.avro.Schema;
@@ -153,6 +156,7 @@ class BigQueryAvroUtils {
 ImmutableMap<String, Type> fieldMap =
 ImmutableMap.<String, Type>builder()
 .put("STRING", Type.STRING)
+.put("BYTES", Type.BYTES)
 .put("INTEGER", Type.LONG)
 .put("FLOAT", Type.DOUBLE)
 .put("BOOLEAN", Type.BOOLEAN)
@@ -195,6 +199,12 @@ class BigQueryAvroUtils {
   case "RECORD":
 verify(v instanceof GenericRecord, "Expected GenericRecord, got %s", 
v.getClass());
 return convertGenericRecordToTableRow((GenericRecord) v, 
fieldSchema.getFields());
+  case "BYTES":
+verify(v instanceof ByteBuffer, "Expected ByteBuffer, got %s", 
v.getClass());
+ByteBuffer byteBuffer = (ByteBuffer) v;
+byte[] bytes = new byte[byteBuffer.limit()];
+byteBuffer.get(bytes);
+return BaseEncoding.base64().encode(bytes);
   default:
 throw new UnsupportedOperationException(
 String.format(

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/c7e0010b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableRowIterator.java
--
diff --git 
a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableRowIterator.java
 
b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableRowIterator.java
index 677c661..420f30c 100644
--- 
a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableRowIterator.java
+++ 
b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableRowIterator.java
@@ -253,6 +253,7 @@ class BigQueryTableRowIterator implements AutoCloseable {
   return BigQueryAvroUtils.formatTimestamp((String) v);
 }
 
+// Returns the original value for String and base64 encoded BYTES
 return v;
   }
 

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/c7e0010b/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtilsTest.java
--
diff --git 
a/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtilsTest.java
 
b/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtilsTest.java
index 59cf1f7..b9199b0 100644
--- 
a/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtilsTest.java
+++ 
b/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtilsTest.java
@@ -23,6 +23,9 @@ import 

[2/2] incubator-beam git commit: Closes #947

2016-09-21 Thread dhalperi
Closes #947


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/345fc698
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/345fc698
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/345fc698

Branch: refs/heads/python-sdk
Commit: 345fc698588631cfe056fa814a056d2bedc3054d
Parents: 701aff0 e173765
Author: Dan Halperin 
Authored: Wed Sep 21 15:43:06 2016 -0700
Committer: Dan Halperin 
Committed: Wed Sep 21 15:43:06 2016 -0700

--
 sdks/python/apache_beam/io/bigquery.py  | 24 +++
 sdks/python/apache_beam/io/bigquery_test.py | 37 
 2 files changed, 55 insertions(+), 6 deletions(-)
--




[1/2] incubator-beam git commit: Set allow_nan=False on bigquery JSON encoding

2016-09-21 Thread dhalperi
Repository: incubator-beam
Updated Branches:
  refs/heads/python-sdk 701aff074 -> 345fc6985


Set allow_nan=False on bigquery JSON encoding


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/e173765a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/e173765a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/e173765a

Branch: refs/heads/python-sdk
Commit: e173765adffa4ca13f495a3f7e9fbfc2127cdc1e
Parents: 701aff0
Author: Alex Amato 
Authored: Thu Sep 8 17:57:28 2016 -0700
Committer: Dan Halperin 
Committed: Wed Sep 21 15:42:58 2016 -0700

--
 sdks/python/apache_beam/io/bigquery.py  | 24 +++
 sdks/python/apache_beam/io/bigquery_test.py | 37 
 2 files changed, 55 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/e173765a/sdks/python/apache_beam/io/bigquery.py
--
diff --git a/sdks/python/apache_beam/io/bigquery.py 
b/sdks/python/apache_beam/io/bigquery.py
index 50c2eaf..5508eaa 100644
--- a/sdks/python/apache_beam/io/bigquery.py
+++ b/sdks/python/apache_beam/io/bigquery.py
@@ -136,6 +136,8 @@ __all__ = [
 'BigQuerySink',
 ]
 
+JSON_COMPLIANCE_ERROR = 'NAN, INF and -INF values are not JSON compliant.'
+
 
 class RowAsDictJsonCoder(coders.Coder):
   """A coder for a table row (represented as a dict) to/from a JSON string.
@@ -145,7 +147,14 @@ class RowAsDictJsonCoder(coders.Coder):
   """
 
   def encode(self, table_row):
-return json.dumps(table_row)
+# The normal error when dumping NAN/INF values is:
+# ValueError: Out of range float values are not JSON compliant
+# This code will catch this error to emit an error that explains
+# to the programmer that they have used NAN/INF values.
+try:
+  return json.dumps(table_row, allow_nan=False)
+except ValueError as e:
+  raise ValueError('%s. %s' % (e, JSON_COMPLIANCE_ERROR))
 
   def decode(self, encoded_table_row):
 return json.loads(encoded_table_row)
@@ -173,10 +182,14 @@ class TableRowJsonCoder(coders.Coder):
   raise AttributeError(
   'The TableRowJsonCoder requires a table schema for '
   'encoding operations. Please specify a table_schema argument.')
-return json.dumps(
-collections.OrderedDict(
-zip(self.field_names,
-[from_json_value(f.v) for f in table_row.f])))
+try:
+  return json.dumps(
+  collections.OrderedDict(
+  zip(self.field_names,
+  [from_json_value(f.v) for f in table_row.f])),
+  allow_nan=False)
+except ValueError as e:
+  raise ValueError('%s. %s' % (e, JSON_COMPLIANCE_ERROR))
 
   def decode(self, encoded_table_row):
 od = json.loads(
@@ -428,7 +441,6 @@ class BigQuerySink(iobase.NativeSink):
   fs['fields'] = schema_list_as_object(f.fields)
 fields.append(fs)
   return fields
-
 return json.dumps(
 {'fields': schema_list_as_object(self.table_schema.fields)})
 

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/e173765a/sdks/python/apache_beam/io/bigquery_test.py
--
diff --git a/sdks/python/apache_beam/io/bigquery_test.py 
b/sdks/python/apache_beam/io/bigquery_test.py
index 2bca0dc..09e64ce 100644
--- a/sdks/python/apache_beam/io/bigquery_test.py
+++ b/sdks/python/apache_beam/io/bigquery_test.py
@@ -40,6 +40,22 @@ class TestRowAsDictJsonCoder(unittest.TestCase):
 test_value = {'s': 'abc', 'i': 123, 'f': 123.456, 'b': True}
 self.assertEqual(test_value, coder.decode(coder.encode(test_value)))
 
+  def json_compliance_exception(self, value):
+with self.assertRaises(ValueError) as exn:
+  coder = RowAsDictJsonCoder()
+  test_value = {'s': value}
+  self.assertEqual(test_value, coder.decode(coder.encode(test_value)))
+  self.assertTrue(bigquery.JSON_COMPLIANCE_ERROR in exn.exception.message)
+
+  def test_invalid_json_nan(self):
+self.json_compliance_exception(float('nan'))
+
+  def test_invalid_json_inf(self):
+self.json_compliance_exception(float('inf'))
+
+  def test_invalid_json_neg_inf(self):
+self.json_compliance_exception(float('-inf'))
+
 
 class TestTableRowJsonCoder(unittest.TestCase):
 
@@ -71,6 +87,27 @@ class TestTableRowJsonCoder(unittest.TestCase):
 self.assertTrue(
 ctx.exception.message.startswith('The TableRowJsonCoder requires'))
 
+  def json_compliance_exception(self, value):
+with self.assertRaises(ValueError) as exn:
+  schema_definition = [('f', 'FLOAT')]
+  schema = bigquery.TableSchema(
+  

Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1183

2016-09-21 Thread Apache Jenkins Server
See 




[GitHub] incubator-beam pull request #983: [BEAM-657] Support Read.Bounded primitive

2016-09-21 Thread amitsela
GitHub user amitsela opened a pull request:

https://github.com/apache/incubator-beam/pull/983

[BEAM-657] Support Read.Bounded primitive

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-<Jira issue #>] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amitsela/incubator-beam BEAM-657

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/983.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #983






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-662) SlidingWindows should support sub-second periods

2016-09-21 Thread Daniel Mills (JIRA)
Daniel Mills created BEAM-662:
-

 Summary: SlidingWindows should support sub-second periods
 Key: BEAM-662
 URL: https://issues.apache.org/jira/browse/BEAM-662
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Daniel Mills
Assignee: Frances Perry
Priority: Minor


SlidingWindows periods are being rounded to seconds, see 
http://stackoverflow.com/questions/39604646/can-slidingwindows-have-half-second-periods-in-python-apache-beam
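
A minimal sketch of the report, assuming the apache_beam.transforms.window
API of the time; with the rounding described above, these two windowings
behave identically:

    from apache_beam.transforms import window

    # A half-second period should produce windows every 0.5s, but the
    # period is silently rounded to whole seconds:
    half_second = window.SlidingWindows(size=3, period=0.5)
    one_second = window.SlidingWindows(size=3, period=1)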



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1182

2016-09-21 Thread Apache Jenkins Server
See 




[jira] [Reopened] (BEAM-661) CalendarWindows#isCompatibleWith should use equals instead of ==

2016-09-21 Thread Luke Cwik (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik reopened BEAM-661:

  Assignee: (was: Davor Bonaci)

> CalendarWindows#isCompatibleWith should use equals instead of ==
> 
>
> Key: BEAM-661
> URL: https://issues.apache.org/jira/browse/BEAM-661
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Ben Chambers
>Priority: Minor
> Fix For: Not applicable
>
>
> http://stackoverflow.com/questions/39617897/inputs-to-flatten-had-incompatible-window-windowfns-when-cogroupbykey-with-calen
> We're using `==` instead of `.equals` to compare objects, which causes 
> equivalent CalendarWindows to be incompatible.
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/CalendarWindows.java#L143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-660) CalendarWindows compares DateTimes with ==

2016-09-21 Thread Luke Cwik (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik resolved BEAM-660.

   Resolution: Duplicate
Fix Version/s: Not applicable

Duplicate of https://issues.apache.org/jira/browse/BEAM-661

> CalendarWindows compares DateTimes with ==
> --
>
> Key: BEAM-660
> URL: https://issues.apache.org/jira/browse/BEAM-660
> Project: Beam
>  Issue Type: Bug
>Reporter: Daniel Mills
>Priority: Minor
> Fix For: Not applicable
>
>
> CalendarWindows compares DateTime objects with ==, which causes equivalent 
> WindowFns to be considered incompatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-661) CalendarWindows#isCompatibleWith should use equals instead of ==

2016-09-21 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers closed BEAM-661.
-
   Resolution: Duplicate
Fix Version/s: Not applicable

> CalendarWindows#isCompatibleWith should use equals instead of ==
> 
>
> Key: BEAM-661
> URL: https://issues.apache.org/jira/browse/BEAM-661
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Ben Chambers
>Assignee: Davor Bonaci
>Priority: Minor
> Fix For: Not applicable
>
>
> http://stackoverflow.com/questions/39617897/inputs-to-flatten-had-incompatible-window-windowfns-when-cogroupbykey-with-calen
> We're using `==` instead of `.equals` to compare objects, which causes 
> equivalent CalendarWindows to be incompatible.
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/CalendarWindows.java#L143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-661) CalendarWindows#isCompatibleWith should use equals instead of ==

2016-09-21 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-661:
-

 Summary: CalendarWindows#isCompatibleWith should use equals 
instead of ==
 Key: BEAM-661
 URL: https://issues.apache.org/jira/browse/BEAM-661
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Ben Chambers
Assignee: Davor Bonaci
Priority: Minor


http://stackoverflow.com/questions/39617897/inputs-to-flatten-had-incompatible-window-windowfns-when-cogroupbykey-with-calen

We're using `==` instead of `.equals` to compare objects, which causes 
equivalent CalendarWindows to be incompatible.

https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/CalendarWindows.java#L143
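
Java's `==` on objects compares identity, not value, so the bug is the
analogue of writing Python's `is` where `==` was meant. A quick illustration,
with Python's datetime standing in for Joda's DateTime:

    from datetime import datetime

    a = datetime(2016, 9, 21)
    b = datetime(2016, 9, 21)

    assert a == b      # value equality -- what isCompatibleWith needs
    assert a is not b  # identity -- what Java's `==` actually tested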



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-660) CalendarWindows compares DateTimes with ==

2016-09-21 Thread Daniel Mills (JIRA)
Daniel Mills created BEAM-660:
-

 Summary: CalendarWindows compares DateTimes with ==
 Key: BEAM-660
 URL: https://issues.apache.org/jira/browse/BEAM-660
 Project: Beam
  Issue Type: Bug
Reporter: Daniel Mills
Priority: Minor


CalendarWindows compares DateTime objects with ==, which causes equivalent 
WindowFns to be considered incompatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-613) SimpleStreamingWordCountTest tests only a single batch

2016-09-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510728#comment-15510728
 ] 

ASF GitHub Bot commented on BEAM-613:
-

GitHub user staslev opened a pull request:

https://github.com/apache/incubator-beam/pull/982

[BEAM-613] Revised SimpleStreamingWordCountTest to test multiple batches

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-<Jira issue #>] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @amitsela 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staslev/incubator-beam 
BEAM-613-Improve-SimpleStreamingWordCountTest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/982.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #982


commit b2d84ae0b97220bd2d187d17db03a47c41128365
Author: Stas Levin 
Date:   2016-09-05T15:22:59Z

Revised the test to test multiple batches




> SimpleStreamingWordCountTest tests only a single batch
> --
>
> Key: BEAM-613
> URL: https://issues.apache.org/jira/browse/BEAM-613
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Stas Levin
>Assignee: Stas Levin
>
> {{org.apache.beam.runners.spark.translation.streaming.SimpleStreamingWordCountTest}}
>  aims to test a simple Spark streaming job, but only tests a single batch, 
> which is uncharacteristic of an actual (even simple) streaming job, usually 
> consisting of multiple batches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #982: [BEAM-613] Revised SimpleStreamingWordCoun...

2016-09-21 Thread staslev
GitHub user staslev opened a pull request:

https://github.com/apache/incubator-beam/pull/982

[BEAM-613] Revised SimpleStreamingWordCountTest to test multiple batches

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-<Jira issue #>] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @amitsela 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staslev/incubator-beam 
BEAM-613-Improve-SimpleStreamingWordCountTest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/982.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #982


commit b2d84ae0b97220bd2d187d17db03a47c41128365
Author: Stas Levin 
Date:   2016-09-05T15:22:59Z

Revised the test to test multiple batches




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/3] incubator-beam git commit: TextIOTest: rewrite for code reuse and actually more test coverage

2016-09-21 Thread dhalperi
TextIOTest: rewrite for code reuse and actually more test coverage

We had missed a few empty file cases, etc.

Suggestions for improvement still requested. Sad I had to delete the 
TemporaryFolder @Rule but
could not see how to get one-time file creation with it.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/5f7153a3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/5f7153a3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/5f7153a3

Branch: refs/heads/master
Commit: 5f7153a3b81d13f9786d24174ba881fb79205784
Parents: 370d592
Author: Dan Halperin 
Authored: Tue Sep 20 10:26:55 2016 -0700
Committer: Dan Halperin 
Committed: Wed Sep 21 11:09:50 2016 -0700

--
 .../java/org/apache/beam/sdk/io/TextIOTest.java | 477 ---
 1 file changed, 199 insertions(+), 278 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/5f7153a3/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOTest.java
--
diff --git 
a/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOTest.java 
b/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOTest.java
index 49f5b16..fdfb652 100644
--- a/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOTest.java
+++ b/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOTest.java
@@ -21,8 +21,14 @@ import static org.apache.beam.sdk.TestUtils.INTS_ARRAY;
 import static org.apache.beam.sdk.TestUtils.LINES_ARRAY;
 import static org.apache.beam.sdk.TestUtils.NO_INTS_ARRAY;
 import static org.apache.beam.sdk.TestUtils.NO_LINES_ARRAY;
+import static org.apache.beam.sdk.io.TextIO.CompressionType.AUTO;
+import static org.apache.beam.sdk.io.TextIO.CompressionType.BZIP2;
+import static org.apache.beam.sdk.io.TextIO.CompressionType.GZIP;
+import static org.apache.beam.sdk.io.TextIO.CompressionType.UNCOMPRESSED;
+import static org.apache.beam.sdk.io.TextIO.CompressionType.ZIP;
 import static 
org.apache.beam.sdk.transforms.display.DisplayDataMatchers.hasDisplayItem;
 import static 
org.apache.beam.sdk.transforms.display.DisplayDataMatchers.hasValue;
+import static org.apache.beam.sdk.util.IOChannelUtils.resolve;
 import static org.hamcrest.Matchers.containsInAnyOrder;
 import static org.hamcrest.Matchers.equalTo;
 import static org.hamcrest.Matchers.greaterThan;
@@ -51,8 +57,12 @@ import java.io.PrintStream;
 import java.nio.channels.FileChannel;
 import java.nio.channels.SeekableByteChannel;
 import java.nio.charset.StandardCharsets;
+import java.nio.file.FileVisitResult;
 import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.SimpleFileVisitor;
 import java.nio.file.StandardOpenOption;
+import java.nio.file.attribute.BasicFileAttributes;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
@@ -87,13 +97,13 @@ import org.apache.beam.sdk.util.IOChannelUtils;
 import org.apache.beam.sdk.util.gcsfs.GcsPath;
 import org.apache.beam.sdk.values.PCollection;
 import 
org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
+import org.junit.AfterClass;
 import org.junit.BeforeClass;
 import org.junit.Ignore;
 import org.junit.Rule;
 import org.junit.Test;
 import org.junit.experimental.categories.Category;
 import org.junit.rules.ExpectedException;
-import org.junit.rules.TemporaryFolder;
 import org.junit.runner.RunWith;
 import org.junit.runners.JUnit4;
 import org.mockito.Mockito;
@@ -110,19 +120,93 @@ public class TextIOTest {
 
   private static final String MY_HEADER = "myHeader";
   private static final String MY_FOOTER = "myFooter";
+  private static final String[] EMPTY = new String[] {};
+  private static final String[] TINY =
+  new String[] {"Irritable eagle", "Optimistic jay", "Fanciful hawk"};
+  private static final String[] LARGE = makeLines(5000);
+
+  private static Path tempFolder;
+  private static File emptyTxt;
+  private static File tinyTxt;
+  private static File largeTxt;
+  private static File emptyGz;
+  private static File tinyGz;
+  private static File largeGz;
+  private static File emptyBzip2;
+  private static File tinyBzip2;
+  private static File largeBzip2;
+  private static File emptyZip;
+  private static File tinyZip;
+  private static File largeZip;
 
   @Rule
-  public TemporaryFolder tmpFolder = new TemporaryFolder();
-  @Rule
   public ExpectedException expectedException = ExpectedException.none();
 
+  private static File writeToFile(String[] lines, String filename, 
CompressionType compression)
+  throws IOException {
+File file = tempFolder.resolve(filename).toFile();
+OutputStream output = new FileOutputStream(file);
+switch (compression) {
+  case 

[1/3] incubator-beam git commit: Closes #980

2016-09-21 Thread dhalperi
Repository: incubator-beam
Updated Branches:
  refs/heads/master 1ceb12aeb -> 31a35e5c9


Closes #980


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/31a35e5c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/31a35e5c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/31a35e5c

Branch: refs/heads/master
Commit: 31a35e5c99056b0bd180623974a47d5c5400a395
Parents: 1ceb12a 5f7153a
Author: Dan Halperin 
Authored: Wed Sep 21 11:09:50 2016 -0700
Committer: Dan Halperin 
Committed: Wed Sep 21 11:09:50 2016 -0700

--
 .../apache/beam/sdk/io/CompressedSource.java|  10 +-
 .../java/org/apache/beam/sdk/io/TextIO.java |  44 +-
 .../java/org/apache/beam/sdk/io/TextIOTest.java | 516 ++-
 3 files changed, 284 insertions(+), 286 deletions(-)
--




[3/3] incubator-beam git commit: TextIO/CompressedSource: split AUTO mode files into bundles

2016-09-21 Thread dhalperi
TextIO/CompressedSource: split AUTO mode files into bundles


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/370d5924
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/370d5924
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/370d5924

Branch: refs/heads/master
Commit: 370d5924f393346115a22c23e5487f094847a783
Parents: 1ceb12a
Author: Dan Halperin 
Authored: Mon Sep 19 22:46:26 2016 -0700
Committer: Dan Halperin 
Committed: Wed Sep 21 11:09:50 2016 -0700

--
 .../apache/beam/sdk/io/CompressedSource.java| 10 +--
 .../java/org/apache/beam/sdk/io/TextIO.java | 44 +-
 .../java/org/apache/beam/sdk/io/TextIOTest.java | 85 +++-
 3 files changed, 108 insertions(+), 31 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/370d5924/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java
--
diff --git 
a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java 
b/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java
index 3cd097c..8a5fedd 100644
--- a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java
+++ b/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java
@@ -314,11 +314,11 @@ public class CompressedSource<T> extends FileBasedSource<T> {
   DecompressingChannelFactory channelFactory, String filePatternOrSpec, 
long minBundleSize,
   long startOffset, long endOffset) {
 super(filePatternOrSpec, minBundleSize, startOffset, endOffset);
-checkArgument(
-startOffset == 0,
-"CompressedSources must start reading at offset 0. Requested offset: " 
+ startOffset);
 this.sourceDelegate = sourceDelegate;
 this.channelFactory = channelFactory;
+checkArgument(
+isSplittable() || startOffset == 0,
+"CompressedSources must start reading at offset 0. Requested offset: " 
+ startOffset);
   }
 
   /**
@@ -339,7 +339,7 @@ public class CompressedSource<T> extends FileBasedSource<T> {
   @Override
   protected FileBasedSource createForSubrangeOfFile(String fileName, long 
start, long end) {
 return new 
CompressedSource<>(sourceDelegate.createForSubrangeOfFile(fileName, start, end),
-channelFactory, fileName, Long.MAX_VALUE, start, end);
+channelFactory, fileName, sourceDelegate.getMinBundleSize(), start, 
end);
   }
 
   /**
@@ -348,7 +348,7 @@ public class CompressedSource<T> extends FileBasedSource<T> {
* from the requested file name that the file is not compressed.
*/
   @Override
-  protected final boolean isSplittable() throws Exception {
+  protected final boolean isSplittable() {
 if (channelFactory instanceof FileNameBasedDecompressingChannelFactory) {
   FileNameBasedDecompressingChannelFactory fileNameBasedChannelFactory =
   (FileNameBasedDecompressingChannelFactory) channelFactory;

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/370d5924/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java
--
diff --git a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java 
b/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java
index 79967d1..62d3ae8 100644
--- a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java
+++ b/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java
@@ -19,6 +19,7 @@ package org.apache.beam.sdk.io;
 
 import static com.google.common.base.Preconditions.checkArgument;
 import static com.google.common.base.Preconditions.checkState;
+import static org.apache.beam.sdk.io.TextIO.CompressionType.UNCOMPRESSED;
 
 import com.google.common.annotations.VisibleForTesting;
 import com.google.protobuf.ByteString;
@@ -286,40 +287,35 @@ public class TextIO {
   }
 }
 
-// Create a source specific to the requested compression type.
-final Bounded read;
-switch(compressionType) {
+final Bounded read = org.apache.beam.sdk.io.Read.from(getSource());
+PCollection pcol = input.getPipeline().apply("Read", read);
+// Honor the default output coder that would have been used by this 
PTransform.
+pcol.setCoder(getDefaultOutputCoder());
+return pcol;
+  }
+
+  // Helper to create a source specific to the requested compression type.
+  protected FileBasedSource getSource() {
+switch (compressionType) {
   case UNCOMPRESSED:
-read = org.apache.beam.sdk.io.Read.from(
-new TextSource(filepattern, coder));
-break;
+return new TextSource(filepattern, coder);

[jira] [Updated] (BEAM-610) Enable spark's checkpointing mechanism for driver-failure recovery in streaming

2016-09-21 Thread Amit Sela (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Sela updated BEAM-610:
---
Fix Version/s: 0.3.0-incubating

> Enable spark's checkpointing mechanism for driver-failure recovery in 
> streaming
> ---
>
> Key: BEAM-610
> URL: https://issues.apache.org/jira/browse/BEAM-610
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Amit Sela
> Fix For: 0.3.0-incubating
>
>
> For streaming applications, Spark provides a checkpoint mechanism useful for 
> stateful processing and driver failures. See: 
> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
> This requires the "lambdas", i.e. the content of DStream/RDD functions, to be 
> Serializable - currently, the runner delegates a lot of the translation work in 
> streaming to the batch translator, which can no longer be the case because it 
> passes along non-serializables.
> This also requires wrapping the creation of the streaming application's graph 
> in a "getOrCreate" manner. See: 
> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#how-to-configure-checkpointing
> Another limitation is the need to wrap Accumulators and Broadcast variables 
> in Singletons in order for them to be re-created once stale after recovery.
> This work is a prerequisite to supporting PerKey workflows, which will be 
> supported via Spark's stateful operators such as mapWithState.
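
The "getOrCreate" wrapping mentioned above is Spark's standard recovery
pattern; a minimal sketch using the PySpark API for brevity (the runner
itself is Java, and the checkpoint path here is hypothetical):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    CHECKPOINT_DIR = 'hdfs:///tmp/beam-spark-checkpoint'  # hypothetical path

    def create_context():
        # Build the whole streaming graph in here; it is checkpointed so a
        # restarted driver recovers it instead of rebuilding from scratch.
        sc = SparkContext(conf=SparkConf().setAppName('checkpoint-sketch'))
        ssc = StreamingContext(sc, batchDuration=1)
        ssc.checkpoint(CHECKPOINT_DIR)
        return ssc

    # First run creates the context; a restart after a driver failure
    # recovers it from CHECKPOINT_DIR.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)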



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-610) Enable spark's checkpointing mechanism for driver-failure recovery in streaming

2016-09-21 Thread Amit Sela (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Sela resolved BEAM-610.

Resolution: Fixed

> Enable spark's checkpointing mechanism for driver-failure recovery in 
> streaming
> ---
>
> Key: BEAM-610
> URL: https://issues.apache.org/jira/browse/BEAM-610
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Amit Sela
> Fix For: 0.3.0-incubating
>
>
> For streaming applications, Spark provides a checkpoint mechanism useful for 
> stateful processing and driver failures. See: 
> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
> This requires the "lambdas", i.e. the content of DStream/RDD functions, to be 
> Serializable - currently, the runner delegates a lot of the translation work in 
> streaming to the batch translator, which can no longer be the case because it 
> passes along non-serializables.
> This also requires wrapping the creation of the streaming application's graph 
> in a "getOrCreate" manner. See: 
> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#how-to-configure-checkpointing
> Another limitation is the need to wrap Accumulators and Broadcast variables 
> in Singletons in order for them to be re-created once stale after recovery.
> This work is a prerequisite to supporting PerKey workflows, which will be 
> supported via Spark's stateful operators such as mapWithState.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #909: [BEAM-610] Enable spark's checkpointing me...

2016-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/909


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/4] incubator-beam git commit: [BEAM-610] Enable spark's checkpointing mechanism for driver-failure recovery in streaming.

2016-09-21 Thread amitsela
http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0feb6499/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java
--
diff --git 
a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java
 
b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java
index 8341c6d..1a0511f 100644
--- 
a/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java
+++ 
b/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java
@@ -19,39 +19,32 @@
 
 package org.apache.beam.runners.spark.translation;
 
+import static com.google.common.base.Preconditions.checkState;
 import static 
org.apache.beam.runners.spark.io.hadoop.ShardNameBuilder.getOutputDirectory;
 import static 
org.apache.beam.runners.spark.io.hadoop.ShardNameBuilder.getOutputFilePrefix;
 import static 
org.apache.beam.runners.spark.io.hadoop.ShardNameBuilder.getOutputFileTemplate;
 import static 
org.apache.beam.runners.spark.io.hadoop.ShardNameBuilder.replaceShardCount;
 
-import com.google.common.collect.ImmutableMap;
-import com.google.common.collect.Lists;
 import com.google.common.collect.Maps;
 import java.io.IOException;
-import java.io.Serializable;
-import java.lang.reflect.Field;
-import java.util.Arrays;
 import java.util.Collections;
-import java.util.List;
 import java.util.Map;
 import org.apache.avro.mapred.AvroKey;
 import org.apache.avro.mapreduce.AvroJob;
 import org.apache.avro.mapreduce.AvroKeyInputFormat;
 import org.apache.beam.runners.core.AssignWindowsDoFn;
-import org.apache.beam.runners.core.GroupAlsoByWindowsViaOutputBufferDoFn;
 import 
org.apache.beam.runners.core.GroupByKeyViaGroupByKeyOnly.GroupAlsoByWindow;
 import org.apache.beam.runners.core.GroupByKeyViaGroupByKeyOnly.GroupByKeyOnly;
-import org.apache.beam.runners.core.SystemReduceFn;
+import org.apache.beam.runners.spark.aggregators.AccumulatorSingleton;
+import org.apache.beam.runners.spark.aggregators.NamedAggregators;
 import org.apache.beam.runners.spark.coders.CoderHelpers;
 import org.apache.beam.runners.spark.io.hadoop.HadoopIO;
 import org.apache.beam.runners.spark.io.hadoop.ShardNameTemplateHelper;
 import org.apache.beam.runners.spark.io.hadoop.TemplatedAvroKeyOutputFormat;
 import org.apache.beam.runners.spark.io.hadoop.TemplatedTextOutputFormat;
 import org.apache.beam.runners.spark.util.BroadcastHelper;
-import org.apache.beam.runners.spark.util.ByteArray;
 import org.apache.beam.sdk.coders.CannotProvideCoderException;
 import org.apache.beam.sdk.coders.Coder;
-import org.apache.beam.sdk.coders.IterableCoder;
 import org.apache.beam.sdk.coders.KvCoder;
 import org.apache.beam.sdk.io.AvroIO;
 import org.apache.beam.sdk.io.TextIO;
@@ -63,36 +56,30 @@ import org.apache.beam.sdk.transforms.PTransform;
 import org.apache.beam.sdk.transforms.ParDo;
 import org.apache.beam.sdk.transforms.View;
 import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
-import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
 import org.apache.beam.sdk.transforms.windowing.Window;
 import org.apache.beam.sdk.transforms.windowing.WindowFn;
 import org.apache.beam.sdk.util.WindowedValue;
-import org.apache.beam.sdk.util.WindowedValue.WindowedValueCoder;
-import org.apache.beam.sdk.util.WindowingStrategy;
-import org.apache.beam.sdk.util.state.InMemoryStateInternals;
-import org.apache.beam.sdk.util.state.StateInternals;
-import org.apache.beam.sdk.util.state.StateInternalsFactory;
 import org.apache.beam.sdk.values.KV;
 import org.apache.beam.sdk.values.PCollection;
 import org.apache.beam.sdk.values.PCollectionList;
 import org.apache.beam.sdk.values.PCollectionTuple;
-import org.apache.beam.sdk.values.PCollectionView;
 import org.apache.beam.sdk.values.TupleTag;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.spark.Accumulator;
 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaRDDLike;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.api.java.function.Function;
-import org.apache.spark.api.java.function.Function2;
-import org.apache.spark.api.java.function.PairFlatMapFunction;
 import org.apache.spark.api.java.function.PairFunction;
+
 import scala.Tuple2;
 
+
 /**
  * Supports translation between a Beam transform, and Spark's operations on 
RDDs.
  */
@@ -101,31 +88,6 @@ public final class TransformTranslator {
   private TransformTranslator() {
   }
 
-  /**
-   * Getter of the field.
-   */
-  public static class FieldGetter {
-private final Map<String, Field> fields;
-
-public FieldGetter(Class<?> clazz) {
- 

[1/4] incubator-beam git commit: [BEAM-610] Enable spark's checkpointing mechanism for driver-failure recovery in streaming.

2016-09-21 Thread amitsela
Repository: incubator-beam
Updated Branches:
  refs/heads/master 5c23f4954 -> 1ceb12aeb


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0feb6499/runners/spark/src/test/java/org/apache/beam/runners/spark/ClearAggregatorsRule.java
--
diff --git 
a/runners/spark/src/test/java/org/apache/beam/runners/spark/ClearAggregatorsRule.java
 
b/runners/spark/src/test/java/org/apache/beam/runners/spark/ClearAggregatorsRule.java
new file mode 100644
index 000..beaae13
--- /dev/null
+++ 
b/runners/spark/src/test/java/org/apache/beam/runners/spark/ClearAggregatorsRule.java
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.runners.spark;
+
+import org.apache.beam.runners.spark.aggregators.AccumulatorSingleton;
+import org.junit.rules.ExternalResource;
+
+/**
+ * A rule that clears the {@link 
org.apache.beam.runners.spark.aggregators.AccumulatorSingleton}
+ * which represents the Beam {@link 
org.apache.beam.sdk.transforms.Aggregator}s.
+ */
+class ClearAggregatorsRule extends ExternalResource {
+  @Override
+  protected void before() throws Throwable {
+AccumulatorSingleton.clear();
+  }
+}

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0feb6499/runners/spark/src/test/java/org/apache/beam/runners/spark/SimpleWordCountTest.java
--
diff --git 
a/runners/spark/src/test/java/org/apache/beam/runners/spark/SimpleWordCountTest.java
 
b/runners/spark/src/test/java/org/apache/beam/runners/spark/SimpleWordCountTest.java
index 8b7762f..238d7ba 100644
--- 
a/runners/spark/src/test/java/org/apache/beam/runners/spark/SimpleWordCountTest.java
+++ 
b/runners/spark/src/test/java/org/apache/beam/runners/spark/SimpleWordCountTest.java
@@ -29,6 +29,7 @@ import java.io.File;
 import java.util.Arrays;
 import java.util.List;
 import java.util.Set;
+
 import org.apache.beam.runners.spark.aggregators.metrics.sink.InMemoryMetrics;
 import org.apache.beam.runners.spark.examples.WordCount;
 import org.apache.beam.sdk.Pipeline;
@@ -53,6 +54,9 @@ public class SimpleWordCountTest {
   @Rule
   public ExternalResource inMemoryMetricsSink = new InMemoryMetricsSinkRule();
 
+  @Rule
+  public ClearAggregatorsRule clearAggregators = new ClearAggregatorsRule();
+
   private static final String[] WORDS_ARRAY = {
   "hi there", "hi", "hi sue bob",
   "hi sue", "", "bob hi"};

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0feb6499/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/SideEffectsTest.java
--
diff --git 
a/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/SideEffectsTest.java
 
b/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/SideEffectsTest.java
index 0d15d12..f85baab 100644
--- 
a/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/SideEffectsTest.java
+++ 
b/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/SideEffectsTest.java
@@ -67,8 +67,7 @@ public class SideEffectsTest implements Serializable {
 
   // TODO: remove the version check (and the setup and teardown methods) 
when we no
   // longer support Spark 1.3 or 1.4
-  String version = 
SparkContextFactory.getSparkContext(options.getSparkMaster(),
-  options.getAppName()).version();
+  String version = SparkContextFactory.getSparkContext(options).version();
   if (!version.startsWith("1.3.") && !version.startsWith("1.4.")) {
 assertTrue(e.getCause() instanceof UserException);
   }

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0feb6499/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/streaming/FlattenStreamingTest.java
--
diff --git 
a/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/streaming/FlattenStreamingTest.java
 
b/runners/spark/src/test/java/org/apache/beam/runners/spark/translation/streaming/FlattenStreamingTest.java
index a6fe755..8210b0d 100644
--- 

[4/4] incubator-beam git commit: This closes #909

2016-09-21 Thread amitsela
This closes #909


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/1ceb12ae
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/1ceb12ae
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/1ceb12ae

Branch: refs/heads/master
Commit: 1ceb12aebd0ffa63bd28d31cbe830230713705ec
Parents: 5c23f49 0feb649
Author: Sela 
Authored: Wed Sep 21 20:17:38 2016 +0300
Committer: Sela 
Committed: Wed Sep 21 20:17:38 2016 +0300

--
 .../runners/spark/SparkPipelineOptions.java |  28 +-
 .../apache/beam/runners/spark/SparkRunner.java  | 121 ++--
 .../spark/aggregators/AccumulatorSingleton.java |  53 ++
 .../runners/spark/translation/DoFnFunction.java |  35 +-
 .../spark/translation/EvaluationContext.java|  17 +-
 .../translation/GroupCombineFunctions.java  | 262 +
 .../spark/translation/MultiDoFnFunction.java|  44 +-
 .../spark/translation/SparkContextFactory.java  |  48 +-
 .../translation/SparkPipelineEvaluator.java |  57 --
 .../translation/SparkPipelineTranslator.java|   5 +-
 .../spark/translation/SparkProcessContext.java  |  10 +-
 .../spark/translation/SparkRuntimeContext.java  |  44 +-
 .../spark/translation/TransformTranslator.java  | 473 +++-
 .../spark/translation/TranslationUtils.java | 195 +++
 .../SparkRunnerStreamingContextFactory.java |  98 
 .../streaming/StreamingEvaluationContext.java   |  44 +-
 .../streaming/StreamingTransformTranslator.java | 549 ---
 .../runners/spark/util/BroadcastHelper.java |  26 +
 .../runners/spark/ClearAggregatorsRule.java |  33 ++
 .../beam/runners/spark/SimpleWordCountTest.java |   4 +
 .../spark/translation/SideEffectsTest.java  |   3 +-
 .../streaming/FlattenStreamingTest.java |  54 +-
 .../streaming/KafkaStreamingTest.java   |  26 +-
 .../RecoverFromCheckpointStreamingTest.java | 179 ++
 .../streaming/SimpleStreamingWordCountTest.java |  25 +-
 .../utils/TestOptionsForStreaming.java  |  55 ++
 .../org/apache/beam/sdk/transforms/Combine.java |   7 +
 27 files changed, 1682 insertions(+), 813 deletions(-)
--




[3/4] incubator-beam git commit: [BEAM-610] Enable Spark's checkpointing mechanism for driver-failure recovery in streaming.

2016-09-21 Thread amitsela
[BEAM-610] Enable Spark's checkpointing mechanism for driver-failure recovery 
in streaming.

Refactor translation mechanism to support checkpointing of DStream.

Support basic functionality with GroupByKey and ParDo.

Added support for grouping operations.

Added checkpointDir option, using it before execution.

Support Accumulator recovery from checkpoint.

Streaming tests should use JUnit's TemporaryFolder Rule for the checkpoint 
directory.

Support combine optimizations.

Support durable sideInput via Broadcast.

Branches in the pipeline are either Bounded or Unbounded and should be handled 
accordingly.

Handle flatten/union of Bounded/Unbounded RDD/DStream.

JavaDoc

Rebased on master.

Reuse functionality between batch and streaming translators.

Better implementation of streaming/batch pipeline-branch translation.

Move group/combine functions to their own wrapping class.

Fixed missing licenses.

Use VisibleForTesting annotation instead of comment.

Remove Broadcast failure recovery, to be handled separately.

Stop streaming gracefully, so any checkpointing will finish first.

typo + better documentation.

Validate checkpointDir durability.

Add checkpoint duration option.

A more compact streaming test init with Rules.

A more accurate test; removed broadcast from the test as it will be handled 
separately.

Bounded/Unbounded translation to be handled by the SparkPipelineTranslator 
implementation. The Evaluator decides between translateBounded and 
translateUnbounded according to the visited node's boundedness.
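
For context: this change builds on Spark's own checkpoint-recovery contract, 
which the new SparkRunnerStreamingContextFactory wraps. A minimal, hypothetical 
Spark 1.x sketch of that contract (the checkpoint path and the placeholder 
source below are illustrative, not runner code):

{code:java}
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

public class CheckpointRecoverySketch {
  public static void main(String[] args) {
    // Illustrative path; in the runner this comes from the new checkpointDir option.
    final String checkpointDir = "/tmp/beam-spark-checkpoint";

    JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {
        // First run only: build the streaming graph (the runner translates the
        // Beam pipeline here) and register the checkpoint directory.
        JavaStreamingContext jssc =
            new JavaStreamingContext("local[2]", "checkpoint-sketch", Durations.seconds(1));
        jssc.checkpoint(checkpointDir);
        jssc.socketTextStream("localhost", 9999).print();  // placeholder graph
        return jssc;
      }
    };

    // On restart, Spark rebuilds the context (graph and progress) from the
    // checkpoint instead of calling create(); the runner additionally restores
    // its aggregator accumulators at this point.
    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, factory);
    jssc.start();
    jssc.awaitTermination();
    // Stopping gracefully lets any in-flight checkpointing finish first:
    // jssc.stop(true /* stopSparkContext */, true /* stopGracefully */);
  }
}
{code}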


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/0feb6499
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/0feb6499
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/0feb6499

Branch: refs/heads/master
Commit: 0feb64994a05de4fe6b1ba178a38d03743b89b7a
Parents: 5c23f49
Author: Sela 
Authored: Thu Aug 25 23:49:01 2016 +0300
Committer: Sela 
Committed: Wed Sep 21 20:15:27 2016 +0300

--
 .../runners/spark/SparkPipelineOptions.java |  28 +-
 .../apache/beam/runners/spark/SparkRunner.java  | 121 ++--
 .../spark/aggregators/AccumulatorSingleton.java |  53 ++
 .../runners/spark/translation/DoFnFunction.java |  35 +-
 .../spark/translation/EvaluationContext.java|  17 +-
 .../translation/GroupCombineFunctions.java  | 262 +
 .../spark/translation/MultiDoFnFunction.java|  44 +-
 .../spark/translation/SparkContextFactory.java  |  48 +-
 .../translation/SparkPipelineEvaluator.java |  57 --
 .../translation/SparkPipelineTranslator.java|   5 +-
 .../spark/translation/SparkProcessContext.java  |  10 +-
 .../spark/translation/SparkRuntimeContext.java  |  44 +-
 .../spark/translation/TransformTranslator.java  | 473 +++-
 .../spark/translation/TranslationUtils.java | 195 +++
 .../SparkRunnerStreamingContextFactory.java |  98 
 .../streaming/StreamingEvaluationContext.java   |  44 +-
 .../streaming/StreamingTransformTranslator.java | 549 ---
 .../runners/spark/util/BroadcastHelper.java |  26 +
 .../runners/spark/ClearAggregatorsRule.java |  33 ++
 .../beam/runners/spark/SimpleWordCountTest.java |   4 +
 .../spark/translation/SideEffectsTest.java  |   3 +-
 .../streaming/FlattenStreamingTest.java |  54 +-
 .../streaming/KafkaStreamingTest.java   |  26 +-
 .../RecoverFromCheckpointStreamingTest.java | 179 ++
 .../streaming/SimpleStreamingWordCountTest.java |  25 +-
 .../utils/TestOptionsForStreaming.java  |  55 ++
 .../org/apache/beam/sdk/transforms/Combine.java |   7 +
 27 files changed, 1682 insertions(+), 813 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0feb6499/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
--
diff --git 
a/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
 
b/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
index db6b75c..7afb68c 100644
--- 
a/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
+++ 
b/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java
@@ -19,9 +19,9 @@
 package org.apache.beam.runners.spark;
 
 import com.fasterxml.jackson.annotation.JsonIgnore;
-
 import org.apache.beam.sdk.options.ApplicationNameOptions;
 import org.apache.beam.sdk.options.Default;
+import org.apache.beam.sdk.options.DefaultValueFactory;
 import org.apache.beam.sdk.options.Description;
 import org.apache.beam.sdk.options.PipelineOptions;
 import org.apache.beam.sdk.options.StreamingOptions;
@@ -48,6 +48,32 @@ public interface SparkPipelineOptions extends 

[jira] [Updated] (BEAM-656) Add support for side inputs/outputs to Java 8 lambdas in MapElements

2016-09-21 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-656:
-
Description: 
Currently there's no way to use side inputs or outputs within Java 8 lambdas 
and MapElements.  It would be nice if you could do something like this:

{code:java}
PCollection<Integer> wordLengths = words.apply(
    MapElements.via((String word) -> {
      int sideInput1 = [[[ GetSideInputHere(); ]]]
      [[[ SetSideOutputHere ]]] (sideInput1 + word.length());
      return word.length();
    }).withOutputType(new TypeDescriptor<Integer>() {}));
{code}

  was:
Currently there's no way to use side inputs or outputs within Java 8 lambdas 
and MapElements.  It would be nice if you could do something like this:

PCollection<Integer> wordLengths = words.apply(
    MapElements.via((String word) -> {
      int sideInput1 = [[[ GetSideInputHere(); ]]]
      [[[ SetSideOutputHere ]]] (sideInput1 + word.length());
      return word.length();
    }).withOutputType(new TypeDescriptor<Integer>() {}));



> Add support for side inputs/outputs to Java 8 lambdas in MapElements
> 
>
> Key: BEAM-656
> URL: https://issues.apache.org/jira/browse/BEAM-656
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-ideas
> Environment: Java 8
>Reporter: Jani Patokallio
>Assignee: James Malone
>Priority: Minor
>
> Currently there's no way to use side inputs or outputs within Java 8 lambdas 
> and MapElements.  It would be nice if you could do something like this:
> {code:java}
> PCollection<Integer> wordLengths = words.apply(
>     MapElements.via((String word) -> {
>       int sideInput1 = [[[ GetSideInputHere(); ]]]
>       [[[ SetSideOutputHere ]]] (sideInput1 + word.length());
>       return word.length();
>     }).withOutputType(new TypeDescriptor<Integer>() {}));
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-21 Thread Jesse Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510280#comment-15510280
 ] 

Jesse Anderson commented on BEAM-645:
-

They may well be. We can close for now. Just need to test this scenario once 
those bugs are fixed.

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To workaround this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
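
A possibly useful aside on the client-side check itself, assuming this SDK 
version's TextIO.Read still exposes withoutValidation(): it skips only the 
local existence check in the stack trace above, and does not by itself make 
hdfs:// reads work (that is BEAM-59 / BEAM-657 territory). The path is 
illustrative:

{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SkipLocalValidation {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Bypass only the submission-time file check; reading still depends on
    // the runner/SDK actually supporting the hdfs:// scheme.
    p.apply(TextIO.Read.from("hdfs:///user/beam/Macbeth.txt").withoutValidation());
    p.run();
  }
}
{code}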



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-21 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509955#comment-15509955
 ] 

Amit Sela edited comment on BEAM-645 at 9/21/16 3:20 PM:
-

[~eljefe6aa] the SDK doesn't support TextIO with HDFS.
The SparkRunner doesn't support the Read.Bounded primitive.
Both are covered by BEAM-59 and BEAM-17 (now more accurately BEAM-657).

Any reason this is not a duplicate?


was (Author: amitsela):
[~eljefe6aa] do you run this example with other runners where the input is on 
HDFS?

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To workaround this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-21 Thread Jesse Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510270#comment-15510270
 ] 

Jesse Anderson commented on BEAM-645:
-

I don't have a Flink cluster to test this on.

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To workaround this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-658) Support Read.Unbounded primitive

2016-09-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/BEAM-658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510215#comment-15510215
 ] 

Jean-Baptiste Onofré commented on BEAM-658:
---

+1

IMHO, it would be great to implement it on master to be able to include it in 
the next release (on the Spark 1.x runner).

> Support Read.Unbounded primitive
> 
>
> Key: BEAM-658
> URL: https://issues.apache.org/jira/browse/BEAM-658
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Amit Sela
>
> Spark runner support for Read.Unbounded primitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS

2016-09-21 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509955#comment-15509955
 ] 

Amit Sela commented on BEAM-645:


[~eljefe6aa] do you run this example with other runners where the input is on 
HDFS?

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -
>
> Key: BEAM-645
> URL: https://issues.apache.org/jira/browse/BEAM-645
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 0.3.0-incubating
>Reporter: Jesse Anderson
>Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner 
> uses the input file in HDFS. When the program performs its startup checks, it 
> looks for the file in the local filesystem.
> To workaround this issue, you have to create a file in the local filesystem 
> and put the actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to 
> find any files matching Macbeth.txt
>   at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
>   at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
>   at 
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
>   at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
>   at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
>   at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
>   at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
>   at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
>   at org.apache.beam.examples.WordCount.main(WordCount.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1181

2016-09-21 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-658) Support Read.Unbounded primitive

2016-09-21 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509674#comment-15509674
 ] 

Amit Sela commented on BEAM-658:


Suggested design: 
https://docs.google.com/document/d/12BzHbETDt7ICIF7vc8zzCeLllmIpvvaVDIdBlcIwE1M/edit?usp=sharing.

> Support Read.Unbounded primitive
> 
>
> Key: BEAM-658
> URL: https://issues.apache.org/jira/browse/BEAM-658
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Amit Sela
>
> Spark runner support for Read.Unbounded primitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-658) Support Read.Unbounded primitive

2016-09-21 Thread Amit Sela (JIRA)
Amit Sela created BEAM-658:
--

 Summary: Support Read.Unbounded primitive
 Key: BEAM-658
 URL: https://issues.apache.org/jira/browse/BEAM-658
 Project: Beam
  Issue Type: Sub-task
  Components: runner-spark
Reporter: Amit Sela
Assignee: Amit Sela


Spark runner support for Read.Unbounded primitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-657) Support Read.Bounded primitive

2016-09-21 Thread Amit Sela (JIRA)
Amit Sela created BEAM-657:
--

 Summary: Support Read.Bounded primitive
 Key: BEAM-657
 URL: https://issues.apache.org/jira/browse/BEAM-657
 Project: Beam
  Issue Type: Sub-task
  Components: runner-spark
Reporter: Amit Sela
Assignee: Amit Sela


Spark runner support for Beam's primitive Read.Bounded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-17) Add support for new Beam Source API

2016-09-21 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-17?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509658#comment-15509658
 ] 

Amit Sela commented on BEAM-17:
---

Same for streaming, with UnboundedSource, but the challenge there is greater 
since a "checkpointing" mechanism needs to be available on the workers.

> Add support for new Beam Source API
> ---
>
> Key: BEAM-17
> URL: https://issues.apache.org/jira/browse/BEAM-17
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Amit Sela
>
> The API is discussed in 
> https://cloud.google.com/dataflow/model/sources-and-sinks#creating-sources
> To implement this, we need to add support for 
> com.google.cloud.dataflow.sdk.io.Read in TransformTranslator. This can be 
> done by creating a new SourceInputFormat class that translates from a DF 
> Source to a Hadoop InputFormat. The two concepts are pretty-well aligned 
> since they both have the concept of splits and readers.
> Note that when there's a native HadoopSource in DF, it will need 
> special-casing in the code for Read since we'll be able to use the underlying 
> InputFormat directly.
> This could be tested using XmlSource from the SDK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (BEAM-17) Add support for new Beam Source API

2016-09-21 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-17?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348444#comment-15348444
 ] 

Amit Sela edited comment on BEAM-17 at 9/21/16 11:44 AM:
-

Since the SDK has moved to the Apache Incubator, it now provides a Runner API, 
and "matching" InputFormats no longer seems like the "Beam way".
Instead, a solution in the form of an RDD backed by BoundedSource makes more 
sense.
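
A minimal sketch of what such a Source-backed RDD boils down to, under the 
Beam 0.x API (helper names are illustrative, not actual runner code): each 
bundle from splitIntoBundles seeds one partition, and each partition is 
computed by draining a reader:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

class SourceRddSketch {
  /** One RDD partition per bundle. */
  static <T> List<BoundedSource<T>> planPartitions(
      BoundedSource<T> source, long desiredBundleSizeBytes, PipelineOptions options)
      throws Exception {
    return new ArrayList<>(source.splitIntoBundles(desiredBundleSizeBytes, options));
  }

  /** What a partition's compute() would do: drain the bundle's reader. */
  static <T> List<T> computePartition(BoundedSource<T> bundle, PipelineOptions options)
      throws IOException {
    List<T> out = new ArrayList<>();
    BoundedSource.BoundedReader<T> reader = bundle.createReader(options);
    try {
      for (boolean more = reader.start(); more; more = reader.advance()) {
        out.add(reader.getCurrent());
      }
    } finally {
      reader.close();
    }
    return out;
  }
}
{code}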


was (Author: amitsela):
Implementing for next gen. Spark runner - Spark 2.x

> Add support for new Beam Source API
> ---
>
> Key: BEAM-17
> URL: https://issues.apache.org/jira/browse/BEAM-17
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Amit Sela
>
> The API is discussed in 
> https://cloud.google.com/dataflow/model/sources-and-sinks#creating-sources
> To implement this, we need to add support for 
> com.google.cloud.dataflow.sdk.io.Read in TransformTranslator. This can be 
> done by creating a new SourceInputFormat class that translates from a DF 
> Source to a Hadoop InputFormat. The two concepts are pretty-well aligned 
> since they both have the concept of splits and readers.
> Note that when there's a native HadoopSource in DF, it will need 
> special-casing in the code for Read since we'll be able to use the underlying 
> InputFormat directly.
> This could be tested using XmlSource from the SDK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build is still unstable: beam_PostCommit_RunnableOnService_GoogleCloudDataflow #1180

2016-09-21 Thread Apache Jenkins Server
See