[jira] [Commented] (BEAM-2574) test unsupported/invalid cases in DSL

2017-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083436#comment-16083436
 ] 

ASF GitHub Bot commented on BEAM-2574:
--

Github user XuMingmin closed the pull request at:

https://github.com/apache/beam/pull/3530


> test unsupported/invalid cases in DSL
> -
>
> Key: BEAM-2574
> URL: https://issues.apache.org/jira/browse/BEAM-2574
> Project: Beam
>  Issue Type: Task
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: dsl_sql_merge
>
> Add test cases to cover scenarios that are not supported or have invalid 
> usages.
> Note: the previous test failure occurred because {{TestPipeline}} was marked 
> as a {{ClassRule}}; it works when marked as a {{Rule}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3530: [BEAM-2574] test unsupported/invalid cases in DSL

2017-07-11 Thread XuMingmin
Github user XuMingmin closed the pull request at:

https://github.com/apache/beam/pull/3530


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] beam git commit: [BEAM-2574] This closes #3530

2017-07-11 Thread takidau
[BEAM-2574] This closes #3530


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/b8fa0add
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/b8fa0add
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/b8fa0add

Branch: refs/heads/DSL_SQL
Commit: b8fa0addc86d2249cd2fbbce3d8ad98d27e04604
Parents: ca2bc72 794f190
Author: Tyler Akidau 
Authored: Tue Jul 11 21:05:15 2017 -0700
Committer: Tyler Akidau 
Committed: Tue Jul 11 21:05:15 2017 -0700

--
 .../dsls/sql/BeamSqlDslAggregationTest.java | 30 ++
 .../apache/beam/dsls/sql/BeamSqlDslBase.java| 17 +---
 .../beam/dsls/sql/BeamSqlDslFilterTest.java | 41 
 .../beam/dsls/sql/BeamSqlDslProjectTest.java| 15 +++
 4 files changed, 98 insertions(+), 5 deletions(-)
--




[1/2] beam git commit: Test unsupported/invalid cases in DSL tests.

2017-07-11 Thread takidau
Repository: beam
Updated Branches:
  refs/heads/DSL_SQL ca2bc723d -> b8fa0addc


Test unsupported/invalid cases in DSL tests.


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/794f1901
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/794f1901
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/794f1901

Branch: refs/heads/DSL_SQL
Commit: 794f1901dcf5fb520a849adce7ce436e8b2f8535
Parents: ca2bc72
Author: mingmxu 
Authored: Sun Jul 9 22:26:29 2017 -0700
Committer: Tyler Akidau 
Committed: Tue Jul 11 21:04:07 2017 -0700

--
 .../dsls/sql/BeamSqlDslAggregationTest.java | 30 ++
 .../apache/beam/dsls/sql/BeamSqlDslBase.java| 17 +---
 .../beam/dsls/sql/BeamSqlDslFilterTest.java | 41 
 .../beam/dsls/sql/BeamSqlDslProjectTest.java| 15 +++
 4 files changed, 98 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/794f1901/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java
--
diff --git a/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java b/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java
index b0509ae..f92c803 100644
--- a/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java
+++ b/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java
@@ -257,4 +257,34 @@ public class BeamSqlDslAggregationTest extends BeamSqlDslBase {
 
     pipeline.run().waitUntilFinish();
   }
+
+  @Test
+  public void testWindowOnNonTimestampField() throws Exception {
+    exceptions.expect(IllegalStateException.class);
+    exceptions.expectMessage(
+        "Cannot apply 'TUMBLE' to arguments of type 'TUMBLE(, )'");
+    pipeline.enableAbandonedNodeEnforcement(false);
+
+    String sql = "SELECT f_int2, COUNT(*) AS `size` FROM TABLE_A "
+        + "GROUP BY f_int2, TUMBLE(f_long, INTERVAL '1' HOUR)";
+    PCollection<BeamSqlRow> result =
+        PCollectionTuple.of(new TupleTag<BeamSqlRow>("TABLE_A"), inputA1)
+            .apply("testWindowOnNonTimestampField", BeamSql.query(sql));
+
+    pipeline.run().waitUntilFinish();
+  }
+
+  @Test
+  public void testUnsupportedDistinct() throws Exception {
+    exceptions.expect(IllegalStateException.class);
+    exceptions.expectMessage("Encountered \"*\"");
+    pipeline.enableAbandonedNodeEnforcement(false);
+
+    String sql = "SELECT f_int2, COUNT(DISTINCT *) AS `size` FROM PCOLLECTION GROUP BY f_int2";
+
+    PCollection<BeamSqlRow> result =
+        inputA1.apply("testUnsupportedDistinct", BeamSql.simpleQuery(sql));
+
+    pipeline.run().waitUntilFinish();
+  }
 }

http://git-wip-us.apache.org/repos/asf/beam/blob/794f1901/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslBase.java
--
diff --git a/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslBase.java b/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslBase.java
index d62bdc4..308dcb6 100644
--- a/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslBase.java
+++ b/dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslBase.java
@@ -31,8 +31,10 @@ import org.apache.beam.sdk.testing.TestPipeline;
 import org.apache.beam.sdk.transforms.Create;
 import org.apache.beam.sdk.values.PBegin;
 import org.apache.beam.sdk.values.PCollection;
+import org.junit.Before;
 import org.junit.BeforeClass;
-import org.junit.ClassRule;
+import org.junit.Rule;
+import org.junit.rules.ExpectedException;
 
 /**
  * prepare input records to test {@link BeamSql}.
@@ -43,14 +45,16 @@ import org.junit.ClassRule;
 public class BeamSqlDslBase {
   public static final DateFormat FORMAT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
 
-  @ClassRule
-  public static TestPipeline pipeline = TestPipeline.create();
+  @Rule
+  public final TestPipeline pipeline = TestPipeline.create();
+  @Rule
+  public ExpectedException exceptions = ExpectedException.none();
 
   public static BeamSqlRecordType recordTypeInTableA;
   public static List<BeamSqlRow> recordsInTableA;
 
-  public static PCollection<BeamSqlRow> inputA1;
-  public static PCollection<BeamSqlRow> inputA2;
+  public PCollection<BeamSqlRow> inputA1;
+  public PCollection<BeamSqlRow> inputA2;
 
   @BeforeClass
   public static void prepareClass() throws ParseException {
@@ -61,7 +65,10 @@ public class BeamSqlDslBase {
         Types.DOUBLE, Types.VARCHAR, Types.TIMESTAMP, Types.INTEGER));
 
     recordsInTableA = prepareInputRecordsInTableA();
+  }
 
+  @Before
+  public void preparePCollections(){
     inputA1 = PBegin.in(pipeline).apply("inputA1", Create.of(recordsInTableA)
         .withCoder(new BeamSqlRowCoder(recordTypeInTableA)));
 


Jenkins build is back to normal : beam_PostCommit_Python_Verify #2711

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Created] (BEAM-2601) FileBasedSink produces incorrect shards when writing to multiple destinations

2017-07-11 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-2601:


 Summary: FileBasedSink produces incorrect shards when writing to 
multiple destinations
 Key: BEAM-2601
 URL: https://issues.apache.org/jira/browse/BEAM-2601
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Davor Bonaci
 Fix For: 2.2.0


FileBasedSink now supports multiple dynamic destinations; however, it finalizes 
all files in a bundle without paying attention to the destination. This means 
that the shard counts will be incorrect across these destinations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2602) Fully support dropwizard metrics in beam and runners

2017-07-11 Thread Cody (JIRA)
Cody created BEAM-2602:
--

 Summary: Fully support dropwizard metrics in beam and runners
 Key: BEAM-2602
 URL: https://issues.apache.org/jira/browse/BEAM-2602
 Project: Beam
  Issue Type: Improvement
  Components: runner-core, runner-flink, runner-spark, sdk-java-core
Affects Versions: Not applicable
Reporter: Cody
Assignee: Kenneth Knowles
 Fix For: 2.2.0


As proposed at 
https://docs.google.com/document/d/1-35iyCIJ9P4EQONlakgXBFRGUYoOLanq2Uf2sw5EjJw/edit?usp=sharing
 , I'd like to add full support for Dropwizard metrics by delegating Beam 
metrics to runners.

The proposal involves a few subtasks, as far as I can see, including:
1. Add a Meter interface in sdk-java-core and extend Distribution to support 
quantiles (a sketch of what such an interface might look like follows below).
2. Add MeterData and extend DistributionData. Merge 
{Counter/Meter/Gauge/Distribution}Data instead of 
Counter/Meter/Gauge/Distribution in the accumulators.
3. Update the runners to work with the improved metrics.

I will create subtasks later if there's no objection.
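A hypothetical sketch of what such a Meter interface might look like, modeled 
on Dropwizard's Meter (the names and methods here are illustrative 
assumptions, not the proposal's actual API):

{code}
// Hypothetical sketch only: a Meter metric type alongside Beam's existing
// Counter/Gauge/Distribution interfaces, modeled on Dropwizard's Meter.
public interface Meter {
  /** Record one occurrence of the metered event. */
  void mark();

  /** Record {@code n} occurrences of the metered event. */
  void mark(long n);
}
{code}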



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2600) Artifact for Python SDK harness that can be referenced in pipeline definition

2017-07-11 Thread Ahmet Altay (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083343#comment-16083343
 ] 

Ahmet Altay commented on BEAM-2600:
---

cc: [~tvalentyn]

We might consider generalizing the Dataflow runner option 
{{worker_harness_container_image}}. Is there an equivalent for this today for 
?

> Artifact for Python SDK harness that can be referenced in pipeline definition
> -
>
> Key: BEAM-2600
> URL: https://issues.apache.org/jira/browse/BEAM-2600
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Kenneth Knowles
>Assignee: Ahmet Altay
>  Labels: beam-python-everywhere
>
> In order to build a pipeline that invokes a Python UDF, we need to be able to 
> construct something like this:
> {code}
> SdkFunctionSpec {
>   environment = ,
>   spec = {
> urn = ,
> data = 
>   }
> }
> {code}
> I do not know that there exists anything we can put for " harness>" today. For prototyping, it could be just a symbol that runners have 
> to know. But eventually it should be something that runners can instantiate 
> without knowing anything about the SDK that put it there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2600) Artifact for Python SDK harness that can be referenced in pipeline definition

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-2600:
--
Description: 
In order to build a pipeline that invokes a Python UDF, we need to be able to 
construct something like this:

{code}
SdkFunctionSpec {
  environment = ,
  spec = {
urn = ,
data = 
  }
}
{code}

I could be out of date, but based on a couple of conversations I do not know 
that there exists anything we can put for "" today. For 
prototyping, it could be just a symbol that runners have to know. But 
eventually it should be something that runners can instantiate without knowing 
anything about the SDK that put it there. I imagine it may encompass "custom 
containers" eventually, though that doesn't block anything immediately.

  was:
In order to build a pipeline that invokes a Python UDF, we need to be able to 
construct something like this:

{code}
SdkFunctionSpec {
  environment = ,
  spec = {
urn = ,
data = 
  }
}
{code}

I do not know that there exists anything we can put for "" 
today. For prototyping, it could be just a symbol that runners have to know. 
But eventually it should be something that runners can instantiate without 
knowing anything about the SDK that put it there.


> Artifact for Python SDK harness that can be referenced in pipeline definition
> -
>
> Key: BEAM-2600
> URL: https://issues.apache.org/jira/browse/BEAM-2600
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Kenneth Knowles
>Assignee: Ahmet Altay
>  Labels: beam-python-everywhere
>
> In order to build a pipeline that invokes a Python UDF, we need to be able to 
> construct something like this:
> {code}
> SdkFunctionSpec {
>   environment = ,
>   spec = {
> urn = ,
> data = 
>   }
> }
> {code}
> I could be out of date, but based on a couple of conversations I do not know 
> that there exists anything we can put for "" today. For 
> prototyping, it could be just a symbol that runners have to know. But 
> eventually it should be something that runners can instantiate without 
> knowing anything about the SDK that put it there. I imagine it may encompass 
> "custom containers" eventually, though that doesn't block anything 
> immediately.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2601) FileBasedSink produces incorrect shards when writing to multiple destinations

2017-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083336#comment-16083336
 ] 

ASF GitHub Bot commented on BEAM-2601:
--

GitHub user reuvenlax opened a pull request:

https://github.com/apache/beam/pull/3546

[BEAM-2601] Fix broken per-destination finalization.

We now finalize each destination separately. Since temporary-file cleanup 
happens (for the non-windowed case) by deleting the entire temp directory, we 
delay cleanup until all destinations have been finalized.
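Conceptually, the fix looks something like the following sketch (a rough 
illustration with hypothetical names, not the actual FileBasedSink internals):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: finalize written temp files per destination, so
// shard numbers are assigned within each destination rather than across the
// whole bundle, and clean up the shared temp directory only once at the end.
class PerDestinationFinalizer<DestinationT, ResultT> {
  interface Finalizer<DestinationT, ResultT> {
    void finalizeDestination(DestinationT destination, List<ResultT> results);
  }

  void finalizeAll(
      Map<ResultT, DestinationT> resultToDestination,
      Finalizer<DestinationT, ResultT> finalizer,
      Runnable cleanupTempDirectory) {
    // Group results by their destination.
    Map<DestinationT, List<ResultT>> byDestination = new HashMap<>();
    resultToDestination.forEach(
        (result, dest) ->
            byDestination.computeIfAbsent(dest, d -> new ArrayList<>()).add(result));

    // Finalize each destination separately.
    byDestination.forEach(finalizer::finalizeDestination);

    // Delay temp-directory cleanup until every destination is finalized.
    cleanupTempDirectory.run();
  }
}
{code}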

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reuvenlax/incubator-beam 
fix_dynamic_destination_sharding

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3546.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3546


commit 7254dcf312251a462c70b24c8ff8861c2f009b2b
Author: Reuven Lax 
Date:   2017-07-12T02:19:42Z

Fix per-destination finalization.




> FileBasedSink produces incorrect shards when writing to multiple destinations
> -
>
> Key: BEAM-2601
> URL: https://issues.apache.org/jira/browse/BEAM-2601
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Davor Bonaci
> Fix For: 2.2.0
>
>
> FileBasedSink now supports multiple dynamic destinations; however, it 
> finalizes all files in a bundle without paying attention to the destination. 
> This means that the shard counts will be incorrect across these destinations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3546: [BEAM-2601] Fix broken per-destination finalization...

2017-07-11 Thread reuvenlax
GitHub user reuvenlax opened a pull request:

https://github.com/apache/beam/pull/3546

[BEAM-2601] Fix broken per-destination finalization.

We now finalize each destination separately. Since temporary-file cleanup 
happens (for the non-windowed case) by deleting the entire temp directory, we 
delay cleanup until all destinations have been finalized.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reuvenlax/incubator-beam 
fix_dynamic_destination_sharding

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3546.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3546


commit 7254dcf312251a462c70b24c8ff8861c2f009b2b
Author: Reuven Lax 
Date:   2017-07-12T02:19:42Z

Fix per-destination finalization.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-2598) SparkRunner Fn API based ParDo operator

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2598:
-

 Summary: SparkRunner Fn API based ParDo operator
 Key: BEAM-2598
 URL: https://issues.apache.org/jira/browse/BEAM-2598
 Project: Beam
  Issue Type: New Feature
  Components: runner-spark
Reporter: Kenneth Knowles
Assignee: Amit Sela


Running non-Java SDK code requires putting together an operator that manages a 
Fn API client DoFnRunner and an SDK harness Fn API server.

(filing this to organize steps; details may evolve as it is implemented)
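A highly schematic sketch of the rough shape such an operator could take (all 
names here are hypothetical, not actual Beam or runner APIs):

{code}
// Schematic sketch only: the operator manages (1) a Fn API server that an SDK
// harness process connects to, and (2) a client-side runner that ships each
// bundle of elements to that harness for execution of the non-Java DoFn.
class FnApiParDoOperator<T> {
  /** Hypothetical handle for the Fn API connection to the SDK harness. */
  interface HarnessConnection<T> {
    void start();                 // bring up control/data servers, await the harness
    void processElement(T elem);  // send the element over the Fn API data plane
    void finishBundle();          // flush and wait for the harness to finish
    void close();
  }

  private final HarnessConnection<T> connection;

  FnApiParDoOperator(HarnessConnection<T> connection) {
    this.connection = connection;
  }

  void open() { connection.start(); }
  void process(T element) { connection.processElement(element); }
  void closeBundle() { connection.finishBundle(); }
  void close() { connection.close(); }
}
{code}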



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2597) FlinkRunner Fn API based ParDo operator

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2597:
-

 Summary: FlinkRunner Fn API based ParDo operator
 Key: BEAM-2597
 URL: https://issues.apache.org/jira/browse/BEAM-2597
 Project: Beam
  Issue Type: New Feature
  Components: runner-flink
Reporter: Kenneth Knowles
Assignee: Aljoscha Krettek


Running non-Java SDK code requires putting together an operator that manages a 
Fn API client DoFnRunner and an SDK harness Fn API server.

(filing this to organize steps; details may evolve as it is implemented)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2600) Artifact for Python SDK harness that can be referenced in pipeline definition

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2600:
-

 Summary: Artifact for Python SDK harness that can be referenced in 
pipeline definition
 Key: BEAM-2600
 URL: https://issues.apache.org/jira/browse/BEAM-2600
 Project: Beam
  Issue Type: New Feature
  Components: sdk-py
Reporter: Kenneth Knowles
Assignee: Ahmet Altay


In order to build a pipeline that invokes a Python UDF, we need to be able to 
construct something like this:

{code}
SdkFunctionSpec {
  environment = ,
  spec = {
urn = ,
data = 
  }
}
{code}

I do not know that there exists anything we can put for "" 
today. For prototyping, it could be just a symbol that runners have to know. 
But eventually it should be something that runners can instantiate without 
knowing anything about the SDK that put it there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2599) ApexRunner Fn API based ParDo operator

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2599:
-

 Summary: ApexRunner Fn API based ParDo operator
 Key: BEAM-2599
 URL: https://issues.apache.org/jira/browse/BEAM-2599
 Project: Beam
  Issue Type: New Feature
  Components: runner-apex
Reporter: Kenneth Knowles


Running non-Java SDK code requires putting together an operator that manages a 
Fn API client DoFnRunner and an SDK harness Fn API server.

(filing this to organize steps; details may evolve as it is implemented)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-11 Thread Thomas Groh (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Groh reassigned BEAM-2595:
-

   Assignee: Ahmet Altay  (was: Thomas Groh)
Component/s: (was: runner-dataflow)
 sdk-py

> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Ahmet Altay
>Priority: Minor
>  Labels: gcp
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
> I need to write to a BigQuery table with nested fields.
> The only way to specify nested schemas in BigQuery is with the JSON schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> JSON schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
> and it looks fine when passed to the PTransform `WriteToBigQuery`.
> The problem is that the base class `PTransformWithSideInputs` tries to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
> (which includes the TableFieldSchema instance), and for some reason when the 
> class is unpickled some `FieldList` instances are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested JSON objects for 
> BigQuery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-11 Thread Ahmet Altay (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083300#comment-16083300
 ] 

Ahmet Altay commented on BEAM-2595:
---

[~andrea.pierleoni] Thank you for reporting this. Could you share the error you 
are getting?

[~sb2nov] Could you verify whether this is a regression or not? If it is a 
regression, can we mitigate it (add a comment/documentation to use the old way) 
before the release goes out?

In addition to the fix, I agree that we need a test if we don't have one. We 
should also update the examples (e.g. 
https://github.com/apache/beam/blob/91c7d3d1f7d72e84e773c1adbffed063aefdff3b/sdks/python/apache_beam/examples/cookbook/bigquery_schema.py#L116)

cc: [~chamikara]


> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
> I need to write to a BigQuery table with nested fields.
> The only way to specify nested schemas in BigQuery is with the JSON schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> JSON schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
> and it looks fine when passed to the PTransform `WriteToBigQuery`.
> The problem is that the base class `PTransformWithSideInputs` tries to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
> (which includes the TableFieldSchema instance), and for some reason when the 
> class is unpickled some `FieldList` instances are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested JSON objects for 
> BigQuery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-11 Thread Ahmet Altay (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Altay reassigned BEAM-2595:
-

Assignee: Sourabh Bajaj  (was: Ahmet Altay)

> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
> I need to write to a BigQuery table with nested fields.
> The only way to specify nested schemas in BigQuery is with the JSON schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> JSON schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
> and it looks fine when passed to the PTransform `WriteToBigQuery`.
> The problem is that the base class `PTransformWithSideInputs` tries to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
> (which includes the TableFieldSchema instance), and for some reason when the 
> class is unpickled some `FieldList` instances are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested JSON objects for 
> BigQuery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-11 Thread Ahmet Altay (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Altay updated BEAM-2595:
--
Fix Version/s: 2.1.0

> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
> I need to write to a BigQuery table with nested fields.
> The only way to specify nested schemas in BigQuery is with the JSON schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> JSON schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
> and it looks fine when passed to the PTransform `WriteToBigQuery`.
> The problem is that the base class `PTransformWithSideInputs` tries to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
> (which includes the TableFieldSchema instance), and for some reason when the 
> class is unpickled some `FieldList` instances are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested JSON objects for 
> BigQuery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Spark #2612

2017-07-11 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Flink #3378

2017-07-11 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : beam_PostCommit_Java_MavenInstall #4355

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Dmitry Demeshchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083268#comment-16083268
 ] 

Dmitry Demeshchuk commented on BEAM-2572:
-

Never mind my last comment: credentials would have to be provider-specific, 
which would indeed turn the interface into a {{ReadFromText('s3://somewhere', 
aws_credentials=..., gcp_credentials=..., azure_credentials=..., 
ftp_credentials=...)}} monstrosity.

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from 
> another client).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like 
> retrying.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes with AWS keys in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Build failed in Jenkins: beam_PostCommit_Python_Verify #2710

2017-07-11 Thread Apache Jenkins Server
See 


Changes:

[robertwb] Cleanup and fix ptransform_fn decorator.

--
[...truncated 39.75 KB...]
ERROR: invocation failed (exit code 1), logfile: 

ERROR: actionid: py27cython
msg: installpkg
cmdargs: 
[local('
 'install', 
'
env: {'JENKINS_HOME': '/x1/jenkins/jenkins-home', 'BUILD_CAUSE': 'SCMTRIGGER', 
'GIT_COMMIT': '91c7d3d1f7d72e84e773c1adbffed063aefdff3b', 'HUDSON_URL': 
'https://builds.apache.org/', 'BUILD_URL': 
'https://builds.apache.org/job/beam_PostCommit_Python_Verify/2710/', 
'GIT_PREVIOUS_COMMIT': '84682109b9ff186f8c850a0b71de5163949c2769', 'BUILD_TAG': 
'jenkins-beam_PostCommit_Python_Verify-2710', 'SSH_CLIENT': '140.211.11.14 
46584 22', 'JENKINS_URL': 'https://builds.apache.org/', 'LOGNAME': 'jenkins', 
'USER': 'jenkins', 'WORKSPACE': 
' 'HOME': 
'/home/jenkins', 'PATH': 
':/home/jenkins/tools/java/latest1.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games',
 'JOB_NAME': 'beam_PostCommit_Python_Verify', 'LANG': 'en_US.UTF-8', 
'RUN_DISPLAY_URL': 
'https://builds.apache.org/job/beam_PostCommit_Python_Verify/2710/display/redirect',
 'VIRTUAL_ENV': 
'
 'SHELL': '/bin/bash', 'GIT_PREVIOUS_SUCCESSFUL_COMMIT': 
'84682109b9ff186f8c850a0b71de5163949c2769', 'SHLVL': '3', 'NLSPATH': 
'/usr/dt/lib/nls/msg/%L/%N.cat', 'GIT_AUTHOR_EMAIL': 'bui...@apache.org', 
'HUDSON_HOME': '/x1/jenkins/jenkins-home', 'NODE_LABELS': 'beam beam3', 
'GIT_COMMITTER_EMAIL': 'bui...@apache.org', 'PYTHONHASHSEED': '3734765279', 
'JAVA_HOME': '/home/jenkins/tools/java/latest1.8', 'ROOT_BUILD_CAUSE': 
'SCMTRIGGER', 'BUILD_ID': '2710', 'BUILD_NUMBER': '2710', 'XDG_RUNTIME_DIR': 
'/run/user/9997', 'HUDSON_COOKIE': '598f2752-d067-4956-80a0-6681c48cc6c9', 
'JOB_URL': 'https://builds.apache.org/job/beam_PostCommit_Python_Verify/', 
'TEST_HOST': 'beam3', 'GIT_BRANCH': 'origin/master', 'JOB_BASE_NAME': 
'beam_PostCommit_Python_Verify', 'GIT_AUTHOR_NAME': 'jenkins', 
'GIT_COMMITTER_NAME': 'jenkins', 'XDG_SESSION_ID': '49', '_': 
'/home/jenkins/.local/bin//tox', 'sha1': 'master', 'COVERALLS_REPO_TOKEN': 
'', 'JOB_DISPLAY_URL': 
'https://builds.apache.org/job/beam_PostCommit_Python_Verify/display/redirect', 
'BUILD_CAUSE_SCMTRIGGER': 'true', 'HUDSON_SERVER_COOKIE': 'f4ebd1e6b0d976e8', 
'EXECUTOR_NUMBER': '0', 'NODE_NAME': 'beam3', 'PWD': 
' 
'SPARK_LOCAL_IP': '127.0.0.1', 'JENKINS_SERVER_COOKIE': 'f4ebd1e6b0d976e8', 
'BUILD_DISPLAY_NAME': '#2710', 'ROOT_BUILD_CAUSE_SCMTRIGGER': 'true', 
'XFILESEARCHPATH': '/usr/dt/app-defaults/%L/Dt', 'MAIL': '/var/mail/jenkins', 
'SSH_CONNECTION': '140.211.11.14 46584 10.128.0.4 22', 
'RUN_CHANGES_DISPLAY_URL': 
'https://builds.apache.org/job/beam_PostCommit_Python_Verify/2710/display/redirect?page=changes',
 'GIT_URL': 'https://github.com/apache/beam.git'}

Processing ./target/.tox/dist/apache-beam-2.2.0.dev.zip
Collecting avro<2.0.0,>=1.8.1 (from apache-beam==2.2.0.dev0)
Collecting crcmod<2.0,>=1.7 (from apache-beam==2.2.0.dev0)
Collecting dill==0.2.6 (from apache-beam==2.2.0.dev0)
Requirement already satisfied: grpcio<2.0,>=1.0 in 
./target/.tox/py27cython/lib/python2.7/site-packages (from 
apache-beam==2.2.0.dev0)
Collecting httplib2<0.10,>=0.8 (from apache-beam==2.2.0.dev0)
Collecting mock<3.0.0,>=1.0.1 (from apache-beam==2.2.0.dev0)
  Using cached mock-2.0.0-py2.py3-none-any.whl
Collecting oauth2client<4.0.0,>=2.0.1 (from apache-beam==2.2.0.dev0)
Requirement already satisfied: protobuf<=3.3.0,>=3.2.0 in 
./target/.tox/py27cython/lib/python2.7/site-packages (from 
apache-beam==2.2.0.dev0)
Collecting pyyaml<4.0.0,>=3.12 (from apache-beam==2.2.0.dev0)
Requirement already satisfied: enum34>=1.0.4 in 
./target/.tox/py27cython/lib/python2.7/site-packages (from 
grpcio<2.0,>=1.0->apache-beam==2.2.0.dev0)
Requirement already satisfied: six>=1.5.2 in 
./target/.tox/py27cython/lib/python2.7/site-packages (from 
grpcio<2.0,>=1.0->apache-beam==2.2.0.dev0)
Requirement already satisfied: futures>=2.2.0 in 
./target/.tox/py27cython/lib/python2.7/site-packages (from 
grpcio<2.0,>=1.0->apache-beam==2.2.0.dev0)
Collecting pbr>=0.11 (from mock<3.0.0,>=1.0.1->apache-beam==2.2.0.dev0)
  Using cached pbr-3.1.1-py2.py3-none-any.whl
Collecting funcsigs>=1; 

[GitHub] beam pull request #3544: Cleanup and fix ptransform_fn decorator.

2017-07-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3544


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] beam git commit: Closes #3544

2017-07-11 Thread robertwb
Closes #3544


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/91c7d3d1
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/91c7d3d1
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/91c7d3d1

Branch: refs/heads/master
Commit: 91c7d3d1f7d72e84e773c1adbffed063aefdff3b
Parents: 8468210 2b86a61
Author: Robert Bradshaw 
Authored: Tue Jul 11 18:08:02 2017 -0700
Committer: Robert Bradshaw 
Committed: Tue Jul 11 18:08:02 2017 -0700

--
 sdks/python/apache_beam/transforms/combiners.py |  8 
 .../apache_beam/transforms/combiners_test.py|  7 +---
 .../python/apache_beam/transforms/ptransform.py | 41 +---
 3 files changed, 28 insertions(+), 28 deletions(-)
--




[1/2] beam git commit: Cleanup and fix ptransform_fn decorator.

2017-07-11 Thread robertwb
Repository: beam
Updated Branches:
  refs/heads/master 84682109b -> 91c7d3d1f


Cleanup and fix ptransform_fn decorator.

Previously CallablePTransform was being used both as the
factory and the transform itself, which could result in state
getting carried between pipelines.


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/2b86a61e
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/2b86a61e
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/2b86a61e

Branch: refs/heads/master
Commit: 2b86a61e5bb07d3bd7f958e124bc8d79dc300c3f
Parents: 8468210
Author: Robert Bradshaw 
Authored: Tue Jul 11 14:32:47 2017 -0700
Committer: Robert Bradshaw 
Committed: Tue Jul 11 18:08:01 2017 -0700

--
 sdks/python/apache_beam/transforms/combiners.py |  8 
 .../apache_beam/transforms/combiners_test.py|  7 +---
 .../python/apache_beam/transforms/ptransform.py | 41 +---
 3 files changed, 28 insertions(+), 28 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/2b86a61e/sdks/python/apache_beam/transforms/combiners.py
--
diff --git a/sdks/python/apache_beam/transforms/combiners.py b/sdks/python/apache_beam/transforms/combiners.py
index fa0742d..875306f 100644
--- a/sdks/python/apache_beam/transforms/combiners.py
+++ b/sdks/python/apache_beam/transforms/combiners.py
@@ -149,6 +149,7 @@ class Top(object):
   """Combiners for obtaining extremal elements."""
   # pylint: disable=no-self-argument
 
+  @staticmethod
   @ptransform.ptransform_fn
   def Of(pcoll, n, compare=None, *args, **kwargs):
     """Obtain a list of the compare-most N elements in a PCollection.
@@ -177,6 +178,7 @@ class Top(object):
     return pcoll | core.CombineGlobally(
         TopCombineFn(n, compare, key, reverse), *args, **kwargs)
 
+  @staticmethod
   @ptransform.ptransform_fn
   def PerKey(pcoll, n, compare=None, *args, **kwargs):
     """Identifies the compare-most N elements associated with each key.
@@ -210,21 +212,25 @@ class Top(object):
     return pcoll | core.CombinePerKey(
         TopCombineFn(n, compare, key, reverse), *args, **kwargs)
 
+  @staticmethod
   @ptransform.ptransform_fn
   def Largest(pcoll, n):
     """Obtain a list of the greatest N elements in a PCollection."""
     return pcoll | Top.Of(n)
 
+  @staticmethod
   @ptransform.ptransform_fn
   def Smallest(pcoll, n):
     """Obtain a list of the least N elements in a PCollection."""
     return pcoll | Top.Of(n, reverse=True)
 
+  @staticmethod
   @ptransform.ptransform_fn
   def LargestPerKey(pcoll, n):
     """Identifies the N greatest elements associated with each key."""
     return pcoll | Top.PerKey(n)
 
+  @staticmethod
   @ptransform.ptransform_fn
   def SmallestPerKey(pcoll, n, reverse=True):
     """Identifies the N least elements associated with each key."""
@@ -369,10 +375,12 @@ class Sample(object):
   """Combiners for sampling n elements without replacement."""
   # pylint: disable=no-self-argument
 
+  @staticmethod
   @ptransform.ptransform_fn
   def FixedSizeGlobally(pcoll, n):
     return pcoll | core.CombineGlobally(SampleCombineFn(n))
 
+  @staticmethod
   @ptransform.ptransform_fn
   def FixedSizePerKey(pcoll, n):
     return pcoll | core.CombinePerKey(SampleCombineFn(n))

http://git-wip-us.apache.org/repos/asf/beam/blob/2b86a61e/sdks/python/apache_beam/transforms/combiners_test.py
--
diff --git a/sdks/python/apache_beam/transforms/combiners_test.py b/sdks/python/apache_beam/transforms/combiners_test.py
index c79fec8..cd2b595 100644
--- a/sdks/python/apache_beam/transforms/combiners_test.py
+++ b/sdks/python/apache_beam/transforms/combiners_test.py
@@ -156,14 +156,11 @@ class CombineTest(unittest.TestCase):
 
   def test_combine_sample_display_data(self):
     def individual_test_per_key_dd(sampleFn, args, kwargs):
-      trs = [beam.CombinePerKey(sampleFn(*args, **kwargs)),
-             beam.CombineGlobally(sampleFn(*args, **kwargs))]
+      trs = [sampleFn(*args, **kwargs)]
       for transform in trs:
         dd = DisplayData.create_from(transform)
         expected_items = [
-            DisplayDataItemMatcher('fn', sampleFn.fn.__name__),
-            DisplayDataItemMatcher('combine_fn',
-                                   transform.fn.__class__)]
+            DisplayDataItemMatcher('fn', transform._fn.__name__)]
         if args:
           expected_items.append(
               DisplayDataItemMatcher('args', str(args)))

http://git-wip-us.apache.org/repos/asf/beam/blob/2b86a61e/sdks/python/apache_beam/transforms/ptransform.py
--
diff --git 

[jira] [Commented] (BEAM-2596) Break up Jenkins PreCommit into individual steps.

2017-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083255#comment-16083255
 ] 

ASF GitHub Bot commented on BEAM-2596:
--

GitHub user jasonkuster opened a pull request:

https://github.com/apache/beam/pull/3545

[BEAM-2596] Pipeline job for Jenkins PreCommit

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [x] Make sure tests pass via `mvn clean verify`.
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jasonkuster/beam pipeline

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3545.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3545


commit 40827bd2f76b8bbebc2ff4022349eff0e5c13b40
Author: Jason Kuster 
Date:   2017-06-28T23:22:52Z

Initial set of pipeline jobs.

Signed-off-by: Jason Kuster 

commit 3cc47ad80624db630678147fa98e040618b91173
Author: Jason Kuster 
Date:   2017-06-29T08:56:50Z

Fixed many build options and configurations.

Signed-off-by: Jason Kuster 

commit 37484977ea004adb3a8051b39cee16ca79962d81
Author: Jason Kuster 
Date:   2017-06-29T09:14:27Z

add code health and integration test items

Signed-off-by: Jason Kuster 

commit b55c56e8ac820b7ae3a9e7ffe18a87184efac6c0
Author: Jason Kuster 
Date:   2017-06-29T09:23:16Z

Stub out Python builds.

Signed-off-by: Jason Kuster 

commit 7e7efa154d8fba4cf44400f97b5e2c64a16577ec
Author: Jason Kuster 
Date:   2017-06-29T23:28:59Z

fix typo, remove python build, start on common job properties

Signed-off-by: Jason Kuster 

commit f00ebe435542f83e40e57de76409f354df3a28c6
Author: Jason Kuster 
Date:   2017-07-11T00:58:11Z

update Python pipelines

Signed-off-by: Jason Kuster 

commit 8e6a06ace9fb9285a15f6c218eaecb0760264fa1
Author: Jason Kuster 
Date:   2017-07-11T01:25:18Z

Reuse common options in common_job_properties

Signed-off-by: Jason Kuster 

commit 79ca7235fa5e4f320985dd7fd17d76ea2673ccdf
Author: Jason Kuster 
Date:   2017-07-11T01:34:24Z

Extract downstream settings into common_job_properties

Signed-off-by: Jason Kuster 

commit 8d474dc7b2664c50b217a2312894f6bc384588f4
Author: Jason Kuster 
Date:   2017-07-11T18:24:14Z

Pick up changes in Java_UnitTest, plus extracted scm into c_j_p.

Signed-off-by: Jason Kuster 

commit b1d47608b78fceb638314dc4a266975390d160c8
Author: Jason Kuster 
Date:   2017-07-11T18:30:50Z

Cut Maven executions down to just what they need.

Signed-off-by: Jason Kuster 

commit 64274296570ca851b5db0a24d7a8a8e6073ea1f2
Author: Jason Kuster 
Date:   2017-07-11T18:58:53Z

fixup! Cut Maven executions down to just what they need.

commit 856ced85b52f43a117ae0cef90a7ac069a32b8d9
Author: Jason Kuster 
Date:   2017-07-12T00:30:30Z

Some additional Maven invocation changes, plus actually error pipeline out.

Signed-off-by: Jason Kuster 

commit e5ff7f1c28929e4a331338babfb6d5737f635b03
Author: Jason Kuster 
Date:   2017-07-12T00:39:27Z

Add license to Pipeline job.

Signed-off-by: Jason Kuster 




> Break up Jenkins PreCommit into individual steps.
> -
>
> Key: BEAM-2596
> URL: https://issues.apache.org/jira/browse/BEAM-2596
> Project: Beam
>  Issue Type: New Feature
>  Components: build-system, testing
>Reporter: Jason Kuster
>Assignee: Jason Kuster
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3545: [BEAM-2596] Pipeline job for Jenkins PreCommit

2017-07-11 Thread jasonkuster
GitHub user jasonkuster opened a pull request:

https://github.com/apache/beam/pull/3545

[BEAM-2596] Pipeline job for Jenkins PreCommit

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [x] Make sure tests pass via `mvn clean verify`.
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jasonkuster/beam pipeline

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3545.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3545


commit 40827bd2f76b8bbebc2ff4022349eff0e5c13b40
Author: Jason Kuster 
Date:   2017-06-28T23:22:52Z

Initial set of pipeline jobs.

Signed-off-by: Jason Kuster 

commit 3cc47ad80624db630678147fa98e040618b91173
Author: Jason Kuster 
Date:   2017-06-29T08:56:50Z

Fixed many build options and configurations.

Signed-off-by: Jason Kuster 

commit 37484977ea004adb3a8051b39cee16ca79962d81
Author: Jason Kuster 
Date:   2017-06-29T09:14:27Z

add code health and integration test items

Signed-off-by: Jason Kuster 

commit b55c56e8ac820b7ae3a9e7ffe18a87184efac6c0
Author: Jason Kuster 
Date:   2017-06-29T09:23:16Z

Stub out Python builds.

Signed-off-by: Jason Kuster 

commit 7e7efa154d8fba4cf44400f97b5e2c64a16577ec
Author: Jason Kuster 
Date:   2017-06-29T23:28:59Z

fix typo, remove python build, start on common job properties

Signed-off-by: Jason Kuster 

commit f00ebe435542f83e40e57de76409f354df3a28c6
Author: Jason Kuster 
Date:   2017-07-11T00:58:11Z

update Python pipelines

Signed-off-by: Jason Kuster 

commit 8e6a06ace9fb9285a15f6c218eaecb0760264fa1
Author: Jason Kuster 
Date:   2017-07-11T01:25:18Z

Reuse common options in common_job_properties

Signed-off-by: Jason Kuster 

commit 79ca7235fa5e4f320985dd7fd17d76ea2673ccdf
Author: Jason Kuster 
Date:   2017-07-11T01:34:24Z

Extract downstream settings into common_job_properties

Signed-off-by: Jason Kuster 

commit 8d474dc7b2664c50b217a2312894f6bc384588f4
Author: Jason Kuster 
Date:   2017-07-11T18:24:14Z

Pick up changes in Java_UnitTest, plus extracted scm into c_j_p.

Signed-off-by: Jason Kuster 

commit b1d47608b78fceb638314dc4a266975390d160c8
Author: Jason Kuster 
Date:   2017-07-11T18:30:50Z

Cut Maven executions down to just what they need.

Signed-off-by: Jason Kuster 

commit 64274296570ca851b5db0a24d7a8a8e6073ea1f2
Author: Jason Kuster 
Date:   2017-07-11T18:58:53Z

fixup! Cut Maven executions down to just what they need.

commit 856ced85b52f43a117ae0cef90a7ac069a32b8d9
Author: Jason Kuster 
Date:   2017-07-12T00:30:30Z

Some additional Maven invocation changes, plus actually error pipeline out.

Signed-off-by: Jason Kuster 

commit e5ff7f1c28929e4a331338babfb6d5737f635b03
Author: Jason Kuster 
Date:   2017-07-12T00:39:27Z

Add license to Pipeline job.

Signed-off-by: Jason Kuster 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-2596) Break up Jenkins PreCommit into individual steps.

2017-07-11 Thread Jason Kuster (JIRA)
Jason Kuster created BEAM-2596:
--

 Summary: Break up Jenkins PreCommit into individual steps.
 Key: BEAM-2596
 URL: https://issues.apache.org/jira/browse/BEAM-2596
 Project: Beam
  Issue Type: New Feature
  Components: build-system, testing
Reporter: Jason Kuster
Assignee: Jason Kuster






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Spark #2611

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-995) Apache Pig DSL

2017-07-11 Thread James Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083235#comment-16083235
 ] 

James Xu commented on BEAM-995:
---

I think the former is better: the execution engine for a Beam program is 
Spark, Storm, etc., and it is also the approach Beam SQL is taking.



> Apache Pig DSL
> --
>
> Key: BEAM-995
> URL: https://issues.apache.org/jira/browse/BEAM-995
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-ideas
>Reporter: Jean-Baptiste Onofré
>Assignee: Jean-Baptiste Onofré
>
> Apache Pig is still popular and the language is not so large.
> Providing a DSL using the Pig language would potentially allow more people to 
> use Beam (at least during a transition period).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Flink #3377

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Updated] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-11 Thread Andrea Pierleoni (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrea Pierleoni updated BEAM-2595:
---
Description: 
I am trying to use the new `WriteToBigQuery` PTransform added to 
`apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.

I need to write to a BigQuery table with nested fields.
The only way to specify nested schemas in BigQuery is with the JSON schema.
None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the JSON 
schema, but they accept a schema as an instance of the class 
`apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.

I am composing the `TableFieldSchema` as suggested here 
[https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
 and it looks fine when passed to the PTransform `WriteToBigQuery`.

The problem is that the base class `PTransformWithSideInputs` tries to pickle and 
unpickle the function 
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
(which includes the TableFieldSchema instance), and for some reason when the 
class is unpickled some `FieldList` instances are converted to simple lists, and 
the pickling validation fails.

Would it be possible to extend the test coverage to nested JSON objects for 
BigQuery?
They are also relatively easy to parse into a TableFieldSchema.


  was:
I am trying to use the new `WriteToBigQuery` PTransform added to 
`apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.

I need to write to a BigQuery table with nested fields.
The only way to specify nested schemas in BigQuery is with the JSON schema.
None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the JSON 
schema, but they accept a schema as an instance of the class 
`apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.

I am composing the `TableFieldSchema` as suggested 
[here](https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436),
 and it looks fine when passed to the PTransform `WriteToBigQuery`.

The problem is that the base class `PTransformWithSideInputs` tries to [pickle 
and unpickle the 
function](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515)
(which includes the TableFieldSchema instance), and for some reason when the 
class is unpickled some `FieldList` instances are converted to simple lists, and 
the pickling validation fails.

Would it be possible to extend the test coverage to nested JSON objects for 
BigQuery?
They are also relatively easy to parse into a TableFieldSchema.



> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Thomas Groh
>Priority: Minor
>  Labels: gcp
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
> I need to write to a BigQuery table with nested fields.
> The only way to specify nested schemas in BigQuery is with the JSON schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> JSON schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
> and it looks fine when passed to the PTransform `WriteToBigQuery`.
> The problem is that the base class `PTransformWithSideInputs` tries to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
> (which includes the TableFieldSchema instance), and for some reason when the 
> class is unpickled some `FieldList` instances are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested JSON objects for 
> BigQuery?
> They are also relatively easy to parse into a TableFieldSchema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-11 Thread Andrea Pierleoni (JIRA)
Andrea Pierleoni created BEAM-2595:
--

 Summary: WriteToBigQuery does not work with nested json schema
 Key: BEAM-2595
 URL: https://issues.apache.org/jira/browse/BEAM-2595
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Affects Versions: 2.1.0
 Environment: mac os local runner, Python
Reporter: Andrea Pierleoni
Assignee: Thomas Groh
Priority: Minor


I am trying to use the new `WriteToBigQuery` PTransform added to 
`apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.

I need to write to a BigQuery table with nested fields.
The only way to specify nested schemas in BigQuery is with the JSON schema.
None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the JSON 
schema, but they accept a schema as an instance of the class 
`apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.

I am composing the `TableFieldSchema` as suggested 
[here](https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436),
 and it looks fine when passed to the PTransform `WriteToBigQuery`.

The problem is that the base class `PTransformWithSideInputs` tries to [pickle 
and unpickle the 
function](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515)
(which includes the TableFieldSchema instance), and for some reason when the 
class is unpickled some `FieldList` instances are converted to simple lists, and 
the pickling validation fails.

Would it be possible to extend the test coverage to nested JSON objects for 
BigQuery?
They are also relatively easy to parse into a TableFieldSchema.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Spark #2610

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2511) TextIO should support reading a PCollection of filenames

2017-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083168#comment-16083168
 ] 

ASF GitHub Bot commented on BEAM-2511:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3443


> TextIO should support reading a PCollection of filenames
> 
>
> Key: BEAM-2511
> URL: https://issues.apache.org/jira/browse/BEAM-2511
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> Motivation and proposed implementation in https://s.apache.org/textio-sdf



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Flink #3376

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Closed] (BEAM-2511) TextIO should support reading a PCollection of filenames

2017-07-11 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2511.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> TextIO should support reading a PCollection of filenames
> 
>
> Key: BEAM-2511
> URL: https://issues.apache.org/jira/browse/BEAM-2511
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
> Fix For: 2.2.0
>
>
> Motivation and proposed implementation in https://s.apache.org/textio-sdf



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Build failed in Jenkins: beam_PostCommit_Java_MavenInstall #4354

2017-07-11 Thread Apache Jenkins Server
See 


Changes:

[kirpichov] Adds TextIO.readAll(), implemented rather naively

--
[...truncated 350.35 KB...]
2017-07-11T23:29:22.756 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/com/google/protobuf/protobuf-java-util/3.2.0/protobuf-java-util-3.2.0.pom
2017-07-11T23:29:22.795 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/com/google/protobuf/protobuf-java-util/3.2.0/protobuf-java-util-3.2.0.pom
 (5 KB at 106.4 KB/sec)
2017-07-11T23:29:22.798 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/com/google/code/gson/gson/2.7/gson-2.7.pom
2017-07-11T23:29:22.831 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/com/google/code/gson/gson/2.7/gson-2.7.pom 
(2 KB at 42.8 KB/sec)
2017-07-11T23:29:22.833 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/com/google/code/gson/gson-parent/2.7/gson-parent-2.7.pom
2017-07-11T23:29:22.864 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/com/google/code/gson/gson-parent/2.7/gson-parent-2.7.pom
 (4 KB at 116.5 KB/sec)
2017-07-11T23:29:22.866 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/io/grpc/grpc-protobuf-lite/1.2.0/grpc-protobuf-lite-1.2.0.pom
2017-07-11T23:29:22.897 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/io/grpc/grpc-protobuf-lite/1.2.0/grpc-protobuf-lite-1.2.0.pom
 (3 KB at 69.1 KB/sec)
2017-07-11T23:29:22.899 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/io/grpc/grpc-stub/1.2.0/grpc-stub-1.2.0.pom
2017-07-11T23:29:22.924 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/io/grpc/grpc-stub/1.2.0/grpc-stub-1.2.0.pom
 (3 KB at 81.2 KB/sec)
[JENKINS] Archiving disabled
2017-07-11T23:29:25.142 [INFO]  
   
2017-07-11T23:29:25.142 [INFO] 

2017-07-11T23:29:25.142 [INFO] Skipping Apache Beam :: Parent
2017-07-11T23:29:25.142 [INFO] This project has been banned from the build due 
to previous failures.
2017-07-11T23:29:25.142 [INFO] 

[JENKINS] Archiving disabled
[...repeated "[JENKINS] Archiving disabled" lines truncated...]
2017-07-11T23:29:36.024 [INFO] 

2017-07-11T23:29:36.024 [INFO] Reactor Summary:
2017-07-11T23:29:36.024 [INFO] 
2017-07-11T23:29:36.024 [INFO] Apache Beam :: Parent 
.. SUCCESS [ 33.269 s]
2017-07-11T23:29:36.024 [INFO] Apache Beam :: SDKs :: Java :: Build Tools 
. SUCCESS [ 11.871 s]
2017-07-11T23:29:36.024 [INFO] Apache Beam :: SDKs 
 SUCCESS [  5.460 s]
2017-07-11T23:29:36.024 [INFO] Apache Beam :: SDKs :: Common 
.. SUCCESS [  5.603 s]

2017-07-11T23:29:36.024 [INFO] Apache Beam :: SDKs :: Common :: Runner API 
 SUCCESS [ 30.841 s]
2017-07-11T23:29:36.024 [INFO] Apache Beam :: SDKs :: Common :: Fn API 
 FAILURE [  0.811 s]
2017-07-11T23:29:36.024 [INFO] Apache Beam :: SDKs :: Java 
 SKIPPED
2017-07-11T23:29:36.024 [INFO] Apache Beam :: SDKs :: Java :: Core 
 SKIPPED
2017-07-11T23:29:36.024 [INFO] Apache Beam :: Runners 
. SKIPPED
2017-07-11T23:29:36.024 [INFO] Apache Beam :: Runners :: Core Construction Java 
... SKIPPED
2017-07-11T23:29:36.024 [INFO] Apache Beam :: Runners :: Core Java 
 SKIPPED
2017-07-11T23:29:36.024 [INFO] Apache Beam :: 

[GitHub] beam pull request #3443: [BEAM-2511] Implements TextIO.ReadAll

2017-07-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3443


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] beam git commit: This closes #3443: [BEAM-2511] Implements TextIO.ReadAll

2017-07-11 Thread jkff
This closes #3443: [BEAM-2511] Implements TextIO.ReadAll


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/84682109
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/84682109
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/84682109

Branch: refs/heads/master
Commit: 84682109b9ff186f8c850a0b71de5163949c2769
Parents: 011e279 cd216f7
Author: Eugene Kirpichov 
Authored: Tue Jul 11 16:12:40 2017 -0700
Committer: Eugene Kirpichov 
Committed: Tue Jul 11 16:12:40 2017 -0700

--
 ...ndedSplittableProcessElementInvokerTest.java |   2 +-
 .../core/SplittableParDoProcessFnTest.java  |   2 +-
 .../DataflowPipelineTranslatorTest.java |   2 +-
 .../apache/beam/sdk/io/CompressedSource.java|  40 ++--
 .../apache/beam/sdk/io/OffsetBasedSource.java   |  22 +-
 .../java/org/apache/beam/sdk/io/TextIO.java | 230 +--
 .../apache/beam/sdk/io/range/OffsetRange.java   | 101 
 .../beam/sdk/io/range/OffsetRangeTracker.java   |   3 +
 .../transforms/splittabledofn/OffsetRange.java  |  77 ---
 .../splittabledofn/OffsetRangeTracker.java  |   1 +
 .../java/org/apache/beam/sdk/io/TextIOTest.java |  62 +++--
 .../beam/sdk/transforms/SplittableDoFnTest.java |   2 +-
 .../splittabledofn/OffsetRangeTrackerTest.java  |   1 +
 13 files changed, 387 insertions(+), 158 deletions(-)
--




[1/2] beam git commit: Adds TextIO.readAll(), implemented rather naively

2017-07-11 Thread jkff
Repository: beam
Updated Branches:
  refs/heads/master 011e2796d -> 84682109b


Adds TextIO.readAll(), implemented rather naively


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/cd216f79
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/cd216f79
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/cd216f79

Branch: refs/heads/master
Commit: cd216f796bebf78101dce7ab6387f3db9b839fc7
Parents: 011e279
Author: Eugene Kirpichov 
Authored: Fri Jun 23 18:02:10 2017 -0700
Committer: Eugene Kirpichov 
Committed: Tue Jul 11 16:06:41 2017 -0700

--
 ...ndedSplittableProcessElementInvokerTest.java |   2 +-
 .../core/SplittableParDoProcessFnTest.java  |   2 +-
 .../DataflowPipelineTranslatorTest.java |   2 +-
 .../apache/beam/sdk/io/CompressedSource.java|  40 ++--
 .../apache/beam/sdk/io/OffsetBasedSource.java   |  22 +-
 .../java/org/apache/beam/sdk/io/TextIO.java | 230 +--
 .../apache/beam/sdk/io/range/OffsetRange.java   | 101 
 .../beam/sdk/io/range/OffsetRangeTracker.java   |   3 +
 .../transforms/splittabledofn/OffsetRange.java  |  77 ---
 .../splittabledofn/OffsetRangeTracker.java  |   1 +
 .../java/org/apache/beam/sdk/io/TextIOTest.java |  62 +++--
 .../beam/sdk/transforms/SplittableDoFnTest.java |   2 +-
 .../splittabledofn/OffsetRangeTrackerTest.java  |   1 +
 13 files changed, 387 insertions(+), 158 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/cd216f79/runners/core-java/src/test/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvokerTest.java
--
diff --git 
a/runners/core-java/src/test/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvokerTest.java
 
b/runners/core-java/src/test/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvokerTest.java
index a2f6acc..b80a632 100644
--- 
a/runners/core-java/src/test/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvokerTest.java
+++ 
b/runners/core-java/src/test/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvokerTest.java
@@ -25,10 +25,10 @@ import static org.junit.Assert.assertThat;
 
 import java.util.Collection;
 import java.util.concurrent.Executors;
+import org.apache.beam.sdk.io.range.OffsetRange;
 import org.apache.beam.sdk.options.PipelineOptionsFactory;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.reflect.DoFnInvokers;
-import org.apache.beam.sdk.transforms.splittabledofn.OffsetRange;
 import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
 import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
 import org.apache.beam.sdk.transforms.windowing.GlobalWindow;

http://git-wip-us.apache.org/repos/asf/beam/blob/cd216f79/runners/core-java/src/test/java/org/apache/beam/runners/core/SplittableParDoProcessFnTest.java
--
diff --git 
a/runners/core-java/src/test/java/org/apache/beam/runners/core/SplittableParDoProcessFnTest.java
 
b/runners/core-java/src/test/java/org/apache/beam/runners/core/SplittableParDoProcessFnTest.java
index 9543de8..1cd1275 100644
--- 
a/runners/core-java/src/test/java/org/apache/beam/runners/core/SplittableParDoProcessFnTest.java
+++ 
b/runners/core-java/src/test/java/org/apache/beam/runners/core/SplittableParDoProcessFnTest.java
@@ -39,11 +39,11 @@ import org.apache.beam.sdk.coders.BigEndianIntegerCoder;
 import org.apache.beam.sdk.coders.Coder;
 import org.apache.beam.sdk.coders.InstantCoder;
 import org.apache.beam.sdk.coders.SerializableCoder;
+import org.apache.beam.sdk.io.range.OffsetRange;
 import org.apache.beam.sdk.testing.TestPipeline;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.DoFnTester;
 import org.apache.beam.sdk.transforms.splittabledofn.HasDefaultTracker;
-import org.apache.beam.sdk.transforms.splittabledofn.OffsetRange;
 import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
 import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
 import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

http://git-wip-us.apache.org/repos/asf/beam/blob/cd216f79/runners/google-cloud-dataflow-java/src/test/java/org/apache/beam/runners/dataflow/DataflowPipelineTranslatorTest.java
--
diff --git 
a/runners/google-cloud-dataflow-java/src/test/java/org/apache/beam/runners/dataflow/DataflowPipelineTranslatorTest.java
 

[jira] [Commented] (BEAM-2430) Java FnApiDoFnRunner to share across runners

2017-07-11 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083138#comment-16083138
 ] 

Kenneth Knowles commented on BEAM-2430:
---

Ah, OK. From within the Java SDK harness, the DoFnRunner probably doesn't need 
to know about the Fn API. So I got confused because the code in the PR called 
it a FnApiDoFnRunner. This ticket is, indeed, for a FnApi\[Client\]DoFnRunner.

> Java FnApiDoFnRunner to share across runners
> 
>
> Key: BEAM-2430
> URL: https://issues.apache.org/jira/browse/BEAM-2430
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: beam-python-everywhere
> Fix For: 2.1.0
>
>
> As the portability framework comes into focus, let's fill out the support 
> code for making it easy to onboard a new runner.
> There is some amount of using the Fn API that has to do only with the fact 
> that a runner is implemented in Java, and is not specific to that runner. 
> This should be part of the runners-core library, and designed so that a 
> runner can set it up however it likes, and just pass elements without having 
> to explicitly manage things like requests, responses, protos, and coders.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2591) Python shim for submitting to FlinkRunner

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-2591:
--
Component/s: runner-flink

> Python shim for submitting to FlinkRunner
> -
>
> Key: BEAM-2591
> URL: https://issues.apache.org/jira/browse/BEAM-2591
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink, sdk-py
>Reporter: Kenneth Knowles
>Assignee: Sourabh Bajaj
>  Labels: beam-python-everywhere
>
> Whatever the result of https://s.apache.org/beam-job-api, Python users will 
> need to be able to pass --runner=FlinkRunner and have it work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Dmitry Demeshchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083108#comment-16083108
 ] 

Dmitry Demeshchuk edited comment on BEAM-2572 at 7/11/17 10:39 PM:
---

For example, in my particular case, it would be something like

{code}
read_credentials = beam.io.aws.Credentials(aws_access_key_id='...', 
aws_secret_access_key='...', region='...')
write_credentials = beam.io.aws.Credentials(aws_access_key_id='...', 
aws_secret_access_key='...', region='...')

(p
   | 'Read' >> beam.io.aws.ReadFromS3('s3://mybucket/mykey', 
credentials=read_credentials)
   | 'Transform' >> beam.Map(lambda line: line + '!!!')
   | 'Write' >> beam.io.aws.WriteToS3('s3://myotherbucket/myotherkey', 
credentials=write_credentials)
)
{code}

I actually like this interface even better, but it may mean writing a lot of 
FileSystem-like code from scratch :(


was (Author: demeshchuk):
For example, in my particular case, it would be something like

{code}
credentials = beam.io.aws.Credentials(aws_access_key_id='...', 
aws_secret_access_key='...', region='...')
(p
   | 'Read' >> beam.io.aws.ReadFromS3('s3://mybucket/mykey', 
credentials=credentials)
   | 'Transform' >> beam.Map(lambda line: line + '!!!')
   | 'Write' >> beam.io.aws.WriteToS3('s3://myotherbucket/myotherkey', 
credentials=credentials)
)
{code}

I actually like this interface even better, but it may mean writing a lot of 
FileSystem-like code from scratch :(

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Dmitry Demeshchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083101#comment-16083101
 ] 

Dmitry Demeshchuk commented on BEAM-2572:
-

[~robertwb]: yes, by "hidden" I meant not exposing some options in the UI 
(similar to https://issues.apache.org/jira/browse/BEAM-2492)

To your first comment, I don't fully understand if "construction" means 
construction of each PTransform object, or whether it means construction of the 
whole pipeline (meaning, applying each PTransform to it). My initial idea was 
to somehow pass the pipeline options during the application phase.

Also, I like your point about having different credentials for different 
stages, but in that case I'd rather supply credentials exactly at construction 
time to each individual PTransform. For example, {{p | 
beam.io.aws.ReadFromDynamoDB(query='...', credentials=...)}}. However, as 
multiple people pointed out before, this will be poorly compatible with the 
filesystems approach.

Maybe the answer here is to have filesystems that require explicit 
authentication use a different interface, not {{ReadFromText}} and 
{{WriteToText}}?

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083088#comment-16083088
 ] 

Robert Bradshaw commented on BEAM-2572:
---

It should be noted that there was recently a huge effort to *remove* the access 
of PipelineOptions from *construction* in Java, e.g. 
https://issues.apache.org/jira/browse/BEAM-827?jql=text%20~%20%22Remove%20PipelineOptions%20from%20construction%22
 However, making filesystems aware of pipeline options/credentials (at 
instantiation? via a method from the FileSystem baseclass?) does make a lot of 
sense. One thing to keep in mind is that it would be good to be able to scope 
access (e.g. if different credentials are required to read/write different 
sets of files in the same pipeline).

Could you clarify what is meant by "hidden?" Presumably this is data that you 
don't want accidentally exposed (e.g. on the monitoring UI)?


> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Dmitry Demeshchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083113#comment-16083113
 ] 

Dmitry Demeshchuk commented on BEAM-2572:
-

Or, alternatively, we could state that filesystem sources and sinks may 
actually have an optional authentication parameter, since quite a few 
filesystems generally require authentication.
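
A rough sketch of what that could look like (both {{beam.io.aws.Credentials}} 
and the {{credentials}} argument on {{ReadFromText}} below are hypothetical, 
following the sketches earlier in this thread, not an existing API):

{code}
import apache_beam as beam

# Hypothetical: a file-based transform grows an optional credentials
# argument that gets forwarded to the FileSystem matching the scheme.
credentials = beam.io.aws.Credentials(aws_access_key_id='...',
                                      aws_secret_access_key='...',
                                      region='...')

lines = p | beam.io.ReadFromText('s3://mybucket/mykey',
                                 credentials=credentials)
{code}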

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Dmitry Demeshchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083108#comment-16083108
 ] 

Dmitry Demeshchuk commented on BEAM-2572:
-

For example, in my particular case, it would be something like

{code}
credentials = beam.io.aws.Credentials(aws_access_key_id='...', 
aws_secret_access_key='...', region='...')
(p
   | 'Read' >> beam.io.aws.ReadFromS3('s3://mybucket/mykey', 
credentials=credentials)
   | 'Transform' >> beam.Map(lambda line: line + '!!!')
   | 'Write' >> beam.io.aws.WriteToS3('s3://myotherbucket/myotherkey', 
credentials=credentials)
)
{code}

I actually like this interface even better, but it may mean writing a lot of 
FileSystem-like code from scratch :(

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (BEAM-2430) Java FnApiDoFnRunner to share across runners

2017-07-11 Thread Luke Cwik (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik reopened BEAM-2430:
-
  Assignee: Kenneth Knowles  (was: Luke Cwik)

PR/3432 was about writing a DoFnRunner for executing DoFns within the SDK 
harness and has very little to do with creating a DoFnRunner that exercises the 
Fn API (registration, process bundle, transmit data, ...) to tell an SDK 
harness to do some meaningful work over the Fn API.

> Java FnApiDoFnRunner to share across runners
> 
>
> Key: BEAM-2430
> URL: https://issues.apache.org/jira/browse/BEAM-2430
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: beam-python-everywhere
> Fix For: 2.1.0
>
>
> As the portability framework comes into focus, let's fill out the support 
> code for making it easy to onboard a new runner.
> There is some amount of using the Fn API that has to do only with the fact 
> that a runner is implemented in Java, and is not specific to that runner. 
> This should be part of the runners-core library, and designed so that a 
> runner can set it up however it likes, and just pass elements without having 
> to explicitly manage things like requests, responses, protos, and coders.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2594) Python shim for submitting to Java DirectRunner

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-2594:
--
Component/s: (was: runner-spark)
 runner-direct

> Python shim for submitting to Java DirectRunner
> ---
>
> Key: BEAM-2594
> URL: https://issues.apache.org/jira/browse/BEAM-2594
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-direct, sdk-py
>Reporter: Kenneth Knowles
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: beam-python-everywhere
>
> Whatever the result of https://s.apache.org/beam-job-api, it is an 
> interesting exercise to also be able to submit to the Java DirectRunner. With 
> cross-language submission and mixed-language pipelines, which DirectRunner is 
> best for debugging problems may vary. Kind of a funky space to be in!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2430) Java FnApiDoFnRunner to share across runners

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-2430:
-

Assignee: Luke Cwik

> Java FnApiDoFnRunner to share across runners
> 
>
> Key: BEAM-2430
> URL: https://issues.apache.org/jira/browse/BEAM-2430
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Luke Cwik
>  Labels: beam-python-everywhere
> Fix For: 2.1.0
>
>
> As the portability framework comes into focus, let's fill out the support 
> code for making it easy to onboard a new runner.
> There is some amount of using the Fn API that has to do only with the fact 
> that a runner is implemented in Java, and is not specific to that runner. 
> This should be part of the runners-core library, and designed so that a 
> runner can set it up however it likes, and just pass elements without having 
> to explicitly manage things like requests, responses, protos, and coders.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Spark #2609

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2566) Java SDK harness should not depend on any runner

2017-07-11 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083080#comment-16083080
 ] 

Kenneth Knowles commented on BEAM-2566:
---

[~robertwb] any chance you are a good candidate to do similar things with the 
Python harness?

> Java SDK harness should not depend on any runner
> 
>
> Key: BEAM-2566
> URL: https://issues.apache.org/jira/browse/BEAM-2566
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Luke Cwik
>
> Right now there is a dependency on the Dataflow runner. I believe this is 
> legacy due to using {{CloudObject}} temporarily but I do not claim to 
> understand the full nature of the dependency.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2593) Python shim for submitting to SparkRunner

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2593:
-

 Summary: Python shim for submitting to SparkRunner
 Key: BEAM-2593
 URL: https://issues.apache.org/jira/browse/BEAM-2593
 Project: Beam
  Issue Type: New Feature
  Components: runner-spark, sdk-py
Reporter: Kenneth Knowles
Assignee: Sourabh Bajaj


Whatever the result of https://s.apache.org/beam-job-api, Python users will 
need to be able to pass --runner=SparkRunner and have it work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Flink #3375

2017-07-11 Thread Apache Jenkins Server
See 




[jira] [Created] (BEAM-2589) ApexRunner shim for Job API

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2589:
-

 Summary: ApexRunner shim for Job API
 Key: BEAM-2589
 URL: https://issues.apache.org/jira/browse/BEAM-2589
 Project: Beam
  Issue Type: New Feature
  Components: runner-apex
Reporter: Kenneth Knowles
Assignee: Thomas Weise


Whatever the result of https://s.apache.org/beam-job-api we will need a way for 
the JVM-based ApexRunner to receive and run pipelines authored in Python.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2588) FlinkRunner shim for serving Job API

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2588:
-

 Summary: FlinkRunner shim for serving Job API
 Key: BEAM-2588
 URL: https://issues.apache.org/jira/browse/BEAM-2588
 Project: Beam
  Issue Type: New Feature
  Components: runner-flink
Reporter: Kenneth Knowles
Assignee: Aljoscha Krettek


Whatever the result of https://s.apache.org/beam-job-api we will need a way for 
the JVM-based FlinkRunner to receive and run pipelines authored in Python.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2590) SparkRunner shim for Job API

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2590:
-

 Summary: SparkRunner shim for Job API
 Key: BEAM-2590
 URL: https://issues.apache.org/jira/browse/BEAM-2590
 Project: Beam
  Issue Type: New Feature
  Components: runner-spark
Reporter: Kenneth Knowles
Assignee: Amit Sela


Whatever the result of https://s.apache.org/beam-job-api we will need a way for 
the JVM-based SparkRunner to receive and run pipelines authored in Python.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3544: Cleanup and fix ptransform_fn decorator.

2017-07-11 Thread robertwb
GitHub user robertwb opened a pull request:

https://github.com/apache/beam/pull/3544

Cleanup and fix ptransform_fn decorator.

Previously CallablePTransform was being used both as the
factory and the transform itself, which could result in state
getting carried between pipelines.
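
For context, a small sketch of how the decorator is typically used (the 
transform below is made up; the point is that each call site should get a 
fresh transform instance rather than sharing factory state):

{code}
import apache_beam as beam
from apache_beam.transforms.ptransform import ptransform_fn

@ptransform_fn
def FilterShorterThan(pcoll, n):
    """Function-style composite transform built via ptransform_fn."""
    return pcoll | beam.Filter(lambda word: len(word) >= n)

# Usage: each application should create its own transform under the hood.
# long_words = words | FilterShorterThan(5)
{code}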

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`.
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/robertwb/incubator-beam ptransform_fn

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3544.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3544


commit ead2b60329ffccf31e2a493af0011ccd56a12a85
Author: Robert Bradshaw 
Date:   2017-07-11T21:32:47Z

Cleanup and fix ptransform_fn decorator.

Previously CallablePTransform was being used both as the
factory and the transform itself, which could result in state
getting carried between pipelines.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-2594) Python shim for submitting to Java DirectRunner

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2594:
-

 Summary: Python shim for submitting to Java DirectRunner
 Key: BEAM-2594
 URL: https://issues.apache.org/jira/browse/BEAM-2594
 Project: Beam
  Issue Type: New Feature
  Components: runner-spark, sdk-py
Reporter: Kenneth Knowles
Assignee: Sourabh Bajaj
Priority: Minor


Whatever the result of https://s.apache.org/beam-job-api, it is an interesting 
exercise to also be able to submit to the Java DirectRunner. With 
cross-language submission and mixed-language pipelines, which DirectRunner is 
best for debugging problems may vary. Kind of a funky space to be in!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2592) Python shim for submitting to ApexRunner

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2592:
-

 Summary: Python shim for submitting to ApexRunner
 Key: BEAM-2592
 URL: https://issues.apache.org/jira/browse/BEAM-2592
 Project: Beam
  Issue Type: New Feature
  Components: runner-apex, sdk-py
Reporter: Kenneth Knowles
Assignee: Sourabh Bajaj


Whatever the result of https://s.apache.org/beam-job-api, Python users will 
need to be able to pass --runner=ApexRunner and have it work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2591) Python shim for submitting to FlinkRunner

2017-07-11 Thread Kenneth Knowles (JIRA)
Kenneth Knowles created BEAM-2591:
-

 Summary: Python shim for submitting to FlinkRunner
 Key: BEAM-2591
 URL: https://issues.apache.org/jira/browse/BEAM-2591
 Project: Beam
  Issue Type: New Feature
  Components: sdk-py
Reporter: Kenneth Knowles
Assignee: Sourabh Bajaj


Whatever the result of https://s.apache.org/beam-job-api, Python users will 
need to be able to pass --runner=FlinkRunner and have it work.
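
For illustration, the user-facing goal is roughly the following (this does not 
work today; making the runner name resolvable from the Python SDK is exactly 
what the shim would provide):

{code}
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--runner=FlinkRunner'])
with beam.Pipeline(options=options) as p:
    p | beam.Create(['a', 'b']) | beam.Map(lambda x: x.upper())
{code}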



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3542: Remove dead (and wrong) viewFromProto overload

2017-07-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3542


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[1/2] beam git commit: Remove dead (and wrong) viewFromProto overload

2017-07-11 Thread kenn
Repository: beam
Updated Branches:
  refs/heads/master a22f1a05a -> 011e2796d


Remove dead (and wrong) viewFromProto overload


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/aaffe15e
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/aaffe15e
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/aaffe15e

Branch: refs/heads/master
Commit: aaffe15ee891b5656e448eac4bd3a7ff72eee315
Parents: 138641f
Author: Kenneth Knowles 
Authored: Tue Jul 11 10:09:12 2017 -0700
Committer: Kenneth Knowles 
Committed: Tue Jul 11 14:11:39 2017 -0700

--
 .../core/construction/ParDoTranslation.java | 21 
 .../core/construction/ParDoTranslationTest.java |  2 +-
 2 files changed, 1 insertion(+), 22 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/aaffe15e/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ParDoTranslation.java
--
diff --git 
a/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ParDoTranslation.java
 
b/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ParDoTranslation.java
index 90c9aad..03f29ff 100644
--- 
a/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ParDoTranslation.java
+++ 
b/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ParDoTranslation.java
@@ -41,7 +41,6 @@ import java.util.List;
 import java.util.Map;
 import java.util.Set;
 import 
org.apache.beam.runners.core.construction.PTransformTranslation.TransformPayloadTranslator;
-import org.apache.beam.sdk.Pipeline;
 import org.apache.beam.sdk.coders.Coder;
 import org.apache.beam.sdk.coders.IterableCoder;
 import org.apache.beam.sdk.common.runner.v1.RunnerApi;
@@ -509,26 +508,6 @@ public class ParDoTranslation {
 return builder.build();
   }
 
-  public static PCollectionView viewFromProto(
-  Pipeline pipeline,
-  SideInput sideInput,
-  String localName,
-  RunnerApi.PTransform parDoTransform,
-  Components components)
-  throws IOException {
-
-String pCollectionId = parDoTransform.getInputsOrThrow(localName);
-
-// This may be a PCollection defined in another language, but we should be
-// able to rehydrate it enough to stick it in a side input. The coder may 
not
-// be grokkable in Java.
-PCollection pCollection =
-PCollectionTranslation.fromProto(
-pipeline, components.getPcollectionsOrThrow(pCollectionId), 
components);
-
-return viewFromProto(sideInput, localName, pCollection, parDoTransform, 
components);
-  }
-
   /**
* Create a {@link PCollectionView} from a side input spec and an 
already-deserialized {@link
* PCollection} that should be wired up.

http://git-wip-us.apache.org/repos/asf/beam/blob/aaffe15e/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ParDoTranslationTest.java
--
diff --git 
a/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ParDoTranslationTest.java
 
b/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ParDoTranslationTest.java
index 6fdf9d6..a87a16d 100644
--- 
a/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ParDoTranslationTest.java
+++ 
b/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ParDoTranslationTest.java
@@ -162,9 +162,9 @@ public class ParDoTranslationTest {
 SideInput sideInput = 
parDoPayload.getSideInputsOrThrow(view.getTagInternal().getId());
 PCollectionView restoredView =
 ParDoTranslation.viewFromProto(
-rehydratedPipeline,
 sideInput,
 view.getTagInternal().getId(),
+view.getPCollection(),
 protoTransform,
 protoComponents);
 assertThat(restoredView.getTagInternal(), 
equalTo(view.getTagInternal()));



[2/2] beam git commit: This closes #3542: Remove dead (and wrong) viewFromProto overload

2017-07-11 Thread kenn
This closes #3542: Remove dead (and wrong) viewFromProto overload


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/011e2796
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/011e2796
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/011e2796

Branch: refs/heads/master
Commit: 011e2796df140b54e33c719b72e176da5d5d24dd
Parents: a22f1a0 aaffe15
Author: Kenneth Knowles 
Authored: Tue Jul 11 14:27:10 2017 -0700
Committer: Kenneth Knowles 
Committed: Tue Jul 11 14:27:10 2017 -0700

--
 .../core/construction/ParDoTranslation.java | 21 
 .../core/construction/ParDoTranslationTest.java |  2 +-
 2 files changed, 1 insertion(+), 22 deletions(-)
--




[jira] [Resolved] (BEAM-2430) Java FnApiDoFnRunner to share across runners

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-2430.
---
   Resolution: Implemented
Fix Version/s: 2.1.0

> Java FnApiDoFnRunner to share across runners
> 
>
> Key: BEAM-2430
> URL: https://issues.apache.org/jira/browse/BEAM-2430
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Luke Cwik
>  Labels: beam-python-everywhere
> Fix For: 2.1.0
>
>
> As the portability framework comes into focus, let's fill out the support 
> code for making it easy to onboard a new runner.
> There is some amount of using the Fn API that has to do only with the fact 
> that a runner is implemented in Java, and is not specific to that runner. 
> This should be part of the runners-core library, and designed so that a 
> runner can set it up however it likes, and just pass elements without having 
> to explicitly manage things like requests, responses, protos, and coders.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2371) Make Java DirectRunner demonstrate language-agnostic Runner API translation wrappers

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-2371:
--
Labels: beam-python-everywhere  (was: )

> Make Java DirectRunner demonstrate language-agnostic Runner API translation 
> wrappers
> 
>
> Key: BEAM-2371
> URL: https://issues.apache.org/jira/browse/BEAM-2371
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-direct
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: beam-python-everywhere
> Fix For: 2.2.0
>
>
> This will complete the PoC for runners-core-construction-java and the Runner 
> API and show other runners the easy path to executing non-Java pipelines, 
> modulo Fn API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2431) Model Runner interactions in RPC layer for Runner API

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-2431:
--
Labels: beam-python-everywhere  (was: )

> Model Runner interactions in RPC layer for Runner API
> -
>
> Key: BEAM-2431
> URL: https://issues.apache.org/jira/browse/BEAM-2431
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model-runner-api
>Reporter: Kenneth Knowles
>Assignee: Sourabh Bajaj
>  Labels: beam-python-everywhere
>
> The "Runner API" today is actually just a definition of what constitutes a 
> Beam pipeline. It needs to actually be a (very small) API.
> This would allow e.g. a Java-based job launcher to respond to launch requests 
> and state queries from a Python-based adapter.
> The expected API would be something like a distillation of the APIs for 
> PipelineRunner and PipelineResult (which is really "Job") via analyzing how 
> these both look in Java and Python.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2430) Java FnApiDoFnRunner to share across runners

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-2430:
--
Labels: beam-python-everywhere  (was: )

> Java FnApiDoFnRunner to share across runners
> 
>
> Key: BEAM-2430
> URL: https://issues.apache.org/jira/browse/BEAM-2430
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-core
>Reporter: Kenneth Knowles
>  Labels: beam-python-everywhere
>
> As the portability framework comes into focus, let's fill out the support 
> code for making it easy to onboard a new runner.
> There is some amount of using the Fn API that has to do only with the fact 
> that a runner is implemented in Java, and is not specific to that runner. 
> This should be part of the runners-core library, and designed so that a 
> runner can set it up however it likes, and just pass elements without having 
> to explicitly manage things like requests, responses, protos, and coders.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2571) Flink ValidatesRunner failing CombineTest.testSlidingWindowsCombineWithContext

2017-07-11 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082463#comment-16082463
 ] 

Kenneth Knowles edited comment on BEAM-2571 at 7/11/17 9:23 PM:


OK, I was wrong. Dataflow and Beam agree. We had an experiment going on with 
unrelated problems.


was (Author: kenn):
OK, I was wrong. Dataflow and Beam agree. We had an experiment going on where 
we swapped < for <= also :-)

> Flink ValidatesRunner failing CombineTest.testSlidingWindowsCombineWithContext
> --
>
> Key: BEAM-2571
> URL: https://issues.apache.org/jira/browse/BEAM-2571
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Kenneth Knowles
>Assignee: Aljoscha Krettek
> Fix For: 2.1.0
>
>
> This appears to have been caused by https://github.com/apache/beam/pull/3429 
> which fixes a couple errors in how trigger timers were processed / final 
> panes labeled.
> I am investigating, considering roll back vs forward fix. Since it is an 
> esoteric use case where I would advise users to use a stateful DoFn instead, 
> I think the bug fixed probably outweighs the bug introduced. I would like to 
> fix for 2.1.0 but will report back soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082887#comment-16082887
 ] 

Sourabh Bajaj edited comment on BEAM-2572 at 7/11/17 8:16 PM:
--

I think overall this plan makes sense but step 4 might be of unclear difficulty 
here as it depends on the runners propagating the pipeline options to all the 
workers correctly. It would be great to start a thread on the dev mailing list 
around this because we don't have a very clear story for credential passing to 
PTransforms yet.


was (Author: sb2nov):
I think overall this plan makes sense but step 4 might be of unclear difficulty 
here as it depends on the runners propagating the pipeline options to all the 
workers correctly. It would be great to start a thread around this because we 
don't have a very clear story for credential passing to PTransforms yet.

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK

2017-07-11 Thread Ahmet Altay (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082872#comment-16082872
 ] 

Ahmet Altay commented on BEAM-2573:
---

Thank you [~demeshchuk]. I agree with documenting this feature, although it 
should be documented post-release in order not to confuse 2.0.0 users.

cc: [~melap]

> Better filesystem discovery mechanism in Python SDK
> ---
>
> Key: BEAM-2573
> URL: https://issues.apache.org/jira/browse/BEAM-2573
> Project: Beam
>  Issue Type: Task
>  Components: runner-dataflow, sdk-py
>Affects Versions: 2.0.0
>Reporter: Dmitry Demeshchuk
>Priority: Minor
>
> It looks like right now custom filesystem classes have to be imported 
> explicitly: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
> Seems like the current implementation doesn't allow discovering filesystems 
> that come from side packages, not from apache_beam itself. Even if I put a 
> custom FileSystem-inheriting class into a package and explicitly import it in 
> the root __init__.py of that package, it still doesn't make the class 
> discoverable.
> The problems I'm experiencing happen on Dataflow runner, while Direct runner 
> works just fine. Here's an example of Dataflow output:
> {code}
>   (320418708fe777d7): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 581, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py",
>  line 54, in start
> self.output(windowed_value)
>   File "dataflow_worker/operations.py", line 138, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5768)
> def output(self, windowed_value, output_index=0):
>   File "dataflow_worker/operations.py", line 139, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5654)
> cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value)
>   File "dataflow_worker/operations.py", line 72, in 
> dataflow_worker.operations.ConsumerSet.receive 
> (dataflow_worker/operations.c:3506)
> cython.cast(Operation, consumer).process(windowed_value)
>   File "dataflow_worker/operations.py", line 328, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:11162)
> with self.scoped_process_state:
>   File "dataflow_worker/operations.py", line 329, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:6)
> self.dofn_receiver.receive(o)
>   File "apache_beam/runners/common.py", line 382, in 
> apache_beam.runners.common.DoFnRunner.receive 
> (apache_beam/runners/common.c:10156)
> self.process(windowed_value)
>   File "apache_beam/runners/common.py", line 390, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10458)
> self._reraise_augmented(exn)
>   File "apache_beam/runners/common.py", line 431, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented 
> (apache_beam/runners/common.c:11673)
> raise new_exn, None, original_traceback
>   File "apache_beam/runners/common.py", line 388, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10371)
> self.do_fn_invoker.invoke_process(windowed_value)
>   File "apache_beam/runners/common.py", line 281, in 
> apache_beam.runners.common.PerWindowInvoker.invoke_process 
> (apache_beam/runners/common.c:8626)
> self._invoke_per_window(windowed_value)
>   File "apache_beam/runners/common.py", line 307, in 
> apache_beam.runners.common.PerWindowInvoker._invoke_per_window 
> (apache_beam/runners/common.c:9302)
> windowed_value, self.process_method(*args_for_process))
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/transforms/core.py",
>  line 749, in 
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/io/iobase.py",
>  line 891, in 
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py",
>  line 109, in _f
> return fnc(self, *args, **kwargs)
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", 
> line 146, in initialize_write
> tmp_dir = self._create_temp_dir(file_path_prefix)
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", 
> line 151, in _create_temp_dir
> base_path, last_component = FileSystems.split(file_path_prefix)
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 
> 99, in split
> filesystem = 

[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK

2017-07-11 Thread Dmitry Demeshchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082850#comment-16082850
 ] 

Dmitry Demeshchuk commented on BEAM-2573:
-

Finally, after fighting some dependency issues, I made it work and confirmed 
that it works on master as expected! Thanks a lot for all the help, folks!

I guess, since {{beam_plugins}} are the only sane solution to this problem 
overall, we can just make sure to reflect that in 
https://beam.apache.org/documentation/io/authoring-overview/ and call it a day?

> Better filesystem discovery mechanism in Python SDK
> ---
>
> Key: BEAM-2573
> URL: https://issues.apache.org/jira/browse/BEAM-2573
> Project: Beam
>  Issue Type: Task
>  Components: runner-dataflow, sdk-py
>Affects Versions: 2.0.0
>Reporter: Dmitry Demeshchuk
>Priority: Minor
>
> It looks like right now custom filesystem classes have to be imported 
> explicitly: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
> Seems like the current implementation doesn't allow discovering filesystems 
> that come from side packages, not from apache_beam itself. Even if I put a 
> custom FileSystem-inheriting class into a package and explicitly import it in 
> the root __init__.py of that package, it still doesn't make the class 
> discoverable.
> The problems I'm experiencing happen on Dataflow runner, while Direct runner 
> works just fine. Here's an example of Dataflow output:
> {code}
>   (320418708fe777d7): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 581, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py",
>  line 54, in start
> self.output(windowed_value)
>   File "dataflow_worker/operations.py", line 138, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5768)
> def output(self, windowed_value, output_index=0):
>   File "dataflow_worker/operations.py", line 139, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5654)
> cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value)
>   File "dataflow_worker/operations.py", line 72, in 
> dataflow_worker.operations.ConsumerSet.receive 
> (dataflow_worker/operations.c:3506)
> cython.cast(Operation, consumer).process(windowed_value)
>   File "dataflow_worker/operations.py", line 328, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:11162)
> with self.scoped_process_state:
>   File "dataflow_worker/operations.py", line 329, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:6)
> self.dofn_receiver.receive(o)
>   File "apache_beam/runners/common.py", line 382, in 
> apache_beam.runners.common.DoFnRunner.receive 
> (apache_beam/runners/common.c:10156)
> self.process(windowed_value)
>   File "apache_beam/runners/common.py", line 390, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10458)
> self._reraise_augmented(exn)
>   File "apache_beam/runners/common.py", line 431, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented 
> (apache_beam/runners/common.c:11673)
> raise new_exn, None, original_traceback
>   File "apache_beam/runners/common.py", line 388, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10371)
> self.do_fn_invoker.invoke_process(windowed_value)
>   File "apache_beam/runners/common.py", line 281, in 
> apache_beam.runners.common.PerWindowInvoker.invoke_process 
> (apache_beam/runners/common.c:8626)
> self._invoke_per_window(windowed_value)
>   File "apache_beam/runners/common.py", line 307, in 
> apache_beam.runners.common.PerWindowInvoker._invoke_per_window 
> (apache_beam/runners/common.c:9302)
> windowed_value, self.process_method(*args_for_process))
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/transforms/core.py",
>  line 749, in 
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/io/iobase.py",
>  line 891, in 
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py",
>  line 109, in _f
> return fnc(self, *args, **kwargs)
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", 
> line 146, in initialize_write
> tmp_dir = self._create_temp_dir(file_path_prefix)
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", 
> line 151, in 

[jira] [Created] (BEAM-2587) Build fails due to python sdk

2017-07-11 Thread Ahmet Altay (JIRA)
Ahmet Altay created BEAM-2587:
-

 Summary: Build fails due to python sdk
 Key: BEAM-2587
 URL: https://issues.apache.org/jira/browse/BEAM-2587
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Affects Versions: 2.1.0
Reporter: Ahmet Altay


Build fails with the following errors when {{mvn clean package}} is used on a 
clean Ubuntu 16.04 LTS machine with pip 8.x. The issue is resolved when pip is 
upgraded to pip 9.x.

"RuntimeError: Not in apache git tree; unable to find proto definitions."
"DistutilsOptionError: can't combine user with prefix, exec_prefix/home, or 
install_(plat)base"

We need to understand the issue and maybe add a note about requiring pip 9.x 
for development. Note that this does not affect end users using prepackaged 
artifacts from central repositories.

cc: [~robertwb]

Script for reproduction:

{code}
#!/bin/bash

set -e

readonly MACHINE_ID=$(hexdump -n 1 -e '"%x"' /dev/random)
readonly MACHINE="${USER}-beam-build-${MACHINE_ID}"
readonly ZONE="us-central1-c"

# provision building machine
echo "Provisioning Build Machine (Ubuntu 16.04 LTS)"
gcloud compute instances create "$MACHINE" \
  --zone="$ZONE" \
  --image-project="ubuntu-os-cloud" \
  --image-family="ubuntu-1604-lts"

# wait for ssh to be ready
echo "Waiting for machine to finish booting"
sleep 30

# ssh into the machine
# 1. install dependencies as specified by beam readme
# 2. download beam source from github
# 3. build with maven
echo "Downloading and building Apache Beam (release-2.1.0)"
gcloud compute ssh "$MACHINE" --zone="$ZONE" << EOF
sudo apt-get --assume-yes update
sudo apt-get --assume-yes install \
openjdk-8-jdk \
maven \
python-setuptools \
python-pip
wget https://github.com/apache/beam/archive/release-2.1.0.tar.gz
tar -xzf release-2.1.0.tar.gz
cd beam-release-2.1.0
mvn clean package
EOF
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is back to stable : beam_PostCommit_Java_MavenInstall #4352

2017-07-11 Thread Apache Jenkins Server




[jira] [Updated] (BEAM-2585) add math function RAND_INTEGER and RAND

2017-07-11 Thread Xu Mingmin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Mingmin updated BEAM-2585:
-
Issue Type: Sub-task  (was: Task)
Parent: BEAM-2160

> add math function RAND_INTEGER and RAND
> ---
>
> Key: BEAM-2585
> URL: https://issues.apache.org/jira/browse/BEAM-2585
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: dsl_sql_merge
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2562) Add integration test for logical operators

2017-07-11 Thread Xu Mingmin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Mingmin reassigned BEAM-2562:


Assignee: Xu Mingmin

> Add integration test for logical operators
> --
>
> Key: BEAM-2562
> URL: https://issues.apache.org/jira/browse/BEAM-2562
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: James Xu
>Assignee: Xu Mingmin
>  Labels: dsl_sql_merge
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2586) Accommodate custom delimiters in TextIO

2017-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082716#comment-16082716
 ] 

ASF GitHub Bot commented on BEAM-2586:
--

GitHub user ChristophHebert opened a pull request:

https://github.com/apache/beam/pull/3543

[BEAM-2586] Accommodate custom delimiters in TextIO



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ChristophHebert/beam modifiedTextIO

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3543


commit 6277f69120da04e69189baf7b20940fc414dbbb8
Author: Chris Hebert 
Date:   2017-07-11T17:54:56Z

Accommodate custom delimiters in TextIO




> Accommodate custom delimiters in TextIO
> ---
>
> Key: BEAM-2586
> URL: https://issues.apache.org/jira/browse/BEAM-2586
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Christopher Hebert
>Assignee: Davor Bonaci
>Priority: Minor
>
> We frequently process text files delimited by something other than newlines, 
> including files delimited only by end of file.
> First option:
> When we want to delimit by commas (or something else), we could use TextIO to 
> read line by line and apply a transform to split each line on commas. When we 
> want to delimit by whole file, we could combine the elements of the 
> PCollection output from TextIO that come from the same file into one element.
> Second option:
> As an alternative to complicating (and slowing) our pipelines with the 
> methods above, we could write custom FileBasedSources for each use case.
> Third option:
> Preferably, we'd like to generalize TextIO to accept delimiters other than 
> the defaults: \n, \r, \r\n.
> I'll attach a pull request showing how we envision this generalization of 
> TextIO.
> If this is not the direction Beam would like to go with TextIO, then we'll 
> stick to maintaining our own TextIO or our own FileBasedSources to achieve 
> this functionality.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2585) add math function RAND_INTEGER and RAND

2017-07-11 Thread Xu Mingmin (JIRA)
Xu Mingmin created BEAM-2585:


 Summary: add math function RAND_INTEGER and RAND
 Key: BEAM-2585
 URL: https://issues.apache.org/jira/browse/BEAM-2585
 Project: Beam
  Issue Type: Task
  Components: dsl-sql
Reporter: Xu Mingmin
Assignee: Xu Mingmin






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3543: [BEAM-2586] Accommodate custom delimiters in TextIO

2017-07-11 Thread ChristophHebert
GitHub user ChristophHebert opened a pull request:

https://github.com/apache/beam/pull/3543

[BEAM-2586] Accommodate custom delimiters in TextIO



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ChristophHebert/beam modifiedTextIO

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3543


commit 6277f69120da04e69189baf7b20940fc414dbbb8
Author: Chris Hebert 
Date:   2017-07-11T17:54:56Z

Accommodate custom delimiters in TextIO




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-2586) Accommodate custom delimiters in TextIO

2017-07-11 Thread Christopher Hebert (JIRA)
Christopher Hebert created BEAM-2586:


 Summary: Accommodate custom delimiters in TextIO
 Key: BEAM-2586
 URL: https://issues.apache.org/jira/browse/BEAM-2586
 Project: Beam
  Issue Type: New Feature
  Components: sdk-java-core
Reporter: Christopher Hebert
Assignee: Davor Bonaci
Priority: Minor


We frequently process text files delimited by something other than newlines, 
including files delimited only by end of file.

First option:
When we want to delimit by commas (or something else), we could use TextIO to 
read line by line and apply a transform to split each line on commas. When we 
want to delimit by whole file, we could combine the elements of the 
PCollection output from TextIO that come from the same file into one element.

Second option:
As an alternative to complicating (and slowing) our pipelines with the methods 
above, we could write custom FileBasedSources for each use case.

Third option:
Preferably, we'd like to generalize TextIO to accept delimiters other than the 
defaults: \n, \r, \r\n.

I'll attach a pull request showing how we envision this generalization of 
TextIO.

If this is not the direction Beam would like to go with TextIO, then we'll 
stick to maintaining our own TextIO or our own FileBasedSources to achieve this 
functionality.
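
For illustration, a minimal sketch of what the third option could look like to 
a pipeline author, assuming a hypothetical {{withDelimiter}} method on TextIO's 
read transform as proposed in the attached pull request (the input path is 
made up):

{code}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class CustomDelimiterSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read records split on '|' instead of the default \n, \r, \r\n.
    // withDelimiter is the proposed API, not an existing TextIO method.
    PCollection<String> records =
        p.apply("ReadPipeDelimited",
            TextIO.read()
                .from("gs://example-bucket/input.txt") // made-up path
                .withDelimiter(new byte[] {'|'}));

    p.run().waitUntilFinish();
  }
}
{code}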



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Spark #2608

2017-07-11 Thread Apache Jenkins Server




[jira] [Assigned] (BEAM-2563) Add integration test for math operators

2017-07-11 Thread Xu Mingmin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Mingmin reassigned BEAM-2563:


Assignee: Xu Mingmin

> Add integration test for math operators
> ---
>
> Key: BEAM-2563
> URL: https://issues.apache.org/jira/browse/BEAM-2563
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: James Xu
>Assignee: Xu Mingmin
>  Labels: dsl_sql_merge
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build became unstable: beam_PostCommit_Java_MavenInstall #4351

2017-07-11 Thread Apache Jenkins Server




Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Flink #3374

2017-07-11 Thread Apache Jenkins Server




[jira] [Resolved] (BEAM-115) Beam Runner API

2017-07-11 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-115.
--
   Resolution: Resolved
Fix Version/s: Not applicable

I think this umbrella ticket is no longer adding value. The Runner API is a 
subproject with lots of threads.

> Beam Runner API
> ---
>
> Key: BEAM-115
> URL: https://issues.apache.org/jira/browse/BEAM-115
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model-runner-api
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
> Fix For: Not applicable
>
>
> The PipelineRunner API from the SDK is not ideal for the Beam technical 
> vision.
> It has technical limitations:
>  - The user's DAG (even including library expansions) is never explicitly 
> represented, so it cannot be analyzed except incrementally, and cannot 
> necessarily be reconstructed (for example, to display it!).
>  - The flattened DAG of just primitive transforms isn't well-suited for 
> display or transform override.
>  - The TransformHierarchy isn't well-suited for optimizations.
>  - The user must realistically pre-commit to a runner, and its configuration 
> (batch vs streaming) prior to graph construction, since the runner will be 
> modifying the graph as it is built.
>  - It is fairly language- and SDK-specific.
> It has usability issues (these are not from intuition, but derived from 
> actual cases where users failed to use the API as designed):
>  - The interleaving of apply() methods in PTransform/Pipeline/PipelineRunner 
> is confusing.
>  - The TransformHierarchy, accessible only via visitor traversals, is 
> cumbersome.
>  - The staging of construction-time vs run-time is not always obvious.
> These are just examples. This ticket tracks designing, coming to consensus, 
> and building an API that more simply and directly supports the technical 
> vision.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Spark #2607

2017-07-11 Thread Apache Jenkins Server




[GitHub] beam pull request #3542: Remove dead (and wrong) viewFromProto overload

2017-07-11 Thread kennknowles
GitHub user kennknowles opened a pull request:

https://github.com/apache/beam/pull/3542

Remove dead (and wrong) viewFromProto overload

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`.
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.pdf).

---

@tgroh I failed to actually delete this unused method in resolving 
conflicts on #3509.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/beam viewFromProto

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3542.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3542


commit 5f8e4e1fca957c140d4bea62929d91bbe4dd8693
Author: Kenneth Knowles 
Date:   2017-07-11T17:09:12Z

Remove dead (and wrong) viewFromProto overload




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-2584) Create a JCache based IO

2017-07-11 Thread JIRA
Ismaël Mejía created BEAM-2584:
--

 Summary: Create a JCache based IO
 Key: BEAM-2584
 URL: https://issues.apache.org/jira/browse/BEAM-2584
 Project: Beam
  Issue Type: New Feature
  Components: sdk-java-extensions
Affects Versions: Not applicable
Reporter: Ismaël Mejía
Assignee: Davor Bonaci
Priority: Minor


Given the current work to support IOs for in-memory data stores like Redis and 
Memcached, it probably makes sense to have a connector for the common API that 
Java has for this type of system (JCache, JSR-107); with this we can support 
other in-memory grids like Apache Ignite, Hazelcast, etc.
Note: I am not 100% sure if the API has a generic way to discover at least 
per-server splits, but this is worth exploring.
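
For context, a minimal sketch of the JCache (JSR-107) API such a connector 
would be built on, assuming some JCache provider (e.g. Ignite or Hazelcast) is 
on the classpath; the cache name is made up:

{code}
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class JCacheSketch {
  public static void main(String[] args) {
    // Resolves whichever JCache provider is on the classpath.
    CacheManager manager = Caching.getCachingProvider().getCacheManager();

    // "events" is a hypothetical cache name, for illustration only.
    Cache<String, String> cache =
        manager.createCache("events",
            new MutableConfiguration<String, String>()
                .setTypes(String.class, String.class));

    cache.put("key", "value");

    // A JCache-based IO would read by iterating entries like this; whether
    // the API exposes per-server splits is the open question noted above.
    for (Cache.Entry<String, String> entry : cache) {
      System.out.println(entry.getKey() + "=" + entry.getValue());
    }
  }
}
{code}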



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Flink #3373

2017-07-11 Thread Apache Jenkins Server




[jira] [Commented] (BEAM-2558) Add a CombineFnTester

2017-07-11 Thread Thomas Groh (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082503#comment-16082503
 ] 

Thomas Groh commented on BEAM-2558:
---

The linked file is also only in the testing JAR, so users can't use it.

> Add a CombineFnTester
> -
>
> Key: BEAM-2558
> URL: https://issues.apache.org/jira/browse/BEAM-2558
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Thomas Groh
>Assignee: Thomas Groh
>
> This should be in the style of {{CoderProperties}} or {{EqualsTester}}: the 
> user should provide some inputs, and it should exercise potential paths by 
> which those elements may be accumulated and ensure that they all produce the 
> same results.
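
For illustration, a sketch of how such a tester might be invoked, assuming a 
static {{CombineFnTester.testCombineFn(fn, inputs, expected)}} entry point in 
the style of {{CoderProperties}} (the name and signature are an assumption, 
not a settled API):

{code}
import java.util.Arrays;
// Assumed location and shape of the proposed tester.
import org.apache.beam.sdk.testing.CombineFnTester;
import org.apache.beam.sdk.transforms.Sum;

public class CombineFnTesterSketch {
  public static void main(String[] args) {
    // Exercises several accumulation orders and shardings of the same inputs
    // and asserts that every path combines to the expected result.
    CombineFnTester.testCombineFn(
        Sum.ofIntegers(), Arrays.asList(1, 2, 3, 4, 5), 15);
  }
}
{code}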



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2584) Create a JCache based IO

2017-07-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/BEAM-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reassigned BEAM-2584:
--

Assignee: Ismaël Mejía  (was: Davor Bonaci)

> Create a JCache based IO
> 
>
> Key: BEAM-2584
> URL: https://issues.apache.org/jira/browse/BEAM-2584
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Affects Versions: Not applicable
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
>
> Given the current work to support IOs for in-memory data stores like Redis 
> and Memcached, it probably makes sense to have a connector for the common API 
> that Java has for this type of system (JCache, JSR-107); with this we can 
> support other in-memory grids like Apache Ignite, Hazelcast, etc.
> Note: I am not 100% sure if the API has a generic way to discover at least 
> per-server splits, but this is worth exploring.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2556) Client-side throttling for Datastore connector

2017-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082489#comment-16082489
 ] 

ASF GitHub Bot commented on BEAM-2556:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3507


> Client-side throttling for Datastore connector
> --
>
> Key: BEAM-2556
> URL: https://issues.apache.org/jira/browse/BEAM-2556
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-gcp
>Reporter: Colin Phipps
>Assignee: Colin Phipps
>Priority: Minor
>  Labels: datastore
>
> The Datastore connector currently has exponential backoff on errors, which is 
> good. But it does not do any other throttling of its write load in response 
> to errors; once a request succeeds, it resumes writing as quickly as it can.
> Write loads will be more stable and more likely to complete if the client 
> throttles itself in the event that it receives high rates of errors from the 
> Datastore service; specifically 
> https://landing.google.com/sre/book/chapters/handling-overload.html#client-side-throttling-a7sYUg
>  is a technique that Google has had success with on other services.
> We (Datastore) have a patch in progress to add this behaviour to the 
> connector.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2583) Exceptions in expand() corrupt the TransformHierarchy

2017-07-11 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082486#comment-16082486
 ] 

Kenneth Knowles commented on BEAM-2583:
---

Is it a bug to catch an exception and continue to use the pipeline? I would say 
that is a completely normal thing to do, unless we explicitly document against 
it. And even docs are a thin excuse - hence I think we should make the Pipeline 
forcibly unusable.

> Exceptions in expand() corrupt the TransformHierarchy
> -
>
> Key: BEAM-2583
> URL: https://issues.apache.org/jira/browse/BEAM-2583
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Priority: Minor
>
> Discovered recently during unrelated work that when a composite PTransform 
> (Java) crashes during expand - for example due to a validation failure - it 
> leaves the Pipeline in a state that throws NPEs when it is traversed.
> No one has complained, so it doesn't seem critical to fix the mutation logic. 
> But we should set a bit and poison all future uses of the corrupted pipeline.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3507: [BEAM-2556] Add client-side throttling.

2017-07-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3507


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[1/2] beam git commit: Add client-side throttling.

2017-07-11 Thread lcwik
Repository: beam
Updated Branches:
  refs/heads/master 138641f14 -> a22f1a05a


Add client-side throttling.

The approach used is as described in
https://landing.google.com/sre/book/chapters/handling-overload.html#client-side-throttling-a7sYUg
. By backing off individual workers in response to high error rates, we relieve
pressure on the Datastore service, increasing the chance that the workload can
complete successfully.

The exported cumulativeThrottledSeconds could also be used as an autoscaling
signal in future.
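
For reference, a small sketch of the throttling decision described in that 
chapter: each worker tracks recent requests and accepts, and rejects a new 
request with probability max(0, (requests - K * accepts) / (requests + 1)), 
where K is the overload ratio. This paraphrases the SRE-book formula; the 
AdaptiveThrottler in this commit additionally tracks the counts over a moving 
time window.

{code}
import java.util.Random;

// Sketch only; not the code added in this commit.
class ThrottlerSketch {
  private long requests; // requests attempted recently
  private long accepts;  // requests that succeeded recently
  private final double overloadRatio; // "K" in the formula
  private final Random random = new Random();

  ThrottlerSketch(double overloadRatio) {
    this.overloadRatio = overloadRatio;
  }

  /** Returns true if this request should be throttled (not sent). */
  boolean throttleRequest() {
    double rejectProbability =
        Math.max(0, (requests - overloadRatio * accepts) / (requests + 1));
    return random.nextDouble() < rejectProbability;
  }

  void recordRequest(boolean succeeded) {
    requests++;
    if (succeeded) {
      accepts++;
    }
  }
}
{code}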


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/435cbcfa
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/435cbcfa
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/435cbcfa

Branch: refs/heads/master
Commit: 435cbcfae67c3a2bc8a72437fbdf7350fe5ac10a
Parents: 138641f
Author: Colin Phipps 
Authored: Mon Jun 26 13:34:19 2017 +
Committer: Luke Cwik 
Committed: Tue Jul 11 09:40:11 2017 -0700

--
 .../sdk/io/gcp/datastore/AdaptiveThrottler.java | 103 +
 .../beam/sdk/io/gcp/datastore/DatastoreV1.java  |  25 -
 .../io/gcp/datastore/AdaptiveThrottlerTest.java | 111 +++
 3 files changed, 238 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/435cbcfa/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/AdaptiveThrottler.java
--
diff --git 
a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/AdaptiveThrottler.java
 
b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/AdaptiveThrottler.java
new file mode 100644
index 000..ce6ebe6
--- /dev/null
+++ 
b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/AdaptiveThrottler.java
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.gcp.datastore;
+
+import com.google.common.annotations.VisibleForTesting;
+import java.util.Random;
+import org.apache.beam.sdk.transforms.Sum;
+import org.apache.beam.sdk.util.MovingFunction;
+
+
+/**
+ * An implementation of client-side adaptive throttling. See
+ * 
https://landing.google.com/sre/book/chapters/handling-overload.html#client-side-throttling-a7sYUg
+ * for a full discussion of the use case and algorithm applied.
+ */
+class AdaptiveThrottler {
+  private final MovingFunction successfulRequests;
+  private final MovingFunction allRequests;
+
+  /** The target ratio between requests sent and successful requests. This is 
"K" in the formula in
+   * https://landing.google.com/sre/book/chapters/handling-overload.html */
+  private final double overloadRatio;
+
+  /** The target minimum number of requests per samplePeriodMs, even if no 
requests succeed. Must be
+   * greater than 0, else we could throttle to zero. Because every decision is 
probabilistic, there
+   * is no guarantee that the request rate in any given interval will not be 
zero. (This is the +1
+   * from the formula in 
https://landing.google.com/sre/book/chapters/handling-overload.html */
+  private static final double MIN_REQUESTS = 1;
+  private final Random random;
+
+  /**
+   * @param samplePeriodMs the time window to keep of request history to 
inform throttling
+   * decisions.
+   * @param sampleUpdateMs the length of buckets within this time window.
+   * @param overloadRatio the target ratio between requests sent and 
successful requests. You should
+   * always set this to more than 1, otherwise the client would never try to 
send more requests than
+   * succeeded in the past - so it could never recover from temporary setbacks.
+   */
+  public AdaptiveThrottler(long samplePeriodMs, long sampleUpdateMs,
+  double overloadRatio) {
+this(samplePeriodMs, sampleUpdateMs, overloadRatio, new Random());
+  }
+
+  @VisibleForTesting
+  AdaptiveThrottler(long samplePeriodMs, long sampleUpdateMs,
+  double 

[2/2] beam git commit: [BEAM-2556] Add client-side throttling.

2017-07-11 Thread lcwik
[BEAM-2556] Add client-side throttling.

This closes #3507


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/a22f1a05
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/a22f1a05
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/a22f1a05

Branch: refs/heads/master
Commit: a22f1a05a2d4d4ff643a7cb175542a635a8b8c02
Parents: 138641f 435cbcf
Author: Luke Cwik 
Authored: Tue Jul 11 09:40:16 2017 -0700
Committer: Luke Cwik 
Committed: Tue Jul 11 09:40:16 2017 -0700

--
 .../sdk/io/gcp/datastore/AdaptiveThrottler.java | 103 +
 .../beam/sdk/io/gcp/datastore/DatastoreV1.java  |  25 -
 .../io/gcp/datastore/AdaptiveThrottlerTest.java | 111 +++
 3 files changed, 238 insertions(+), 1 deletion(-)
--




[jira] [Commented] (BEAM-2583) Exceptions in expand() corrupt the TransformHierarchy

2017-07-11 Thread Thomas Groh (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082464#comment-16082464
 ] 

Thomas Groh commented on BEAM-2583:
---

Specifically, this only happens if, during a call to {{expand}}, the call 
throws an exception; that exception is caught or otherwise handled, and then 
the user attempts to continue to use the pipeline. This is very much a bug in 
the user's code, but it may fail in a non-intuitive way (e.g. via a 
NullPointerException).
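
For illustration, a minimal sketch of the user-code pattern that triggers 
this, assuming a made-up composite transform whose {{expand}} throws during 
validation:

{code}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;

public class CorruptedPipelineSketch {
  // Hypothetical composite that fails validation while expanding.
  static class FailsValidation
      extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> expand(PCollection<String> input) {
      throw new IllegalArgumentException("invalid configuration");
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    PCollection<String> input = p.apply(Create.of("a", "b"));
    try {
      input.apply(new FailsValidation()); // expand() throws mid-construction
    } catch (IllegalArgumentException e) {
      // Swallowing the exception and continuing is the bug described above:
      // the TransformHierarchy is now in a corrupted state...
    }
    p.run(); // ...so traversing the pipeline here can fail with an NPE.
  }
}
{code}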

> Exceptions in expand() corrupt the TransformHierarchy
> -
>
> Key: BEAM-2583
> URL: https://issues.apache.org/jira/browse/BEAM-2583
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Priority: Minor
>
> Discovered recently during unrelated work that when a composite PTransform 
> (Java) crashes during expand - for example due to a validation failure - it 
> leaves the Pipeline in a state that throws NPEs when it is traversed.
> No one has complained, so it doesn't seem critical to fix the mutation logic. 
> But we should set a bit and poison all future uses of the corrupted pipeline.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


  1   2   >