[jira] [Comment Edited] (BEAM-2500) Add support for S3 as a Apache Beam FileSystem

2017-09-13 Thread Jacob Marble (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165391#comment-16165391
 ] 

Jacob Marble edited comment on BEAM-2500 at 9/14/17 4:57 AM:
-

I'm interested in implementing S3 support. Not being familiar with Beam internals, 
and without committing myself to anything, perhaps someone can comment on my 
research notes.

GCS is probably a good template. Implement FileSystem, ResourceId, 
FileSystemRegistrar, PipelineOptions, PipelineOptionsRegistrar:
https://github.com/apache/beam/tree/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp

For interacting with S3, this is probably the preferred SDK:
http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3

Some specifics about implementing FileSystem:

FileSystem.copy()
- AmazonS3Client.copyObject(String sourceBucketName, String sourceKey, String 
destinationBucketName, String destinationKey)
- max size per single call is 5GB, which is probably fine to start, but we'd need 
multipart upload to get the full 5TB limit

FileSystem.create()
- AmazonS3Client.putObject(String bucketName, String key, InputStream 
input, ObjectMetadata metadata)
- max size per single call is 5GB, which is probably fine to start, but we'd need 
multipart upload to get the full 5TB limit

FileSystem.delete()
- AmazonS3Client.deleteObjects(DeleteObjectsRequest deleteObjectsRequest)

FileSystem.getScheme()
- return "s3"

FileSystem.match()
- org.apache.beam.sdk.extensions.util.gcsfs.GcsPath and GcsUtil from the same 
package have some good ideas

FileSystem.matchNewResource()
- Look at GcsPath and GcsUtil

FileSystem.open()
- AmazonS3Client.getObject(String bucketName, String key)

FileSystem.rename()
- Can't find anything in AmazonS3Client; perhaps call FileSystem.copy(), then 
FileSystem.delete()
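
To make the copy/rename notes above concrete, here is a rough sketch against the 
aws-java-sdk-s3 client; only the AmazonS3 calls are real, the wrapper class and 
method names are placeholders rather than a proposed design:

{code}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import java.io.InputStream;

// Sketch only: S3 has no native rename, so it is modeled as copyObject()
// followed by deleteObject(). The single-call copy/put forms cap out at 5GB;
// larger objects would need the multipart APIs.
public class S3FileOperationsSketch {
  private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

  public void copy(String srcBucket, String srcKey, String dstBucket, String dstKey) {
    s3.copyObject(srcBucket, srcKey, dstBucket, dstKey);
  }

  public void rename(String srcBucket, String srcKey, String dstBucket, String dstKey) {
    copy(srcBucket, srcKey, dstBucket, dstKey);
    s3.deleteObject(srcBucket, srcKey);
  }

  public InputStream open(String bucket, String key) {
    S3Object object = s3.getObject(bucket, key); // would back FileSystem.open()
    return object.getObjectContent();
  }
}
{code}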

I'm not clear about how to register the s3 FileSystem as mentioned in the 
FileSystemRegistrar Javadoc:

"FileSystem creators have the ability to provide a registrar by creating a 
ServiceLoader entry and a concrete implementation of this interface.

It is optional but recommended to use one of the many build time tools such as 
AutoService to generate the necessary META-INF files automatically."
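
For what it's worth, a minimal sketch of that registration, mirroring how the GCS 
registrar does it; the S3FileSystem class referenced below is hypothetical:

{code}
import com.google.auto.service.AutoService;
import java.util.Collections;
import javax.annotation.Nonnull;
import org.apache.beam.sdk.io.FileSystem;
import org.apache.beam.sdk.io.FileSystemRegistrar;
import org.apache.beam.sdk.options.PipelineOptions;

// Sketch only: @AutoService generates the META-INF/services entry so that
// ServiceLoader can discover this registrar at runtime.
@AutoService(FileSystemRegistrar.class)
public class S3FileSystemRegistrar implements FileSystemRegistrar {
  @Override
  public Iterable<FileSystem> fromOptions(@Nonnull PipelineOptions options) {
    // S3FileSystem is a placeholder for the FileSystem implementation.
    return Collections.<FileSystem>singletonList(new S3FileSystem(options));
  }
}
{code}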


was (Author: jmarble):
I'm interested in implementing S3 support. Not being familiar with Beam internals, 
and without committing myself to anything, perhaps someone can comment on my 
research notes.

GCS is probably a good template. Implement FileSystem, ResourceId, 
FileSystemRegistrar, PathValidator, PipelineOptions, PipelineOptionsRegistrar:
https://github.com/apache/beam/tree/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp

For interacting with S3, this is probably the preferred SDK:
http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3

Some specifics about implementing FileSystem:

FileSystem.copy()
- AmazonS3Client.copyObject(String sourceBucketName, String sourceKey, String 
destinationBucketName, String destinationKey)
- max size per single call is 5GB, which is probably fine to start, but we'd need 
multipart upload to get the full 5TB limit

FileSystem.create()
- AmazonS3Client.putObject(String bucketName, String key, InputStream 
input, ObjectMetadata metadata)
- max size per single call is 5GB, which is probably fine to start, but we'd need 
multipart upload to get the full 5TB limit

FileSystem.delete()
- AmazonS3Client.deleteObjects(DeleteObjectsRequest deleteObjectsRequest)

FileSystem.getScheme()
- return "s3"

FileSystem.match()
- org.apache.beam.sdk.extensions.util.gcsfs.GcsPath and GcsUtil from the same 
package have some good ideas

FileSystem.matchNewResource()
- Look at GcsPath and GcsUtil

FileSystem.open()
- AmazonS3Client.getObject(String bucketName, String key)

FileSystem.rename()
- Can't find anything in AmazonS3Client; perhaps call FileSystem.copy(), then 
FileSystem.delete()

I'm not clear about how to register the s3 FileSystem as mentioned in the 
FileSystemRegistrar Javadoc:

"FileSystem creators have the ability to provide a registrar by creating a 
ServiceLoader entry and a concrete implementation of this interface.

It is optional but recommended to use one of the many build time tools such as 
AutoService to generate the necessary META-INF files automatically."

> Add support for S3 as a Apache Beam FileSystem
> --
>
> Key: BEAM-2500
> URL: https://issues.apache.org/jira/browse/BEAM-2500
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Luke Cwik
>Priority: Minor
> Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File 

[jira] [Commented] (BEAM-2955) Create a Cloud Bigtable HBase connector

2017-09-13 Thread Solomon Duskis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165523#comment-16165523
 ] 

Solomon Duskis commented on BEAM-2955:
--

Chamikara: HBaseIO will have to be extended or wrapped.  Cloud Bigtable needs 
slightly different configuration options, has a different way to calculate 
estimated sizes, and needs templating.  The interface would essentially be the 
same whether we leverage HBaseIO or BigtableIO.  The BigtableIO wrapper that I 
wrote was 271 lines of code.  

I'll create a PR for the BigtableIO wrapper in the Beam github project, since 
the code is already written.
I'll also create a PR for an extension of HBaseIO.

That way, we can compare the two options.

> Create a Cloud Bigtable HBase connector
> ---
>
> Key: BEAM-2955
> URL: https://issues.apache.org/jira/browse/BEAM-2955
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-gcp
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>
> The Cloud Bigtable (CBT) team has had a Dataflow connector maintained in a 
> different repo for a while. Recently, we did some reworking of the Cloud 
> Bigtable client that would allow it to better coexist in the Beam ecosystem, 
> and we also released a Beam connector in our repository that exposes HBase 
> idioms rather than the Protobuf idioms of BigtableIO.  More information about 
> the customer experience of the HBase connector can be found here: 
> [https://cloud.google.com/bigtable/docs/dataflow-hbase].
> The Beam repo is a much better place to house a Cloud Bigtable HBase 
> connector.  There are a couple of ways we can implement this new connector:
> # The CBT connector depends on artifacts in the io/hbase maven project.  We 
> can create a new extension of HBaseIO for the purposes of CBT.  We would have to 
> add some features to HBaseIO to make that work (dynamic rebalancing, and a 
> way for HBase and CBT's size estimation models to coexist)
> # The BigtableIO connector works well, and we can add an adapter layer on top 
> of it.  I have a proof of concept of it here: 
> [https://github.com/sduskis/cloud-bigtable-client/tree/add_beam/bigtable-dataflow-parent/bigtable-hbase-beam].
> # We can build a separate CBT HBase connector.
> I'm happy to do the work.  I would appreciate some guidance and discussion 
> about the right approach.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2956) DataflowRunner incorrectly reports the user agent for the Dataflow distribution

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165502#comment-16165502
 ] 

ASF GitHub Bot commented on BEAM-2956:
--

GitHub user lukecwik opened a pull request:

https://github.com/apache/beam/pull/3849

[BEAM-2956] Attempt to correctly report the Dataflow distribution in GCP 
related modules.

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [x] Each commit in the pull request should have a meaningful subject 
line and body.
 - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [x] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lukecwik/incubator-beam master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3849.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3849


commit e9176e9939f0dd61789718203cca95ed760e0954
Author: Luke Cwik 
Date:   2017-09-14T00:00:04Z

[BEAM-2956] Attempt to correctly report the Dataflow distribution in GCP 
related modules.




> DataflowRunner incorrectly reports the user agent for the Dataflow 
> distribution
> ---
>
> Key: BEAM-2956
> URL: https://issues.apache.org/jira/browse/BEAM-2956
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Luke Cwik
>Assignee: Luke Cwik
> Fix For: 2.2.0
>
>
> The DataflowRunner when distributed with the Dataflow SDK distribution may 
> incorrectly submit a user agent and properties from the Apache Beam 
> distribution.
> This occurs when the Apache Beam jars appear on the classpath before the 
> Dataflow SDK distribution. The fix is to not have two files at the same path 
> but to use two different paths, where the lack of the second path means that 
> we are using the Apache Beam distribution and its existence implies we are 
> using the Dataflow distribution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3849: [BEAM-2956] Attempt to correctly report the Dataflo...

2017-09-13 Thread lukecwik
GitHub user lukecwik opened a pull request:

https://github.com/apache/beam/pull/3849

[BEAM-2956] Attempt to correctly report the Dataflow distribution in GCP 
related modules.

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [x] Each commit in the pull request should have a meaningful subject 
line and body.
 - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [x] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lukecwik/incubator-beam master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3849.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3849


commit e9176e9939f0dd61789718203cca95ed760e0954
Author: Luke Cwik 
Date:   2017-09-14T00:00:04Z

[BEAM-2956] Attempt to correctly report the Dataflow distribution in GCP 
related modules.




---


[jira] [Updated] (BEAM-2956) DataflowRunner incorrectly reports the user agent for the Dataflow distribution

2017-09-13 Thread Luke Cwik (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik updated BEAM-2956:

Description: 
The DataflowRunner when distributed with the Dataflow SDK distribution may 
incorrectly submit a user agent and properties from the Apache Beam 
distribution.

This occurs when the Apache Beam jars appear on the classpath before the 
Dataflow SDK distribution. The fix is to not have two files at the same path 
but to use two different paths, where the lack of the second path means that we 
are using the Apache Beam distribution and its existence implies we are using 
the Dataflow distribution.
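
A rough illustration of that idea follows; the marker resource name is invented for 
illustration, not the actual file used by the fix:

{code}
import org.apache.beam.runners.dataflow.DataflowRunner;

// Sketch only: probe for a marker resource that ships solely with the
// Dataflow SDK distribution; its absence implies the Apache Beam
// distribution. The resource name below is a placeholder.
public class DistributionDetectionSketch {
  public static boolean isDataflowDistribution() {
    return DataflowRunner.class.getClassLoader()
        .getResource("org/apache/beam/runners/dataflow/dataflow.properties") != null;
  }
}
{code}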

> DataflowRunner incorrectly reports the user agent for the Dataflow 
> distribution
> ---
>
> Key: BEAM-2956
> URL: https://issues.apache.org/jira/browse/BEAM-2956
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Luke Cwik
>Assignee: Luke Cwik
> Fix For: 2.2.0
>
>
> The DataflowRunner when distributed with the Dataflow SDK distribution may 
> incorrectly submit a user agent and properties from the Apache Beam 
> distribution.
> This occurs when the Apache Beam jars appear on the classpath before the 
> Dataflow SDK distribution. The fix is to not have two files at the same path 
> but to use two different paths, where the lack of the second path means that 
> we are using the Apache Beam distribution and its existence implies we are 
> using the Dataflow distribution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2955) Create a Cloud Bigtable HBase connector

2017-09-13 Thread Solomon Duskis (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165431#comment-16165431
 ] 

Solomon Duskis commented on BEAM-2955:
--

It's awesome that you added the dynamic rebalancing!  I'm OK with extending 
HBaseIO, as long as there aren't any other overriding concerns.  I'd like to 
explore the possibility of templates (ValueProviders) for configuring HBaseIO.

> Create a Cloud Bigtable HBase connector
> ---
>
> Key: BEAM-2955
> URL: https://issues.apache.org/jira/browse/BEAM-2955
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-gcp
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>
> The Cloud Bigtable (CBT) team has had a Dataflow connector maintained in a 
> different repo for a while. Recently, we did some reworking of the Cloud 
> Bigtable client that would allow it to better coexist in the Beam ecosystem, 
> and we also released a Beam connector in our repository that exposes HBase 
> idioms rather than the Protobuf idioms of BigtableIO.  More information about 
> the customer experience of the HBase connector can be found here: 
> [https://cloud.google.com/bigtable/docs/dataflow-hbase].
> The Beam repo is a much better place to house a Cloud Bigtable HBase 
> connector.  There are a couple of ways we can implement this new connector:
> # The CBT connector depends on artifacts in the io/hbase maven project.  We 
> can create a new extension of HBaseIO for the purposes of CBT.  We would have to 
> add some features to HBaseIO to make that work (dynamic rebalancing, and a 
> way for HBase and CBT's size estimation models to coexist)
> # The BigtableIO connector works well, and we can add an adapter layer on top 
> of it.  I have a proof of concept of it here: 
> [https://github.com/sduskis/cloud-bigtable-client/tree/add_beam/bigtable-dataflow-parent/bigtable-hbase-beam].
> # We can build a separate CBT HBase connector.
> I'm happy to do the work.  I would appreciate some guidance and discussion 
> about the right approach.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2956) DataflowRunner incorrectly reports the user agent for the Dataflow distribution

2017-09-13 Thread Luke Cwik (JIRA)
Luke Cwik created BEAM-2956:
---

 Summary: DataflowRunner incorrectly reports the user agent for the 
Dataflow distribution
 Key: BEAM-2956
 URL: https://issues.apache.org/jira/browse/BEAM-2956
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Affects Versions: 2.1.0, 2.0.0
Reporter: Luke Cwik
Assignee: Luke Cwik
 Fix For: 2.2.0






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2500) Add support for S3 as a Apache Beam FileSystem

2017-09-13 Thread Jacob Marble (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165391#comment-16165391
 ] 

Jacob Marble commented on BEAM-2500:


I'm interested in implementing S3 support. Not being familiar with Beam internals, 
and without committing myself to anything, perhaps someone can comment on my 
research notes.

GCS is probably a good template. Implement FileSystem, ResourceId, 
FileSystemRegistrar, PathValidator, PipelineOptions, PipelineOptionsRegistrar:
https://github.com/apache/beam/tree/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp

For interacting with S3, this is probably the preferred SDK:
http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3

Some specifics about implementing FileSystem:

FileSystem.copy()
- AmazonS3Client.copyObject(String sourceBucketName, String sourceKey, String 
destinationBucketName, String destinationKey)
- max size per single call is 5GB, which is probably fine to start, but we'd need 
multipart upload to get the full 5TB limit

FileSystem.create()
- AmazonS3Client.putObject(String bucketName, String key, InputStream 
input, ObjectMetadata metadata)
- max size per single call is 5GB, which is probably fine to start, but we'd need 
multipart upload to get the full 5TB limit

FileSystem.delete()
- AmazonS3Client.deleteObjects(DeleteObjectsRequest deleteObjectsRequest)

FileSystem.getScheme()
- return "s3"

FileSystem.match()
- org.apache.beam.sdk.extensions.util.gcsfs.GcsPath and GcsUtil from the same 
package have some good ideas

FileSystem.matchNewResource()
- Look at GcsPath and GcsUtil

FileSystem.open()
- AmazonS3Client.getObject(String bucketName, String key)

FileSystem.rename()
- Can't find anything in AmazonS3Client; perhaps call FileSystem.copy(), then 
FileSystem.delete()

I'm not clear about how to register the s3 FileSystem as mentioned in the 
FileSystemRegistrar Javadoc:

"FileSystem creators have the ability to provide a registrar by creating a 
ServiceLoader entry and a concrete implementation of this interface.

It is optional but recommended to use one of the many build time tools such as 
AutoService to generate the necessary META-INF files automatically."

> Add support for S3 as a Apache Beam FileSystem
> --
>
> Key: BEAM-2500
> URL: https://issues.apache.org/jira/browse/BEAM-2500
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Luke Cwik
>Priority: Minor
> Attachments: hadoop_fs_patch.patch
>
>
> Note that this is for providing direct integration with S3 as an Apache Beam 
> FileSystem.
> There is already support for using the Hadoop S3 connector by depending on 
> the Hadoop File System module[1], configuring HadoopFileSystemOptions[2] with 
> a S3 configuration[3].
> 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
> 2: 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
> 3: https://wiki.apache.org/hadoop/AmazonS3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2955) Create a Cloud Bigtable HBase connector

2017-09-13 Thread Chamikara Jayalath (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chamikara Jayalath reassigned BEAM-2955:


Assignee: Solomon Duskis  (was: Chamikara Jayalath)

> Create a Cloud Bigtable HBase connector
> ---
>
> Key: BEAM-2955
> URL: https://issues.apache.org/jira/browse/BEAM-2955
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-gcp
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>
> The Cloud Bigtable (CBT) team has had a Dataflow connector maintained in a 
> different repo for a while. Recently, we did some reworking of the Cloud 
> Bigtable client that would allow it to better coexist in the Beam ecosystem, 
> and we also released a Beam connector in our repository that exposes HBase 
> idioms rather than the Protobuf idioms of BigtableIO.  More information about 
> the customer experience of the HBase connector can be found here: 
> [https://cloud.google.com/bigtable/docs/dataflow-hbase].
> The Beam repo is a much better place to house a Cloud Bigtable HBase 
> connector.  There are a couple of ways we can implement this new connector:
> # The CBT connector depends on artifacts in the io/hbase maven project.  We 
> can create a new extension of HBaseIO for the purposes of CBT.  We would have to 
> add some features to HBaseIO to make that work (dynamic rebalancing, and a 
> way for HBase and CBT's size estimation models to coexist)
> # The BigtableIO connector works well, and we can add an adapter layer on top 
> of it.  I have a proof of concept of it here: 
> [https://github.com/sduskis/cloud-bigtable-client/tree/add_beam/bigtable-dataflow-parent/bigtable-hbase-beam].
> # We can build a separate CBT HBase connector.
> I'm happy to do the work.  I would appreciate some guidance and discussion 
> about the right approach.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2955) Create a Cloud Bigtable HBase connector

2017-09-13 Thread Chamikara Jayalath (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165347#comment-16165347
 ] 

Chamikara Jayalath commented on BEAM-2955:
--

I like approach (1) as well, since it seems like it'll minimize code duplication. 
Are there any drawbacks (features, performance) of approach (1) compared to 
approach (2)?

Assigning this JIRA to you.

Also ccing some folks interested in I/O.
[~reuvenlax] [~jkff] [~jbonofre]

> Create a Cloud Bigtable HBase connector
> ---
>
> Key: BEAM-2955
> URL: https://issues.apache.org/jira/browse/BEAM-2955
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-gcp
>Reporter: Solomon Duskis
>Assignee: Chamikara Jayalath
>
> The Cloud Bigtable (CBT) team has had a Dataflow connector maintained in a 
> different repo for a while. Recently, we did some reworking of the Cloud 
> Bigtable client that would allow it to better coexist in the Beam ecosystem, 
> and we also released a Beam connector in our repository that exposes HBase 
> idioms rather than the Protobuf idioms of BigtableIO.  More information about 
> the customer experience of the HBase connector can be found here: 
> [https://cloud.google.com/bigtable/docs/dataflow-hbase].
> The Beam repo is a much better place to house a Cloud Bigtable HBase 
> connector.  There are a couple of ways we can implement this new connector:
> # The CBT connector depends on artifacts in the io/hbase maven project.  We 
> can create a new extension of HBaseIO for the purposes of CBT.  We would have to 
> add some features to HBaseIO to make that work (dynamic rebalancing, and a 
> way for HBase and CBT's size estimation models to coexist)
> # The BigtableIO connector works well, and we can add an adapter layer on top 
> of it.  I have a proof of concept of it here: 
> [https://github.com/sduskis/cloud-bigtable-client/tree/add_beam/bigtable-dataflow-parent/bigtable-hbase-beam].
> # We can build a separate CBT HBase connector.
> I'm happy to do the work.  I would appreciate some guidance and discussion 
> about the right approach.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2954) update shade configurations in extension/sql

2017-09-13 Thread Xu Mingmin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Mingmin updated BEAM-2954:
-
Summary: update shade configurations in extension/sql  (was: opt-out shade 
in extension/sql)

> update shade configurations in extension/sql
> 
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} which is not shaded. The error below is 
> thrown when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
>   at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2954) update shade configurations in extension/sql

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165344#comment-16165344
 ] 

ASF GitHub Bot commented on BEAM-2954:
--

GitHub user XuMingmin opened a pull request:

https://github.com/apache/beam/pull/3848

[BEAM-2954] update shade configurations in extension/sql

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject 
line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/XuMingmin/beam BEAM-2954

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3848.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3848


commit 522e6c188809327f098b84dbd7ee07e1dad4a376
Author: mingmxu 
Date:   2017-09-13T17:31:19Z

update shade settings to handle Calcite dependencies




> update shade configurations in extension/sql
> 
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} which is not shaded. The error below is 
> thrown when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
>   at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3848: [BEAM-2954] update shade configurations in extensio...

2017-09-13 Thread XuMingmin
GitHub user XuMingmin opened a pull request:

https://github.com/apache/beam/pull/3848

[BEAM-2954] update shade configurations in extension/sql

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject 
line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/XuMingmin/beam BEAM-2954

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3848.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3848


commit 522e6c188809327f098b84dbd7ee07e1dad4a376
Author: mingmxu 
Date:   2017-09-13T17:31:19Z

update shade settings to handle Calcite dependencies




---


[jira] [Commented] (BEAM-2954) opt-out shade in extension/sql

2017-09-13 Thread Xu Mingmin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165338#comment-16165338
 ] 

Xu Mingmin commented on BEAM-2954:
--

Makes sense to me; I'll switch to shading the Calcite dependencies instead.

> opt-out shade in extension/sql
> --
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} which is not shaded. The error below is 
> thrown when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
>   at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2954) opt-out shade in extension/sql

2017-09-13 Thread Aviem Zur (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165183#comment-16165183
 ] 

Aviem Zur edited comment on BEAM-2954 at 9/13/17 8:25 PM:
--

We cannot opt-out of the shading rules in the SQL module as we cannot leak 
{{Guava}} to our users.

If a shaded version of {{Calcite}} which relocates {{Guava}} exists we should 
use that (such a version does not seem to exist). Otherwise, since {{Calcite}} 
is under {{Apache License 2.0}} we can shade it into our SQL module, using the 
same relocation rules applied in {{beam-parent}}.


was (Author: aviemzur):
We cannot opt-out of the shading rules in the SQL module as we cannot leak 
{{Guava}} to our users.

If a shaded version of {{Calcite}} exists which relocates {{Guava}} we should 
use that (such a version does not seem to exist). Otherwise, since {{Calcite}} 
is under {{Apache License 2.0}} we can shade it into our SQL module, using the 
same relocation rules applied in {{beam-parent}}.

> opt-out shade in extension/sql
> --
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} which is not shaded. The error below is 
> thrown when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
>   at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2955) Create a Cloud Bigtable HBase connector

2017-09-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/BEAM-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165196#comment-16165196
 ] 

Ismaël Mejía commented on BEAM-2955:


This is a great idea [~sduskis]. I would be in favor of (1), but I can understand if the 
current BigtableIO maintainers prefer (2). If we go the io/hbase route I would 
really like us to share as much code as possible, and of course we can 
refactor it freely to adapt to the other client.
About the missing parts, I added Dynamic Work Rebalancing to it recently (not 
sure if this is what you were referring to). And I agree that size estimation should 
be separated. I am OOO until the beginning of October, but you can count on me for 
anything related to this.

> Create a Cloud Bigtable HBase connector
> ---
>
> Key: BEAM-2955
> URL: https://issues.apache.org/jira/browse/BEAM-2955
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-gcp
>Reporter: Solomon Duskis
>Assignee: Chamikara Jayalath
>
> The Cloud Bigtable (CBT) team has had a Dataflow connector maintained in a 
> different repo for a while. Recently, we did some reworking of the Cloud 
> Bigtable client that would allow it to better coexist in the Beam ecosystem, 
> and we also released a Beam connector in our repository that exposes HBase 
> idioms rather than the Protobuf idioms of BigtableIO.  More information about 
> the customer experience of the HBase connector can be found here: 
> [https://cloud.google.com/bigtable/docs/dataflow-hbase].
> The Beam repo is a much better place to house a Cloud Bigtable HBase 
> connector.  There are a couple of ways we can implement this new connector:
> # The CBT connector depends on artifacts in the io/hbase maven project.  We 
> can create a new extension of HBaseIO for the purposes of CBT.  We would have to 
> add some features to HBaseIO to make that work (dynamic rebalancing, and a 
> way for HBase and CBT's size estimation models to coexist)
> # The BigtableIO connector works well, and we can add an adapter layer on top 
> of it.  I have a proof of concept of it here: 
> [https://github.com/sduskis/cloud-bigtable-client/tree/add_beam/bigtable-dataflow-parent/bigtable-hbase-beam].
> # We can build a separate CBT HBase connector.
> I'm happy to do the work.  I would appreciate some guidance and discussion 
> about the right approach.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2954) opt-out shade in extension/sql

2017-09-13 Thread Aviem Zur (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aviem Zur updated BEAM-2954:

Description: 
{{guava}} is shaded by default in beam modules, while {{calcite}} (a dependency 
of SQL) includes {{guava}} which is not shaded. The error below is thrown when 
calling {{calcite}} methods with guava classes.

{code}
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
at 
org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
at 
org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
at 
org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
at 
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
at 
org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
at 
org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
{code}

  was:
{{guava}} is shaded by default in beam modules, while {{calcite}} (a dependency 
of SQL) includes {{guava}} which is not shaded. The error below is thrown when 
calling {{calcite}} methods with guava classes.

{code}
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
at 
org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
at 
org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
at 
org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
at 
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
at 
org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
at 
org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
at 
com.ebay.dss.beam_sql_demo.ToraSqlCli.executeQuery(ToraSqlCli.java:72)
at com.ebay.dss.beam_sql_demo.ToraSqlCli.exec(ToraSqlCli.java:111)
at com.ebay.dss.beam_sql_demo.ToraSqlCli.main(ToraSqlCli.java:130)
{code}


> opt-out shade in extension/sql
> --
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} which is not shaded. The error below is 
> thrown when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> 

[jira] [Comment Edited] (BEAM-2954) opt-out shade in extension/sql

2017-09-13 Thread Aviem Zur (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165183#comment-16165183
 ] 

Aviem Zur edited comment on BEAM-2954 at 9/13/17 7:48 PM:
--

We cannot opt-out of the shading rules in the SQL module as we cannot leak 
{{Guava}} to our users.

If a shaded version of {{Calcite}} exists which relocates {{Guava}} we should 
use that (such a version does not seem to exist). Otherwise, since {{Calcite}} 
is under {{Apache License 2.0}} we can shade it into our SQL module, using the 
same relocation rules applied in {{beam-parent}}.


was (Author: aviemzur):
We cannot opt-out of the shading rules in the SQL module as we cannot leak 
Guava to our users.

If a shaded version of {{Calcite}} exists which relocates {{Guava}} we should 
use that (such a version does not seem to exist). Otherwise, since {{Calcite}} 
is under {{Apache License 2.0}} we can shade it into our SQL module, using the 
same relocation rules applied in {{beam-parent}}.

> opt-out shade in extension/sql
> --
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} which is not shaded. The error below is 
> thrown when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
>   at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
>   at 
> com.ebay.dss.beam_sql_demo.ToraSqlCli.executeQuery(ToraSqlCli.java:72)
>   at com.ebay.dss.beam_sql_demo.ToraSqlCli.exec(ToraSqlCli.java:111)
>   at com.ebay.dss.beam_sql_demo.ToraSqlCli.main(ToraSqlCli.java:130)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2954) opt-out shade in extension/sql

2017-09-13 Thread Aviem Zur (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165183#comment-16165183
 ] 

Aviem Zur commented on BEAM-2954:
-

We cannot opt-out of the shading rules in the SQL module as we cannot leak 
Guava to our users.

If a shaded version of {{Calcite}} exists which relocates {{Guava}} we should 
use that (such a version does not seem to exist). Otherwise, since {{Calcite}} 
is under {{Apache License 2.0}} we can shade it into our SQL module, using the 
same relocation rules applied in {{beam-parent}}.

> opt-out shade in extension/sql
> --
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} which is not shaded. The error below is 
> thrown when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
>   at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
>   at 
> com.ebay.dss.beam_sql_demo.ToraSqlCli.executeQuery(ToraSqlCli.java:72)
>   at com.ebay.dss.beam_sql_demo.ToraSqlCli.exec(ToraSqlCli.java:111)
>   at com.ebay.dss.beam_sql_demo.ToraSqlCli.main(ToraSqlCli.java:130)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is back to stable : beam_PostCommit_Java_ValidatesRunner_Dataflow #3964

2017-09-13 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2790) Error while reading from Amazon S3 via Hadoop File System

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165153#comment-16165153
 ] 

ASF GitHub Bot commented on BEAM-2790:
--

Github user iemejia closed the pull request at:

https://github.com/apache/beam/pull/3744


> Error while reading from Amazon S3 via Hadoop File System
> -
>
> Key: BEAM-2790
> URL: https://issues.apache.org/jira/browse/BEAM-2790
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Ismaël Mejía
>Assignee: Etienne Chauchot
> Fix For: 2.2.0
>
>
> If you try to use hadoop-aws with Beam to read from AWS S3 it breaks because 
> S3AInputStream (the stream wrapped by Hadoop's FSDataInputStream) does not 
> implement ByteBufferReadable.
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: 
> Byte-buffer read unsupported by input stream
>   at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:146)
>   at 
> org.apache.beam.sdk.io.hdfs.HadoopFileSystem$HadoopSeekableByteChannel.read(HadoopFileSystem.java:192)
>   at 
> org.apache.beam.sdk.io.TextSource$TextBasedReader.tryToEnsureNumberOfBytesInBuffer(TextSource.java:232)
>   at 
> org.apache.beam.sdk.io.TextSource$TextBasedReader.findSeparatorBounds(TextSource.java:166)
>   at 
> org.apache.beam.sdk.io.TextSource$TextBasedReader.readNextRecord(TextSource.java:198)
>   at 
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:481)
>   at 
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
>   at 
> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:261)
> {code}
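
A rough sketch of the kind of byte[] fallback that avoids the error above (not the 
actual patch; the helper class below is hypothetical):

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

// Sketch only: fall back to a plain byte[] read when the wrapped stream
// (e.g. S3AInputStream) does not implement ByteBufferReadable.
final class ByteBufferFallbackSketch {
  static int read(FSDataInputStream in, ByteBuffer dst) throws IOException {
    if (in.getWrappedStream() instanceof ByteBufferReadable) {
      return in.read(dst); // direct ByteBuffer read is supported
    }
    byte[] buf = new byte[dst.remaining()];
    int n = in.read(buf, 0, buf.length); // plain byte[] read always works
    if (n > 0) {
      dst.put(buf, 0, n);
    }
    return n;
  }
}
{code}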



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3744: [BEAM-2790] Use byte[] instead of ByteBuffer to rea...

2017-09-13 Thread iemejia
Github user iemejia closed the pull request at:

https://github.com/apache/beam/pull/3744


---


[jira] [Created] (BEAM-2955) Create a Cloud Bigtable HBase connector

2017-09-13 Thread Solomon Duskis (JIRA)
Solomon Duskis created BEAM-2955:


 Summary: Create a Cloud Bigtable HBase connector
 Key: BEAM-2955
 URL: https://issues.apache.org/jira/browse/BEAM-2955
 Project: Beam
  Issue Type: New Feature
  Components: sdk-java-gcp
Reporter: Solomon Duskis
Assignee: Chamikara Jayalath


The Cloud Bigtable (CBT) team has had a Dataflow connector maintained in a 
different repo for a while. Recently, we did some reworking of the Cloud 
Bigtable client that would allow it to better coexist in the Beam ecosystem, 
and we also released a Beam connector in our repository that exposes HBase 
idioms rather than the Protobuf idioms of BigtableIO.  More information about 
the customer experience of the HBase connector can be found here: 
[https://cloud.google.com/bigtable/docs/dataflow-hbase].

The Beam repo is a much better place to house a Cloud Bigtable HBase connector. 
 There are a couple of ways we can implement this new connector:

# The CBT connector depends on artifacts in the io/hbase maven project.  We can 
create a new extension of HBaseIO for the purposes of CBT.  We would have to add some 
features to HBaseIO to make that work (dynamic rebalancing, and a way for HBase 
and CBT's size estimation models to coexist)
# The BigtableIO connector works well, and we can add an adapter layer on top 
of it.  I have a proof of concept of it here: 
[https://github.com/sduskis/cloud-bigtable-client/tree/add_beam/bigtable-dataflow-parent/bigtable-hbase-beam].
# We can build a separate CBT HBase connector.

I'm happy to do the work.  I would appreciate some guidance and discussion 
about the right approach.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] beam pull request #3830: Jenkins Test [DO NOT MERGE]

2017-09-13 Thread robertwb
Github user robertwb closed the pull request at:

https://github.com/apache/beam/pull/3830


---


[GitHub] beam pull request #3847: Refactor fn api runner into universal local runner

2017-09-13 Thread robertwb
GitHub user robertwb opened a pull request:

https://github.com/apache/beam/pull/3847

Refactor fn api runner into universal local runner

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject 
line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/robertwb/incubator-beam ulr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3847


commit 738417af5951e1a84ff184539814cf5cb7fbba1f
Author: Robert Bradshaw 
Date:   2017-09-08T19:25:42Z

Refactor fn api runner into universal local runner.

commit 433a15d3542753b40699aa7185a37dd90f19c48e
Author: Robert Bradshaw 
Date:   2017-09-08T22:31:17Z

Implement ULR subprocess mode.

commit 36c229edea598fe694a9ef936b51fc7c83041b41
Author: Robert Bradshaw 
Date:   2017-09-09T01:25:10Z

Allow worker to be started as a subprocess.

commit f5e97a2c0ff130abc1f6c98adf549e39313b2ea1
Author: Robert Bradshaw 
Date:   2017-09-09T01:36:00Z

Streaming Job API.

commit 7a6abdf6f2cb69631da9f0880842e3615c9612fd
Author: Robert Bradshaw 
Date:   2017-09-13T17:34:53Z

Lint and docs.




---


[jira] [Created] (BEAM-2954) opt-out shade in extension/sql

2017-09-13 Thread Xu Mingmin (JIRA)
Xu Mingmin created BEAM-2954:


 Summary: opt-out shade in extension/sql
 Key: BEAM-2954
 URL: https://issues.apache.org/jira/browse/BEAM-2954
 Project: Beam
  Issue Type: Bug
  Components: dsl-sql
Reporter: Xu Mingmin
Assignee: Xu Mingmin
 Fix For: 2.2.0


{{guava}} is shaded by default in Beam modules, while {{calcite}} (a dependency 
of SQL) includes {{guava}}, which is not shaded. The error below is thrown when 
calling {{calcite}} methods with guava classes.

{code}
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
at 
org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
at 
org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
at 
org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
at 
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
at 
org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
at 
org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
at 
org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
at 
org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
at 
com.ebay.dss.beam_sql_demo.ToraSqlCli.executeQuery(ToraSqlCli.java:72)
at com.ebay.dss.beam_sql_demo.ToraSqlCli.exec(ToraSqlCli.java:111)
at com.ebay.dss.beam_sql_demo.ToraSqlCli.main(ToraSqlCli.java:130)
{code}
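
For context, a sketch of the kind of shade relocation that would produce the 
{{repackaged}} prefix in the stack trace above; this is an assumption about the 
module's build, not a quote of its pom. Opting out would mean removing or 
overriding this relocation for extensions/sql (or, alternatively, relocating 
calcite together with its guava so both sides resolve to the same packages).

{code:xml}
<!-- Assumed maven-shade-plugin relocation; the shadedPattern matches the
     prefix seen in the NoSuchMethodError above. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.beam.sdks.java.extensions.sql.repackaged.com.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
{code}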



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2868) Allow FileSystems.copy() to copy between different scheme

2017-09-13 Thread Luke Cwik (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164876#comment-16164876
 ] 

Luke Cwik commented on BEAM-2868:
-

We want to stay away from methods that take only a single item, since we can 
often do better with bulk operations, and people will reach for the inefficient 
single-item variant even when they have a list of things to get done, because it 
will seem easier. Users who truly have only one thing to copy should rely on 
[https://docs.oracle.com/javase/8/docs/api/java/util/Collections.html#singletonList-T-].
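
A minimal sketch of that pattern against the existing bulk signature of 
{{FileSystems.copy()}}; the helper class and the path arguments are placeholders:

{code:java}
import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;

class SingleFileCopy {
  /** Copies one file by wrapping the source and destination in singleton lists. */
  static void copyOne(String srcSpec, String dstSpec) throws IOException {
    ResourceId src = FileSystems.matchNewResource(srcSpec, /* isDirectory= */ false);
    ResourceId dst = FileSystems.matchNewResource(dstSpec, /* isDirectory= */ false);
    FileSystems.copy(Collections.singletonList(src), Collections.singletonList(dst));
  }
}
{code}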

> Allow FileSystems.copy() to copy between different scheme
> -
>
> Key: BEAM-2868
> URL: https://issues.apache.org/jira/browse/BEAM-2868
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Reza ardeshir rokni
>Priority: Minor
>
> It would be useful to allow FileSystem.copy() to copy between 
> different schemes, for example when copying files from an object store to the 
> worker. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2953) Create more advanced Timeseries processing examples using state API

2017-09-13 Thread Reza ardeshir rokni (JIRA)
Reza ardeshir rokni created BEAM-2953:
-

 Summary: Create more advanced Timeseries processing examples using 
state API
 Key: BEAM-2953
 URL: https://issues.apache.org/jira/browse/BEAM-2953
 Project: Beam
  Issue Type: Improvement
  Components: examples-java
Affects Versions: 2.1.0
Reporter: Reza ardeshir rokni
Assignee: Reuven Lax
Priority: Minor


As described in the phase 1 portion of this solution outline:
https://cloud.google.com/solutions/correlating-time-series-dataflow

Beam can be used to build out some very interesting pre-processing stages for 
time series data. Some examples that would be useful:

- Downsampling time series based on simple AVG, MIN, MAX
- Creating a value for each time window using GenerateSequence as a seed 
- Seeding a downsample with the previous value (used in FX, where the previous 
close is carried into the current open value); see the sketch below

This will show some concrete examples of keyed state as well as the use of 
combiners. 
The samples can also be used to show how to create an ordered list of 
values per key from an unbounded topic that has multiple time series keys. 
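
A minimal sketch of the "previous value" idea using the state API, assuming 
keyed input of (seriesId, downsampledValue) pairs; the class and state names are 
illustrative:

{code:java}
import org.apache.beam.sdk.coders.DoubleCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class CarryPreviousValueFn extends DoFn<KV<String, Double>, KV<String, Double>> {
  // Per-key cell holding the most recent downsampled value.
  @StateId("previous")
  private final StateSpec<ValueState<Double>> previousSpec = StateSpecs.value(DoubleCoder.of());

  @ProcessElement
  public void processElement(
      ProcessContext c, @StateId("previous") ValueState<Double> previous) {
    Double prev = previous.read();          // null for the first element of a key
    double current = c.element().getValue();
    // Emit the previous close (or the current value when there is none) as the seed.
    c.output(KV.of(c.element().getKey(), prev == null ? current : prev));
    previous.write(current);
  }
}
{code}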




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is still unstable: beam_PostCommit_Java_ValidatesRunner_Dataflow #3963

2017-09-13 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : beam_PostCommit_Java_ValidatesRunner_Apex #2383

2017-09-13 Thread Apache Jenkins Server
See 




Jenkins build is back to stable : beam_PostCommit_Java_ValidatesRunner_Spark #3066

2017-09-13 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : beam_PostCommit_Java_ValidatesRunner_Flink #3832

2017-09-13 Thread Apache Jenkins Server
See 




[jira] [Commented] (BEAM-2868) Allow FileSystems.copy() to copy between different scheme

2017-09-13 Thread Reza ardeshir rokni (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164440#comment-16164440
 ] 

Reza ardeshir rokni commented on BEAM-2868:
---

For this use case, the move is often going to be a single file. Would it make 
sense to create a .copy() signature that just has a single file source and 
destination?

> Allow FileSystems.copy() to copy between different scheme
> -
>
> Key: BEAM-2868
> URL: https://issues.apache.org/jira/browse/BEAM-2868
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Reza ardeshir rokni
>Priority: Minor
>
> It would be useful to allow FileSystem.copy() to copy between 
> different schemes, for example when copying files from an object store to the 
> worker. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build became unstable: beam_PostCommit_Java_ValidatesRunner_Dataflow #3962

2017-09-13 Thread Apache Jenkins Server
See 




Jenkins build became unstable: beam_PostCommit_Java_ValidatesRunner_Spark #3065

2017-09-13 Thread Apache Jenkins Server
See 




[GitHub] beam pull request #3846: JStorm-runner: Remove unnecessary WARN log which mi...

2017-09-13 Thread bastiliu
GitHub user bastiliu opened a pull request:

https://github.com/apache/beam/pull/3846

JStorm-runner: Remove unnecessary WARN log which might cause confusion when 
translating composite PTransforms

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject 
line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bastiliu/beam jstorm-runner

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/3846.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3846


commit 03cc311cfbabd92390d7a848a135b59d9d80530c
Author: basti.lj 
Date:   2017-09-13T08:17:14Z

JStorm-runner: Remove unnecessary WARN log, which might cause confusion.




---


[jira] [Closed] (BEAM-2859) ProcessingTime based timers are not properly fired in case the watermark stays put

2017-09-13 Thread Stas Levin (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stas Levin closed BEAM-2859.

   Resolution: Fixed
Fix Version/s: 2.2.0

> ProcessingTime based timers are not properly fired in case the watermark 
> stays put
> --
>
> Key: BEAM-2859
> URL: https://issues.apache.org/jira/browse/BEAM-2859
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Stas Levin
>Assignee: Stas Levin
> Fix For: 2.2.0
>
>
> {{AfterProcessingTime}}-based timers are not fired when the input watermark 
> does not advance, preventing buffered elements from being emitted.
> The reason seems to be that {{SparkTimerInternals#getTimersReadyToProcess()}} 
> determines which timers are ready to be processed based on the following 
> condition: 
> {code:java}
> timer.getTimestamp().isBefore(inputWatermark)
> {code}
> However, if the timer domain is {{TimeDomain.PROCESSING_TIME}}, the position 
> of the input watermark should *NOT* have any effect.
> In addition, {{SparkTimerInternals#getTimersReadyToProcess()}} deletes timers 
> once they are deemed eligible for processing (even though they will not 
> necessarily fire). 
> This may not be the correct behavior for timers in general, and for timers in 
> the {{TimeDomain.PROCESSING_TIME}} domain in particular, since they should 
> remain scheduled until the corresponding window expires and all state is cleared.
> For instance, consider a timer that is found eligible for processing and is 
> therefore deleted; it may then turn out that its {{shouldFire()}} returns 
> {{false}}, so it is not fired and needs to run again next time around, but 
> won't, since it has been deleted. The implied moral is that _"eligible for 
> processing"_ does not imply _"should be deleted"_.
> It may be better to avoid removing timers in 
> {{SparkTimerInternals#getTimersReadyToProcess()}} and leave timer management 
> up to {{ReduceFnRunner#clearAllState()}}, which has more context to determine 
> whether it's time for a given timer to be deleted.
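
To make the suggested behavior concrete, a minimal sketch of the eligibility 
check the description argues for (just the shape of the condition, not the 
merged fix): only event-time timers are gated on the input watermark.

{code:java}
import org.apache.beam.runners.core.TimerInternals;
import org.apache.beam.sdk.state.TimeDomain;
import org.joda.time.Instant;

class TimerEligibility {
  /** Sketch: processing-time timers are always eligible; event-time timers wait for the watermark. */
  static boolean eligibleForProcessing(TimerInternals.TimerData timer, Instant inputWatermark) {
    return !timer.getDomain().equals(TimeDomain.EVENT_TIME)
        || timer.getTimestamp().isBefore(inputWatermark);
  }
}
{code}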



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Build failed in Jenkins: beam_PostCommit_Java_ValidatesRunner_Apex #2382

2017-09-13 Thread Apache Jenkins Server
See 


Changes:

[staslevin] [BEAM-2859] Fixed processing timers not being properly fired when

--
[...truncated 42.93 KB...]
2017-09-13T08:07:26.477 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/maven/enforcer/enforcer-api/3.0.0-M1/enforcer-api-3.0.0-M1.pom
 (3 KB at 81.1 KB/sec)
2017-09-13T08:07:26.480 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.24/plexus-utils-3.0.24.pom
2017-09-13T08:07:26.509 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.24/plexus-utils-3.0.24.pom
 (5 KB at 139.0 KB/sec)
2017-09-13T08:07:26.511 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus/4.0/plexus-4.0.pom
2017-09-13T08:07:26.540 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus/4.0/plexus-4.0.pom
 (21 KB at 724.0 KB/sec)
2017-09-13T08:07:26.544 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-component-annotations/1.6/plexus-component-annotations-1.6.pom
2017-09-13T08:07:26.570 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-component-annotations/1.6/plexus-component-annotations-1.6.pom
 (748 B at 28.1 KB/sec)
2017-09-13T08:07:26.572 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-containers/1.6/plexus-containers-1.6.pom
2017-09-13T08:07:26.598 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-containers/1.6/plexus-containers-1.6.pom
 (4 KB at 141.5 KB/sec)
2017-09-13T08:07:26.600 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus/3.3.2/plexus-3.3.2.pom
2017-09-13T08:07:26.629 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus/3.3.2/plexus-3.3.2.pom
 (22 KB at 725.0 KB/sec)
2017-09-13T08:07:26.633 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-container-default/1.0-alpha-9/plexus-container-default-1.0-alpha-9.pom
2017-09-13T08:07:26.659 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-container-default/1.0-alpha-9/plexus-container-default-1.0-alpha-9.pom
 (2 KB at 46.4 KB/sec)
2017-09-13T08:07:26.661 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-containers/1.0.3/plexus-containers-1.0.3.pom
2017-09-13T08:07:26.687 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-containers/1.0.3/plexus-containers-1.0.3.pom
 (492 B at 18.5 KB/sec)
2017-09-13T08:07:26.689 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus/1.0.4/plexus-1.0.4.pom
2017-09-13T08:07:26.716 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus/1.0.4/plexus-1.0.4.pom
 (6 KB at 215.5 KB/sec)
2017-09-13T08:07:26.718 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/junit/junit/4.11/junit-4.11.pom
2017-09-13T08:07:26.745 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/junit/junit/4.11/junit-4.11.pom (3 KB at 
84.8 KB/sec)
2017-09-13T08:07:26.747 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.pom
2017-09-13T08:07:26.773 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.pom
 (766 B at 28.8 KB/sec)
2017-09-13T08:07:26.774 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/hamcrest/hamcrest-parent/1.3/hamcrest-parent-1.3.pom
2017-09-13T08:07:26.801 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/hamcrest/hamcrest-parent/1.3/hamcrest-parent-1.3.pom
 (2 KB at 74.1 KB/sec)
2017-09-13T08:07:26.803 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/classworlds/classworlds/1.1-alpha-2/classworlds-1.1-alpha-2.pom
2017-09-13T08:07:26.829 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/classworlds/classworlds/1.1-alpha-2/classworlds-1.1-alpha-2.pom
 (4 KB at 117.5 KB/sec)
2017-09-13T08:07:26.831 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/maven/maven-project/2.0.9/maven-project-2.0.9.pom
2017-09-13T08:07:26.857 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/maven/maven-project/2.0.9/maven-project-2.0.9.pom
 (3 KB at 101.8 KB/sec)
2017-09-13T08:07:26.859 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/maven/maven/2.0.9/maven-2.0.9.pom
2017-09-13T08:07:26.887 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/maven/maven/2.0.9/maven-2.0.9.pom
 (19 KB at 659.4 KB/sec)
2017-09-13T08:07:26.890 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/maven/maven-parent/8/maven-parent-8.pom
2017-09-13T08:07:26.921 [INFO] Downloaded: 

[jira] [Commented] (BEAM-2859) ProcessingTime based timers are not properly fired in case the watermark stays put

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164283#comment-16164283
 ] 

ASF GitHub Bot commented on BEAM-2859:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3840


> ProcessingTime based timers are not properly fired in case the watermark 
> stays put
> --
>
> Key: BEAM-2859
> URL: https://issues.apache.org/jira/browse/BEAM-2859
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Stas Levin
>Assignee: Stas Levin
>
> {{AfterProcessingTime}}-based timers are not fired when the input watermark 
> does not advance, preventing buffered elements from being emitted.
> The reason seems to be that {{SparkTimerInternals#getTimersReadyToProcess()}} 
> determines which timers are ready to be processed based on the following 
> condition: 
> {code:java}
> timer.getTimestamp().isBefore(inputWatermark)
> {code}
> However, if the timer domain is {{TimeDomain.PROCESSING_TIME}}, the position 
> of the input watermark should *NOT* have any effect.
> In addition, {{SparkTimerInternals#getTimersReadyToProcess()}} deletes timers 
> once they are deemed eligible for processing (even though they will not 
> necessarily fire). 
> This may not be the correct behavior for timers in general, and for timers in 
> the {{TimeDomain.PROCESSING_TIME}} domain in particular, since they should 
> remain scheduled until the corresponding window expires and all state is cleared.
> For instance, consider a timer that is found eligible for processing and is 
> therefore deleted; it may then turn out that its {{shouldFire()}} returns 
> {{false}}, so it is not fired and needs to run again next time around, but 
> won't, since it has been deleted. The implied moral is that _"eligible for 
> processing"_ does not imply _"should be deleted"_.
> It may be better to avoid removing timers in 
> {{SparkTimerInternals#getTimersReadyToProcess()}} and leave timer management 
> up to {{ReduceFnRunner#clearAllState()}}, which has more context to determine 
> whether it's time for a given timer to be deleted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[2/2] beam git commit: This closes #3840

2017-09-13 Thread staslevin
This closes #3840


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/50532f0a
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/50532f0a
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/50532f0a

Branch: refs/heads/master
Commit: 50532f0a92d7ce8dbbdc6c3179ab7b9efde6a746
Parents: 8d71ebf c3d4c5d
Author: Stas Levin 
Authored: Wed Sep 13 11:04:20 2017 +0300
Committer: Stas Levin 
Committed: Wed Sep 13 11:04:20 2017 +0300

--
 .../SparkGroupAlsoByWindowViaWindowSet.java | 82 +---
 .../spark/stateful/SparkTimerInternals.java | 15 
 2 files changed, 56 insertions(+), 41 deletions(-)
--




[1/2] beam git commit: [BEAM-2859] Fixed processing timers not being properly fired when watermark stays put by tweaking the way spark-runner was delivering timers to reduceFnRunner in SparkGroupAlsoB

2017-09-13 Thread staslevin
Repository: beam
Updated Branches:
  refs/heads/master 8d71ebf82 -> 50532f0a9


[BEAM-2859] Fixed processing timers not being properly fired when watermark 
stays put by tweaking the way spark-runner was delivering timers to 
reduceFnRunner in SparkGroupAlsoByWindowViaWindowSet


Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/c3d4c5d9
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/c3d4c5d9
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/c3d4c5d9

Branch: refs/heads/master
Commit: c3d4c5d98cc115dce7e03e64cd29713562ff62b3
Parents: 8d71ebf
Author: Stas Levin 
Authored: Tue Sep 12 10:34:45 2017 +0300
Committer: Stas Levin 
Committed: Wed Sep 13 11:04:08 2017 +0300

--
 .../SparkGroupAlsoByWindowViaWindowSet.java | 82 +---
 .../spark/stateful/SparkTimerInternals.java | 15 
 2 files changed, 56 insertions(+), 41 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/beam/blob/c3d4c5d9/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java
--
diff --git a/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java b/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java
index 2258f05..1fb8700 100644
--- a/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java
+++ b/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java
@@ -18,7 +18,9 @@
 package org.apache.beam.runners.spark.stateful;
 
 import com.google.common.base.Joiner;
+import com.google.common.base.Predicate;
 import com.google.common.collect.AbstractIterator;
+import com.google.common.collect.FluentIterable;
 import com.google.common.collect.Lists;
 import com.google.common.collect.Table;
 import java.io.Serializable;
@@ -51,6 +53,7 @@ import org.apache.beam.sdk.coders.IterableCoder;
 import org.apache.beam.sdk.coders.KvCoder;
 import org.apache.beam.sdk.coders.VarLongCoder;
 import org.apache.beam.sdk.metrics.MetricName;
+import org.apache.beam.sdk.state.TimeDomain;
 import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
 import org.apache.beam.sdk.transforms.windowing.PaneInfo;
 import org.apache.beam.sdk.util.WindowedValue;
@@ -204,6 +207,32 @@ public class SparkGroupAlsoByWindowViaWindowSet implements Serializable {
 this.droppedDueToLateness = droppedDueToLateness;
   }
 
+  /**
+   * Retrieves the timers that are eligible for processing by {@link
+   * org.apache.beam.runners.core.ReduceFnRunner}.
+   *
+   * @return A collection of timers that are eligible for processing. For a {@link
+   *     TimeDomain#EVENT_TIME} timer, this implies that the watermark has passed the timer's
+   *     timestamp. For other TimeDomains (e.g., {@link TimeDomain#PROCESSING_TIME}), a timer
+   *     is always considered eligible for processing (no restrictions).
+   */
+  private Collection<TimerInternals.TimerData> filterTimersEligibleForProcessing(
+      final Collection<TimerInternals.TimerData> timers, final Instant inputWatermark) {
+    final Predicate<TimerInternals.TimerData> eligibleForProcessing =
+        new Predicate<TimerInternals.TimerData>() {
+
+          @Override
+          public boolean apply(final TimerInternals.TimerData timer) {
+            return !timer.getDomain().equals(TimeDomain.EVENT_TIME)
+                || inputWatermark.isAfter(timer.getTimestamp());
+          }
+        };
+
+    return FluentIterable.from(timers).filter(eligibleForProcessing).toSet();
+  }
+
+
   @Override
   protected Tuple2>>*/ List>>
   computeNext() {
@@ -268,16 +297,14 @@ public class SparkGroupAlsoByWindowViaWindowSet implements Serializable {
 
   LOG.trace(logPrefix + ": input elements: {}", elements);
 
-  /*
-  Incoming expired windows are filtered based on
-  timerInternals.currentInputWatermarkTime() and the configured allowed
-  lateness. Note that this is done prior to calling
-  timerInternals.advanceWatermark so essentially the inputWatermark is
-  the highWatermark of the previous batch and the lowWatermark of the
-  current batch.
-  The highWatermark of the current batch will only affect filtering
-  as of the next batch.
-   */
+  // Incoming expired windows are filtered based on
+  // timerInternals.currentInputWatermarkTime() and the configured allowed
+  // lateness. Note that this is done prior to calling
+   

[GitHub] beam pull request #3840: [BEAM-2859] Fixed processing timers not being prope...

2017-09-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/3840


---


Jenkins build is back to normal : beam_PostCommit_Java_MavenInstall #4786

2017-09-13 Thread Apache Jenkins Server
See