[jira] [Commented] (BEAM-5514) BigQueryIO doesn't handle quotaExceeded errors properly

2018-10-09 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643919#comment-16643919
 ] 

Reuven Lax commented on BEAM-5514:
--

Is the problem simply that ApiErrorExtractor doesn't see quotaExceeded as a 
rate limit error? It appears that it currently looks for either 
rateLimitExceeded or userRateLimitExceeded.
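The check under discussion can be sketched as follows. The class and method
names below are illustrative stand-ins, not the real ApiErrorExtractor API
(which lives in com.google.cloud.hadoop.util):

```java
import java.util.Set;

// Stand-in for the ApiErrorExtractor rate-limit check discussed above.
public class RateLimitReasons {
    // Assumption: quotaExceeded is treated the same as the two reasons
    // the extractor already recognizes.
    private static final Set<String> RATE_LIMIT_REASONS =
        Set.of("rateLimitExceeded", "userRateLimitExceeded", "quotaExceeded");

    public static boolean isRateLimited(int httpCode, String reason) {
        // BigQuery reports streaming quota problems as a 403 with this reason.
        return httpCode == 403 && RATE_LIMIT_REASONS.contains(reason);
    }
}
```

If the extractor accepted quotaExceeded alongside the two existing reasons,
the streaming insert path would back off instead of retrying immediately.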

> BigQueryIO doesn't handle quotaExceeded errors properly
> ---
>
> Key: BEAM-5514
> URL: https://issues.apache.org/jira/browse/BEAM-5514
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Kevin Peterson
>Assignee: Reuven Lax
>Priority: Major
>
> When exceeding a streaming quota for BigQuery insertAll requests, BigQuery 
> returns a 403 with reason "quotaExceeded".
> The current implementation of BigQueryIO does not consider this to be a rate 
> limited exception, and therefore does not perform exponential backoff 
> properly, leading to repeated calls to BQ.
> The actual error is in the 
> [ApiErrorExtractor|https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/util/src/main/java/com/google/cloud/hadoop/util/ApiErrorExtractor.java#L263]
>  class, which is called from 
> [BigQueryServicesImpl|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServicesImpl.java#L739]
>  to determine how to retry the failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5426) Use both destination and TableDestination for BQ load job IDs

2018-09-18 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619830#comment-16619830
 ] 

Reuven Lax commented on BEAM-5426:
--

Two issues:
 # I'm not sure how to do this easily, as the destinations are sharded across 
all the workers.
 # We don't have a way of failing jobs from within the SDK. The best we can do 
is throw an exception, but that doesn't necessarily fail the job (for Dataflow 
streaming, it will simply result in an infinite exception loop and a stuck 
job).
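The job-id scheme proposed in the issue can be sketched roughly like this; the
helper name and the hashing choice are assumptions for illustration, not
Beam's actual BigQueryHelpers code:

```java
// Sketch only: fold BOTH the destination id and the TableDestination string
// into the load job id, so two destination ids that map to the same table
// cannot share (and thus silently skip) a load job.
public class LoadJobIds {
    public static String jobId(String prefix, String destinationId,
                               String tableDestination, int partition) {
        // Hashing keeps the id within BigQuery's job-id length/charset limits.
        int destHash = (destinationId + "\u0000" + tableDestination).hashCode();
        return prefix + "_" + Integer.toHexString(destHash) + "_" + partition;
    }
}
```

Because both identifiers feed the id, different destination ids that collapse
to one TableDestination now produce distinct, still-deterministic job ids.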

> Use both destination and TableDestination for BQ load job IDs
> -
>
> Key: BEAM-5426
> URL: https://issues.apache.org/jira/browse/BEAM-5426
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Chamikara Jayalath
>Priority: Major
>
> Currently we use TableDestination when creating a unique load job ID for a 
> destination: 
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryHelpers.java#L359]
>  
> This can result in a data loss issue if a user returns the same 
> TableDestination for different destination IDs. I think we can prevent this 
> if we include both IDs in the BQ load job ID.
>  
> CC: [~reuvenlax]





[jira] [Commented] (BEAM-5426) Use both destination and TableDestination for BQ load job IDs

2018-09-18 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619792#comment-16619792
 ] 

Reuven Lax commented on BEAM-5426:
--

If different destinations return the same TableDestination, worse things can 
happen. In that case, parallel loads to the same table might be issued from 
different workers (since we distribute work based on the destination), which 
can cause data corruption (e.g. if the write disposition is set to 
WRITE_TRUNCATE).

> Use both destination and TableDestination for BQ load job IDs
> -
>
> Key: BEAM-5426
> URL: https://issues.apache.org/jira/browse/BEAM-5426
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Chamikara Jayalath
>Priority: Major
>
> Currently we use TableDestination when creating a unique load job ID for a 
> destination: 
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryHelpers.java#L359]
>  
> This can result in a data loss issue if a user returns the same 
> TableDestination for different destination IDs. I think we can prevent this 
> if we include both IDs in the BQ load job ID.
>  
> CC: [~reuvenlax]





[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-30 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598020#comment-16598020
 ] 

Reuven Lax commented on BEAM-5036:
--

Great. We should also fix GCS to use rewrite instead of copy/rename (I
think GCS rewrite didn't exist back when this code was originally written),
though that should probably be in a separate PR.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E





[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-30 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597563#comment-16597563
 ] 

Reuven Lax commented on BEAM-5036:
--

Once you get to the rename step, the set of files to rename should be
deterministic. This isn't currently true for the Flink runner (it is for
Dataflow) because support for @RequiresStableInput is not fully implemented,
and without stable input to the rename step many things can go wrong.
The Flink implementation of stable input will block the rename step from
executing until the snapshot is finalized, which means that a rollback will
only roll back that far and not regenerate new output files. This does work
in the current Spark runner (I believe) by forcing an RDD checkpoint.

Of course, if the user manually reruns a pipeline this can still happen.


> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E





[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596950#comment-16596950
 ] 

Reuven Lax commented on BEAM-5036:
--

Actually I believe GCS _does_ support efficient object rename. Check out 
https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite
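The rewrite call is resumable: it may return before the copy finishes and
hand back a token to continue with. A sketch of the driving loop, written
against a stand-in interface rather than the real generated JSON-API client:

```java
// Stand-in for the GCS objects.rewrite call: a rewrite may complete in one
// call or require resuming with the returned rewriteToken.
public class GcsRewrite {
    public interface RewriteCall {
        Result rewrite(String src, String dst, String rewriteToken);
    }

    public static final class Result {
        public final boolean done;
        public final String rewriteToken;
        public Result(boolean done, String rewriteToken) {
            this.done = done;
            this.rewriteToken = rewriteToken;
        }
    }

    // Keep calling rewrite, threading the token through, until the service
    // reports the copy complete. Returns the number of calls made.
    public static int rewriteUntilDone(RewriteCall api, String src, String dst) {
        String token = null;
        int calls = 0;
        Result r;
        do {
            r = api.rewrite(src, dst, token);
            token = r.rewriteToken;
            calls++;
        } while (!r.done);
        return calls;
    }
}
```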

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E





[jira] [Closed] (BEAM-5027) Schemas do not work on Dataflow runner of FnApi Runner

2018-08-29 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax closed BEAM-5027.

   Resolution: Fixed
Fix Version/s: 2.7.0

> Schemas do not work on Dataflow runner of FnApi Runner
> --
>
> Key: BEAM-5027
> URL: https://issues.apache.org/jira/browse/BEAM-5027
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.7.0
>
>






[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596548#comment-16596548
 ] 

Reuven Lax commented on BEAM-5036:
--

[~sinisa_lyh] I don't think there is any semantically-correct way to write 
directly to final files (unless you're ok with incomplete or corrupted output), 
and I don't think Beam ever did that. If workers crash, etc. you'll end up with 
partially-written files. What's more, there's no guarantee that a retry will 
write the exact same data in the files.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E





[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596545#comment-16596545
 ] 

Reuven Lax commented on BEAM-5036:
--

As to ignoring errors: the one thing we need to make sure of is that the 
operation is idempotent. The bundle might fail at any point and get retried, 
and the retry should succeed if possible.

For filesystems that use copy/delete, this means that we should ignore 
file-already-exists errors. Otherwise each retry of the bundle will fail 
again, and the job will eventually fail (depending on the runner).

For filesystems such as HDFS (or local) for which atomic rename exists, this 
means we have to ignore failures where the _source_ file doesn't exist (we also 
have to do this with GCS/S3). I believe the code already attempts to do this 
with IGNORE_MISSING_FILES, though there are slight race conditions in that 
check today.
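The idempotency rule can be illustrated with plain java.nio.file; this is a
self-contained sketch, not Beam's actual FileSystems/IGNORE_MISSING_FILES
implementation:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Idempotent "move to output": a retried bundle must succeed even if an
// earlier attempt already completed part of the work.
public class IdempotentMove {
    public static void moveIgnoringRetryArtifacts(Path src, Path dst) {
        try {
            Files.move(src, dst);
        } catch (NoSuchFileException e) {
            // Source gone: assume an earlier attempt already moved it.
        } catch (FileAlreadyExistsException e) {
            // Destination present: assume an earlier attempt already wrote it.
            try {
                Files.deleteIfExists(src);
            } catch (IOException cleanup) {
                // Best-effort cleanup of the leftover source.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling the move twice (simulating a bundle retry) leaves exactly one output
file and no error, which is the behavior the comment above asks for.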

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E





[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-28 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594618#comment-16594618
 ] 

Reuven Lax commented on BEAM-5036:
--

I think we should create the directory if it doesn't exist.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E





[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-23 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590437#comment-16590437
 ] 

Reuven Lax commented on BEAM-5036:
--

As mentioned, Beam currently allows the temp directory to be different from 
the target directory. I believe that HDFS does support this, so we only have 
to handle the case where the target directory doesn't exist. I'm not sure 
whether all filesystems support this, though.

The API as provided allows users to put the temp files in a different 
filesystem than the target files (i.e. GCS vs. HDFS). However, in practice we 
don't support this, and we should probably warn or fail if the user attempts 
to do so.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E





[jira] [Commented] (BEAM-5126) PreCommit filtering broken based upon PR contents

2018-08-13 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579245#comment-16579245
 ] 

Reuven Lax commented on BEAM-5126:
--

Any update on this bug?

> PreCommit filtering broken based upon PR contents
> -
>
> Key: BEAM-5126
> URL: https://issues.apache.org/jira/browse/BEAM-5126
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Luke Cwik
>Assignee: Alan Myrvold
>Priority: Minor
>
> PR precommits used to be filtered by the contents of the PR.
>  
> Example PR that should have only spawned the Java PreCommit:
> https://github.com/apache/beam/pull/6159
>  
> This broke work done in BEAM-4445





[jira] [Commented] (BEAM-5146) GetterBasedSchemaProvider might create inconsistent views of the same schema

2018-08-13 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579227#comment-16579227
 ] 

Reuven Lax commented on BEAM-5146:
--

This caused a failure reported by a user on Stackoverflow: 
https://stackoverflow.com/questions/51815416/beam-sql-failing-while-running-with-dataflow-runner

> GetterBasedSchemaProvider might create inconsistent views of the same schema
> 
>
> Key: BEAM-5146
> URL: https://issues.apache.org/jira/browse/BEAM-5146
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
>
> schemaFor is called inside the toRowFunction, and schemaFor is not 
> deterministic (since Java reflection order is not deterministic). Since the 
> toRowFunction is invoked on each worker separately, this creates inconsistent 
> views of the schema.





[jira] [Created] (BEAM-5146) GetterBasedSchemaProvider might create inconsistent views of the same schema

2018-08-13 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-5146:


 Summary: GetterBasedSchemaProvider might create inconsistent views 
of the same schema
 Key: BEAM-5146
 URL: https://issues.apache.org/jira/browse/BEAM-5146
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


schemaFor is called inside the toRowFunction, and schemaFor is not 
deterministic (since Java reflection order is not deterministic). Since the 
toRowFunction is invoked on each worker separately, this creates inconsistent 
views of the schema.
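One conventional fix is to impose an explicit order on the reflected fields
before building the schema. A sketch under that assumption (the actual fix in
Beam may differ):

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// getDeclaredFields() has no documented order, so two workers can see the
// same POJO's fields in different orders. Sorting by field name yields the
// same schema everywhere.
public class DeterministicFields {
    public static List<String> fieldNamesInStableOrder(Class<?> clazz) {
        return Arrays.stream(clazz.getDeclaredFields())
            .map(Field::getName)
            .sorted()
            .collect(Collectors.toList());
    }

    // Example POJO with fields deliberately declared out of sorted order.
    static class Pojo { int zebra; String apple; long mango; }
}
```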





[jira] [Commented] (BEAM-5105) Move load job poll to finishBundle() method to better parallelize execution

2018-08-07 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572629#comment-16572629
 ] 

Reuven Lax commented on BEAM-5105:
--

Possible to do, but slightly tricky to implement for a couple of reasons. One 
is that the retry code for failed inserts will have to be triggered from 
finishBundle. The polling code also gets more complicated because you have to 
poll N jobs instead of one at a time. There will also be more bookkeeping 
needed to keep windows correct (because multiple windows can occur inside a 
single bundle).
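The shape of the proposed change might look like this; JobHandle and the
method names are simplified stand-ins, not the real WriteTables code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposal: start a load job per element in processElement and
// only block on completion in finishBundle, so N jobs wait concurrently
// instead of serially.
public class ParallelLoadJobs {
    public interface JobHandle {
        boolean isDone();
    }

    private final List<JobHandle> pending = new ArrayList<>();

    public void processElement(JobHandle startedJob) {
        pending.add(startedJob); // start the job, don't wait here
    }

    // Poll all outstanding jobs together; their scheduling overheads overlap.
    public void finishBundle() {
        while (pending.stream().anyMatch(j -> !j.isDone())) {
            Thread.onSpinWait(); // real code would sleep with backoff
        }
        pending.clear();
    }
}
```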

> Move load job poll to finishBundle() method to better parallelize execution
> ---
>
> Key: BEAM-5105
> URL: https://issues.apache.org/jira/browse/BEAM-5105
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Chamikara Jayalath
>Assignee: Reuven Lax
>Priority: Major
>
> It appears that when we write to BigQuery using WriteTablesDoFn we start a 
> load job and wait for that job to finish.
> [https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L318]
>  
> In cases where we are trying to write a PCollection of tables (for example, 
> when users use the dynamic destinations feature), this relies on dynamic 
> work rebalancing to parallelize execution of load jobs. If the runner does 
> not support dynamic work rebalancing, or does not execute it for some 
> reason, this could have significant performance drawbacks. For example, 
> scheduling times for load jobs will add up.
>  
> A better approach might be to start load jobs in the process() method but 
> wait for all load jobs to finish in the finishBundle() method. This will 
> parallelize any overheads as well as job execution (assuming more than one 
> job is scheduled by BQ).
>  





[jira] [Commented] (BEAM-5092) Nexmark 10x performance regression

2018-08-06 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570860#comment-16570860
 ] 

Reuven Lax commented on BEAM-5092:
--

Is there an easy way to manually run the benchmark to verify fixes?

> Nexmark 10x performance regression
> --
>
> Key: BEAM-5092
> URL: https://issues.apache.org/jira/browse/BEAM-5092
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Andrew Pilloud
>Assignee: Reuven Lax
>Priority: Critical
>
> There looks to be a 10x performance hit on the DirectRunner and Flink nexmark 
> jobs. It first showed up in this build:
> [https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_Nexmark_Direct/151/changes]
> [https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424]
> [https://apache-beam-testing.appspot.com/explore?dashboard=5699257587728384]





[jira] [Commented] (BEAM-5092) Nexmark 10x performance regression

2018-08-06 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570847#comment-16570847
 ] 

Reuven Lax commented on BEAM-5092:
--

Looking at the PR, there are only two things that could have affected Nexmark:

 1. I changed various longs to DateTime objects. This might be slower, but is 
a good change here.

 2. I added a schema to Nexmark types for SQL Nexmark, but it looks like that 
is taking priority over the hard-coded Nexmark Coder. This is easily fixed.

Also, SchemaCoder isn't expected to be slow, but on investigation it appears 
that SchemaCoder does not implement structuralValue. This means that Beam will 
encode elements every time it wants to check them for equality, which is quite 
expensive. This is also easy to fix.
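Why structuralValue matters: without it, equality checks fall back to encoding
both elements and comparing the bytes. A toy illustration with stand-in types,
not Beam's actual Coder API:

```java
import java.util.List;

public class StructuralValueSketch {
    public static class Row {
        final String name;
        final int id;
        public Row(String name, int id) { this.name = name; this.id = id; }
    }

    // Fallback behavior when structuralValue is absent: equality is decided
    // by encoding both elements and comparing bytes (expensive per check).
    public static byte[] encode(Row r) {
        return (r.name + "\u0000" + r.id).getBytes();
    }

    // The fix: return a cheap proxy object whose equals()/hashCode() agree
    // with encoded-bytes equality, so no encoding happens per comparison.
    public static Object structuralValue(Row r) {
        return List.of(r.name, r.id);
    }
}
```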

> Nexmark 10x performance regression
> --
>
> Key: BEAM-5092
> URL: https://issues.apache.org/jira/browse/BEAM-5092
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Andrew Pilloud
>Assignee: Reuven Lax
>Priority: Critical
>
> There looks to be a 10x performance hit on the DirectRunner and Flink nexmark 
> jobs. It first showed up in this build:
> [https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_Nexmark_Direct/151/changes]
> [https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424]
> [https://apache-beam-testing.appspot.com/explore?dashboard=5699257587728384]





[jira] [Commented] (BEAM-5092) Nexmark 10x performance regression

2018-08-06 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570826#comment-16570826
 ] 

Reuven Lax commented on BEAM-5092:
--

Do we know the range of PRs where this happened? There are a couple of 
possibilities here, some of them expected (I changed some things to be more 
strongly typed, which might result in some performance slowdown).

 

> Nexmark 10x performance regression
> --
>
> Key: BEAM-5092
> URL: https://issues.apache.org/jira/browse/BEAM-5092
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Andrew Pilloud
>Assignee: Reuven Lax
>Priority: Critical
>
> There looks to be a 10x performance hit on the DirectRunner and Flink nexmark 
> jobs. It first showed up in this build:
> [https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_Nexmark_Direct/151/changes]
> [https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424]
> [https://apache-beam-testing.appspot.com/explore?dashboard=5699257587728384]





[jira] [Created] (BEAM-5070) nexmark.sources.UnboundedEventSourceTest.resumeFromCheckpoint is flaky

2018-08-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-5070:


 Summary: 
nexmark.sources.UnboundedEventSourceTest.resumeFromCheckpoint is flaky
 Key: BEAM-5070
 URL: https://issues.apache.org/jira/browse/BEAM-5070
 Project: Beam
  Issue Type: Bug
  Components: test-failures
Affects Versions: 2.5.0
Reporter: Reuven Lax


This test fails fairly frequently.





[jira] [Resolved] (BEAM-4453) Provide automatic schema registration for POJOs

2018-08-01 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-4453.
--
   Resolution: Fixed
Fix Version/s: 2.6.0

> Provide automatic schema registration for POJOs
> ---
>
> Key: BEAM-4453
> URL: https://issues.apache.org/jira/browse/BEAM-4453
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.6.0
>
>  Time Spent: 11h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (BEAM-4452) Create a lazy row on top of a generic Getter interface

2018-08-01 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-4452.
--
   Resolution: Fixed
Fix Version/s: 2.7.0

> Create a lazy row on top of a generic Getter interface
> --
>
> Key: BEAM-4452
> URL: https://issues.apache.org/jira/browse/BEAM-4452
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.7.0
>
>
> This allows us to have a Row object that uses the underlying user object 
> (POJO, proto, Avro, etc.) as its intermediate storage. This will save a lot 
> of expensive conversions back and forth to Row on each ParDo.





[jira] [Resolved] (BEAM-4794) Move Nexmark and SQL to use the new Schema framework

2018-07-31 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-4794.
--
   Resolution: Fixed
Fix Version/s: 2.7.0

> Move Nexmark and SQL to use the new Schema framework
> 
>
> Key: BEAM-4794
> URL: https://issues.apache.org/jira/browse/BEAM-4794
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.7.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This will allow SQL to accept user types. It will also allow the deletion of 
> a lot of code.





[jira] [Created] (BEAM-5040) BigQueryIO retries infinitely in WriteTable and WriteRename

2018-07-27 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-5040:


 Summary: BigQueryIO retries infinitely in WriteTable and 
WriteRename
 Key: BEAM-5040
 URL: https://issues.apache.org/jira/browse/BEAM-5040
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.5.0
Reporter: Reuven Lax
Assignee: Reuven Lax


BigQueryIO retries infinitely in WriteTable and WriteRename

Several failure scenarios with the current code:
 # It's possible for a load job to return failure even though it actually 
succeeded (e.g. the reply might have timed out). In this case, BigQueryIO will 
retry the job which will fail again (because the job id has already been used), 
leading to indefinite retries. Correct behavior is to stop retrying as the load 
job has succeeded.
 # It's possible for a load job to be accepted by BigQuery, but then to fail on 
the BigQuery side. In this case a retry with the same job id will fail as that 
job id has already been used. BigQueryIO will sometimes detect this, but if the 
worker has restarted it will instead issue a load with the old job id and go 
into a retry loop. Correct behavior is to generate a new deterministic job id 
and retry using that new job id.
 # In many cases of worker restart, BigQueryIO ends up in infinite retry loops.
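The fix described in item 2 can be sketched as follows; the helper name and
the suffix format are assumptions for illustration, not the actual BigQueryIO
change:

```java
// Keep the job id deterministic (same inputs always produce the same id)
// while folding in a retry index, so a re-issued load after a BigQuery-side
// failure never collides with the dead job's already-used id.
public class RetryJobIds {
    public static String jobIdForAttempt(String baseJobId, int retryIndex) {
        // Attempt 0 keeps the original id so the happy path is unchanged.
        return retryIndex == 0 ? baseJobId : baseJobId + "_retry_" + retryIndex;
    }
}
```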





[jira] [Created] (BEAM-5027) Schemas do not work on Dataflow runner of FnApi Runner

2018-07-26 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-5027:


 Summary: Schemas do not work on Dataflow runner of FnApi Runner
 Key: BEAM-5027
 URL: https://issues.apache.org/jira/browse/BEAM-5027
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax








[jira] [Created] (BEAM-4822) Beam FileIO should support versioned file systems

2018-07-18 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4822:


 Summary: Beam FileIO should support versioned file systems
 Key: BEAM-4822
 URL: https://issues.apache.org/jira/browse/BEAM-4822
 Project: Beam
  Issue Type: Bug
  Components: io-java-files
Reporter: Reuven Lax
Assignee: Chamikara Jayalath


Some file systems (e.g. GCS) are versioned, and support reading previous 
generations of files. Since Beam's file abstraction does not currently expose 
this concept, the latest version of a file is always the one returned. Users 
should be able to specify in FileIO that they want to read a previous version 
of a file.
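One possible user-facing shape would be to accept a generation suffix on the
file spec; this syntax is purely hypothetical, sketched for illustration, and
is not an existing Beam feature:

```java
// Hypothetical parser for "gs://bucket/object#generation": split off a
// numeric generation so FileIO could ask GCS for that specific version.
public class VersionedFileSpec {
    public final String path;
    public final Long generation; // null means "latest"

    VersionedFileSpec(String path, Long generation) {
        this.path = path;
        this.generation = generation;
    }

    public static VersionedFileSpec parse(String spec) {
        int hash = spec.lastIndexOf('#');
        if (hash < 0) {
            return new VersionedFileSpec(spec, null);
        }
        return new VersionedFileSpec(
            spec.substring(0, hash), Long.parseLong(spec.substring(hash + 1)));
    }
}
```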





[jira] [Created] (BEAM-4794) Move Nexmark and SQL to use the new Schema framework

2018-07-14 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4794:


 Summary: Move Nexmark and SQL to use the new Schema framework
 Key: BEAM-4794
 URL: https://issues.apache.org/jira/browse/BEAM-4794
 Project: Beam
  Issue Type: Sub-task
  Components: dsl-sql
Reporter: Reuven Lax
Assignee: Reuven Lax


This will allow SQL to accept user types. It will also allow the deletion of a 
lot of code.





[jira] [Created] (BEAM-4793) Enable schemas for all runners

2018-07-14 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4793:


 Summary: Enable schemas for all runners
 Key: BEAM-4793
 URL: https://issues.apache.org/jira/browse/BEAM-4793
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


Currently schemas are only enabled in the direct runner.





[jira] [Resolved] (BEAM-4451) SchemaRegistry should support a ServiceLoader interface

2018-07-02 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-4451.
--
   Resolution: Fixed
Fix Version/s: 2.6.0

> SchemaRegistry should support a ServiceLoader interface
> ---
>
> Key: BEAM-4451
> URL: https://issues.apache.org/jira/browse/BEAM-4451
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This will allow JARs to register schemas only when they are linked in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4698) SchemaTest not sickbayed for Samza runner

2018-06-30 Thread Reuven Lax (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528907#comment-16528907
 ] 

Reuven Lax commented on BEAM-4698:
--

https://github.com/apache/beam/pull/5848

On Sat, Jun 30, 2018 at 2:33 PM Kenneth Knowles (JIRA) 



> SchemaTest not sickbayed for Samza runner
> -
>
> Key: BEAM-4698
> URL: https://issues.apache.org/jira/browse/BEAM-4698
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-samza
>Reporter: Kenneth Knowles
>Assignee: Reuven Lax
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza_Gradle/lastCompletedBuild/
> This failed the SchemaTest, which we know all runners will fail since it uses 
> a different entry point for {{SimpleDoFnRunner}}. It should be excluded from 
> their test suites.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4697) Build broken

2018-06-30 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4697:


 Summary: Build broken
 Key: BEAM-4697
 URL: https://issues.apache.org/jira/browse/BEAM-4697
 Project: Beam
  Issue Type: Bug
  Components: runner-core
Reporter: Reuven Lax
Assignee: Kenneth Knowles


Caused by a race between two separate PRs being merged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4451) SchemaRegistry should support a ServiceLoader interface

2018-06-29 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-4451:
-
Summary: SchemaRegistry should support a ServiceLoader interface  (was: 
SchemaRegistry should support a ServiceLoader interfacen)

> SchemaRegistry should support a ServiceLoader interface
> ---
>
> Key: BEAM-4451
> URL: https://issues.apache.org/jira/browse/BEAM-4451
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This will allow JARs to register schemas only when they are linked in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4613) Improve performance of SchemaCoder

2018-06-21 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4613:


 Summary: Improve performance of SchemaCoder
 Key: BEAM-4613
 URL: https://issues.apache.org/jira/browse/BEAM-4613
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Kenneth Knowles






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3979) New DoFn should allow injecting of all parameters in ProcessContext

2018-06-05 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-3979:
-
Fix Version/s: (was: 2.6.0)
   2.50

> New DoFn should allow injecting of all parameters in ProcessContext
> ---
>
> Key: BEAM-3979
> URL: https://issues.apache.org/jira/browse/BEAM-3979
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.4.0
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> This was intended in the past, but never completed. Ideally all primitive 
> parameters in ProcessContext should be injectable, and OutputReceiver 
> parameters can be used to collect output. So, we should be able to write a 
> DoFn as follows
> @ProcessElement
> public void process(@Element String word, OutputReceiver receiver) {
>   receiver.output(word.toUpperCase());
> }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
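The parameter-injection idea described in BEAM-3979 can be sketched with plain JDK reflection. This is an illustrative mechanism only: the `Element` annotation and `Receiver` interface below are local stand-ins defined for the sketch, not Beam's actual `@Element` and `OutputReceiver` types, and the real DoFn runner is considerably more involved.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;

public class ParamInjectionSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.PARAMETER)
    @interface Element {}

    // Stand-in for Beam's OutputReceiver.
    interface Receiver { void output(String value); }

    static class UpperFn {
        public void process(@Element String word, Receiver receiver) {
            receiver.output(word.toUpperCase());
        }
    }

    // Inspect the process method's signature and fill each argument slot by
    // its annotation or type, the way a runner might wire up injection.
    static String invoke(Object fn, String element) {
        try {
            StringBuilder out = new StringBuilder();
            Receiver receiver = out::append;
            Method m = fn.getClass().getMethod("process", String.class, Receiver.class);
            Parameter[] params = m.getParameters();
            Object[] args = new Object[params.length];
            for (int i = 0; i < params.length; i++) {
                if (params[i].isAnnotationPresent(Element.class)) {
                    args[i] = element;           // inject the current element
                } else if (Receiver.class.isAssignableFrom(params[i].getType())) {
                    args[i] = receiver;          // inject the output receiver
                }
            }
            m.invoke(fn, args);
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke(new UpperFn(), "word"));  // prints WORD
    }
}
```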


[jira] [Resolved] (BEAM-3979) New DoFn should allow injecting of all parameters in ProcessContext

2018-06-05 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-3979.
--
Resolution: Fixed

> New DoFn should allow injecting of all parameters in ProcessContext
> ---
>
> Key: BEAM-3979
> URL: https://issues.apache.org/jira/browse/BEAM-3979
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.4.0
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> This was intended in the past, but never completed. Ideally all primitive 
> parameters in ProcessContext should be injectable, and OutputReceiver 
> parameters can be used to collect output. So, we should be able to write a 
> DoFn as follows
> @ProcessElement
> public void process(@Element String word, OutputReceiver receiver) {
>   receiver.output(word.toUpperCase());
> }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3979) New DoFn should allow injecting of all parameters in ProcessContext

2018-06-05 Thread Reuven Lax (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-3979:
-
Fix Version/s: (was: 2.50)
   2.5.0

> New DoFn should allow injecting of all parameters in ProcessContext
> ---
>
> Key: BEAM-3979
> URL: https://issues.apache.org/jira/browse/BEAM-3979
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.4.0
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> This was intended in the past, but never completed. Ideally all primitive 
> parameters in ProcessContext should be injectable, and OutputReceiver 
> parameters can be used to collect output. So, we should be able to write a 
> DoFn as follows
> @ProcessElement
> public void process(@Element String word, OutputReceiver receiver) {
>   receiver.output(word.toUpperCase());
> }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4461) Create a library of useful transforms that use schemas

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4461:


 Summary: Create a library of useful transforms that use schemas
 Key: BEAM-4461
 URL: https://issues.apache.org/jira/browse/BEAM-4461
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


e.g. JoinBy(fields). Project, Filter, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4460) Investigate other encoding mechanism for SchemaCoder

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4460:


 Summary: Investigate other encoding mechanism for SchemaCoder
 Key: BEAM-4460
 URL: https://issues.apache.org/jira/browse/BEAM-4460
 Project: Beam
  Issue Type: Sub-task
  Components: io-java-gcp
Reporter: Reuven Lax
Assignee: Reuven Lax


Some possibilities: FlatBuffer or Arrow. If we made the FnApi aware of schemas, 
the FnApi might be able to use Arrow to encode bundles sent over the wire, and 
we could dispense with having a "traditional" coder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4459) Schemas across pipeline modifications

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4459:


 Summary: Schemas across pipeline modifications
 Key: BEAM-4459
 URL: https://issues.apache.org/jira/browse/BEAM-4459
 Project: Beam
  Issue Type: Sub-task
  Components: io-java-gcp
Reporter: Reuven Lax
Assignee: Reuven Lax


As per the snapshot/update proposal, we want to be able to update pipelines 
without damaging the in-flight state. Since schema fields might get reordered 
on update, we must ensure that the old mappings are preserved. This will 
require us to have two ids - the logical ids the user interfaces with (which 
might change), and the physical index where we store the schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
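The two-id scheme described in BEAM-4459 can be sketched as a mapping from stable logical field names to the physical slots where encoded state already lives: on update, existing fields keep their slots even if reordered, and new fields get fresh slots. The class and method names here are hypothetical, not Beam's actual schema classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaIdMapping {
    // Preserve old physical slots for surviving fields; assign the next free
    // slot to each field that is new in the updated schema.
    static Map<String, Integer> updateMapping(Map<String, Integer> old, List<String> newFields) {
        Map<String, Integer> updated = new LinkedHashMap<>();
        int nextSlot = old.values().stream().mapToInt(Integer::intValue).max().orElse(-1) + 1;
        for (String field : newFields) {
            Integer slot = old.get(field);
            updated.put(field, slot != null ? slot : nextSlot++);
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Integer> old = new LinkedHashMap<>();
        old.put("user", 0);
        old.put("count", 1);
        // Pipeline updated: fields reordered in the logical schema, one added.
        Map<String, Integer> updated =
            updateMapping(old, Arrays.asList("count", "user", "region"));
        System.out.println(updated);  // {count=1, user=0, region=2}
    }
}
```

In-flight state keyed by physical slot thus remains readable after the update, whatever the new logical order is.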


[jira] [Created] (BEAM-4458) Support unknown fields in Rows

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4458:


 Summary: Support unknown fields in Rows
 Key: BEAM-4458
 URL: https://issues.apache.org/jira/browse/BEAM-4458
 Project: Beam
  Issue Type: Sub-task
  Components: io-java-gcp
Reporter: Reuven Lax
Assignee: Reuven Lax


As input data often evolves unexpectedly, Row needs a way of handling unknown 
input fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
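One way to handle the unknown-fields requirement of BEAM-4458 is to retain out-of-schema fields in a side map instead of dropping them, so they survive a round trip through the pipeline. This is a minimal sketch under that assumption; the class is hypothetical and not Beam's actual Row.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class RowWithUnknownFields {
    private final Map<String, Object> known = new LinkedHashMap<>();
    private final Map<String, Object> unknown = new LinkedHashMap<>();
    private final Set<String> schema;

    RowWithUnknownFields(Set<String> schema) { this.schema = schema; }

    // Fields outside the declared schema are kept rather than discarded.
    void set(String field, Object value) {
        (schema.contains(field) ? known : unknown).put(field, value);
    }

    Map<String, Object> knownFields() { return known; }
    Map<String, Object> unknownFields() { return unknown; }

    public static void main(String[] args) {
        RowWithUnknownFields row =
            new RowWithUnknownFields(new HashSet<>(Arrays.asList("id", "name")));
        row.set("id", 7);
        row.set("newField", "surprise");  // not in the schema yet
        System.out.println(row.knownFields());    // {id=7}
        System.out.println(row.unknownFields());  // {newField=surprise}
    }
}
```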


[jira] [Created] (BEAM-4457) Analyze FieldAccessDescriptors and drop fields that are never accessed

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4457:


 Summary: Analyze FieldAccessDescriptors and drop fields that are 
never accessed
 Key: BEAM-4457
 URL: https://issues.apache.org/jira/browse/BEAM-4457
 Project: Beam
  Issue Type: Sub-task
  Components: io-java-gcp
Reporter: Reuven Lax
Assignee: Reuven Lax


We can walk backwards through the graph, analyzing which fields are accessed. 
When we find paths where many fields are never accessed, we can insert a 
projection transform to drop those fields preemptively. This can save a lot of 
resources in the case where many fields in the input are never accessed.

To do this, the FieldAccessDescriptor information must be added to the 
portability protos. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
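The analysis described in BEAM-4457 amounts to taking the union of fields accessed by downstream stages and projecting each row down to just that set. The sketch below illustrates only that core idea with plain maps; the real feature would operate on the pipeline graph and FieldAccessDescriptors, and all names here are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ProjectionPushdown {
    // Union of the fields each downstream stage actually reads.
    static Set<String> accessedFields(List<Set<String>> perStageAccess) {
        return perStageAccess.stream()
            .flatMap(Set::stream)
            .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // The projection transform: keep only the fields anyone needs.
    static Map<String, Object> project(Map<String, Object> row, Set<String> keep) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String f : keep) {
            if (row.containsKey(f)) out.put(f, row.get(f));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "a");
        row.put("payload", "big-blob");  // never accessed downstream
        Set<String> needed = accessedFields(Arrays.asList(
            new LinkedHashSet<>(Arrays.asList("id")),
            new LinkedHashSet<>(Arrays.asList("id", "name"))));
        System.out.println(project(row, needed));  // {id=1, name=a}
    }
}
```

Dropping `payload` before the shuffle is where the resource savings come from when wide inputs are mostly unread.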


[jira] [Created] (BEAM-4456) Provide automatic schema registration for BigQuery TableRows

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4456:


 Summary: Provide automatic schema registration for BigQuery 
TableRows
 Key: BEAM-4456
 URL: https://issues.apache.org/jira/browse/BEAM-4456
 Project: Beam
  Issue Type: Sub-task
  Components: io-java-gcp
Reporter: Reuven Lax
Assignee: Reuven Lax


This should be part of the GCP module



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4454) Provide automatic schema registration for AVROs

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4454:


 Summary: Provide automatic schema registration for AVROs
 Key: BEAM-4454
 URL: https://issues.apache.org/jira/browse/BEAM-4454
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


Need to make sure this is a compatible change



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4455) Provide automatic schema registration for Protos

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4455:


 Summary: Provide automatic schema registration for Protos
 Key: BEAM-4455
 URL: https://issues.apache.org/jira/browse/BEAM-4455
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


Need to make sure this is a compatible change



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4453) Provide automatic schema registration for POJOs

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4453:


 Summary: Provide automatic schema registration for POJOs
 Key: BEAM-4453
 URL: https://issues.apache.org/jira/browse/BEAM-4453
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4452) Create a lazy row on top of a generic Getter interface

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4452:


 Summary: Create a lazy row on top of a generic Getter interface
 Key: BEAM-4452
 URL: https://issues.apache.org/jira/browse/BEAM-4452
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


This allows us to have a Row object that uses the underlying user object (POJO, 
proto, Avro, etc.) as its intermediate storage. This will save a lot of 
expensive conversions back and forth to Row on each ParDo.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4451) SchemaRegistry should support a ServiceLoader interfacen

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4451:


 Summary: SchemaRegistry should support a ServiceLoader interfacen
 Key: BEAM-4451
 URL: https://issues.apache.org/jira/browse/BEAM-4451
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


This will allow JARs to register schemas only when they are linked in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4450) Validate that OutputReceiver is only allowed if the output PCollection has a schema

2018-06-03 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-4450:


 Summary: Validate that OutputReceiver is only allowed if the 
output PCollection has a schema
 Key: BEAM-4450
 URL: https://issues.apache.org/jira/browse/BEAM-4450
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Reuven Lax
Assignee: Reuven Lax


We don't do this today as coder/schema inference may not be complete at the 
point where we validate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

2018-04-17 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441270#comment-16441270
 ] 

Reuven Lax commented on BEAM-3268:
--

[~jkff] I don't understand your comment. Outputs are not sent to receivers 
until the ParDo has finished processing data, which is after the file copy.

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---
>
> Key: BEAM-3268
> URL: https://issues.apache.org/jira/browse/BEAM-3268
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
>Reporter: Kamil Szewczyk
>Assignee: Eugene Kirpichov
>Priority: Major
> Attachments: comparison.png
>
>
> While running filebased-io-test we found the Dataflow runner misbehaving. We run 
> tests using a single pipeline, and without Reshuffle between the writing and 
> reading steps the Dataflow jobs fail because the runner tries to access files 
> that have not been created yet. 
> The picture shows the difference in execution of the write step: on the left is 
> a working example with Reshuffle added, and on the right the same pipeline 
> without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name with your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look at the Cloud Console; the job should fail.
> Now add Reshuffle to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.create())
> .apply(Reshuffle.viaRandomKey());
> PCollection consolidatedHashcode = testFilenames
> {code}
> and re-run the previously used Maven command to see it working in the 
> console.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4083) TypeName should be a proper algebraic type, and probably just called BeamType

2018-04-16 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440365#comment-16440365
 ] 

Reuven Lax commented on BEAM-4083:
--

TypeName is not itself a type. FieldType is the actual "type" - TypeName simply 
segregates classes of types.

> TypeName should be a proper algebraic type, and probably just called BeamType
> -
>
> Key: BEAM-4083
> URL: https://issues.apache.org/jira/browse/BEAM-4083
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Priority: Major
>
> TypeName mixes atomic types and type constructors. Or, equivalently, it does 
> not distinguish by arity the type constructors. It would be best to make this 
> an ADT.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4081) Review of schema metadata vs schema types

2018-04-16 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440363#comment-16440363
 ] 

Reuven Lax commented on BEAM-4081:
--

Agree, however I would recommend being conservative about what we promote to be 
a basic type, and err on keeping the Beam-level type set small. e.g. right now 
I see no good reason for Beam to distinguish between CHAR and VARCHAR, even 
though SQL does need to make this distinction (though maybe we will find a good 
reason for Beam to care).

> Review of schema metadata vs schema types
> -
>
> Key: BEAM-4081
> URL: https://issues.apache.org/jira/browse/BEAM-4081
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Priority: Major
>
> The Schema basic types have a place for metadata that can say "this int is 
> really millis since epoch". This deserves some careful design review and 
> perhaps some of these need to be promoted to basic types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4080) Consider Schema.join to automatically produce a correct joined schema

2018-04-16 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440362#comment-16440362
 ] 

Reuven Lax commented on BEAM-4080:
--

Some subtleties here: What if both schemas contain a field with the same name? 
In SQL this is legal, and the field can be accessed by extra scoping.

> Consider Schema.join to automatically produce a correct joined schema
> -
>
> Key: BEAM-4080
> URL: https://issues.apache.org/jira/browse/BEAM-4080
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4077) Refactor builder field nullability

2018-04-16 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440359#comment-16440359
 ] 

Reuven Lax commented on BEAM-4077:
--

Every type can be nullable, so this would mean doubling the number of builder 
methods. There already is a builder method that takes in a Field object 
(addField) which technically provides full generality. The others were added 
simply because I found that they increased ease of use of the builder.

> Refactor builder field nullability
> --
>
> Key: BEAM-4077
> URL: https://issues.apache.org/jira/browse/BEAM-4077
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Priority: Major
>
> Currently the Schema builder methods take a boolean for nullability. It would 
> be more standard to have separate builder methods. At this point the builder 
> might as well just take the Field spec since it does not add concision.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-3979) New DoFn should allow injecting of all parameters in ProcessContext

2018-04-01 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-3979:


 Summary: New DoFn should allow injecting of all parameters in 
ProcessContext
 Key: BEAM-3979
 URL: https://issues.apache.org/jira/browse/BEAM-3979
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Affects Versions: 2.4.0
Reporter: Reuven Lax
Assignee: Reuven Lax
 Fix For: 2.5.0


This was intended in the past, but never completed. Ideally all primitive 
parameters in ProcessContext should be injectable, and OutputReceiver 
parameters can be used to collect output. So, we should be able to write a 
DoFn as follows:

@ProcessElement
public void process(@Element String word, OutputReceiver receiver) {
  receiver.output(word.toUpperCase());
}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3772) BigQueryIO - Can't use DynamicDestination with CREATE_IF_NEEDED for unbounded PCollection and FILE_LOADS

2018-03-19 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405089#comment-16405089
 ] 

Reuven Lax commented on BEAM-3772:
--

What version of the SDK are you using?

> BigQueryIO - Can't use DynamicDestination with CREATE_IF_NEEDED for unbounded 
> PCollection and FILE_LOADS
> 
>
> Key: BEAM-3772
> URL: https://issues.apache.org/jira/browse/BEAM-3772
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.2.0, 2.3.0
> Environment: Dataflow streaming pipeline
>Reporter: Benjamin BENOIST
>Assignee: Eugene Kirpichov
>Priority: Major
>
> My workflow : KAFKA -> Dataflow streaming -> BigQuery
> Given that low latency isn't important in my case, I use FILE_LOADS to 
> reduce costs. I'm using _BigQueryIO.Write_ with a _DynamicDestination_, 
> which is a table with the current hour as a suffix.
> This _BigQueryIO.Write_ is configured like this :
> {code:java}
> .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
> .withMethod(Method.FILE_LOADS)
> .withTriggeringFrequency(triggeringFrequency)
> .withNumFileShards(100)
> {code}
> The first table is successfully created and is written to. But then the 
> following tables are never created and I get these exceptions:
> {code:java}
> (99e5cd8c66414e7a): java.lang.RuntimeException: Failed to create load job 
> with id prefix 
> 5047f71312a94bf3a42ee5d67feede75_5295fbf25e1a7534f85e25dcaa9f4986_1_00023,
>  reached max retries: 3, last failed load job: {
>   "configuration" : {
> "load" : {
>   "createDisposition" : "CREATE_NEVER",
>   "destinationTable" : {
> "datasetId" : "dev_mydataset",
> "projectId" : "myproject-id",
> "tableId" : "mytable_20180302_16"
>   },
> {code}
> The _CreateDisposition_ used is _CREATE_NEVER_, contrary to the 
> _CREATE_IF_NEEDED_ that was specified.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
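A DynamicDestinations setup like the one described (a table with the current hour as a suffix) ultimately reduces to deriving a per-hour table name from an element timestamp. A JDK-only sketch of that naming function, with hypothetical class and method names, assuming UTC hourly partitioning:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourlyTableSuffix {
    // Format matching the table id seen in the failed load job,
    // e.g. mytable_20180302_16. A zone is required to format an Instant.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyyMMdd_HH").withZone(ZoneOffset.UTC);

    // Derive the per-hour destination table name from an event timestamp.
    static String tableFor(String baseTable, Instant eventTime) {
        return baseTable + "_" + FMT.format(eventTime);
    }

    public static void main(String[] args) {
        System.out.println(
            tableFor("mytable", Instant.parse("2018-03-02T16:45:00Z")));
        // prints mytable_20180302_16
    }
}
```

In a real pipeline this function would live inside a `DynamicDestinations` subclass's destination callback, with the create disposition handled by the sink.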


[jira] [Commented] (BEAM-3409) Unexpected behavior of DoFn teardown method running in unit tests

2018-02-27 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379317#comment-16379317
 ] 

Reuven Lax commented on BEAM-3409:
--

So teardown is being called; it's just that waitUntilFinish isn't properly 
waiting for tearDown in the direct runner. Is that correct?

> Unexpected behavior of DoFn teardown method running in unit tests 
> --
>
> Key: BEAM-3409
> URL: https://issues.apache.org/jira/browse/BEAM-3409
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.3.0
>Reporter: Alexey Romanenko
>Assignee: Romain Manni-Bucau
>Priority: Blocker
>  Labels: test
> Fix For: 2.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Writing a unit test, I found out a strange behaviour of the Teardown method of 
> a DoFn implementation when I run this method in unit tests using TestPipeline.
> To be more precise, it doesn't wait until the teardown() method has finished; 
> it just exits from this method after about 1 sec (on my machine) even if it 
> should take longer (a very simple example: running an infinite loop inside this 
> method or putting the thread to sleep). At the same time, when I run the same 
> code from main() with an ordinary Pipeline and the direct runner, it works as 
> expected: the teardown() method runs to completion regardless of how much 
> time it takes.
> I created two test cases to reproduce this issue: the first one runs with 
> main() and the second one runs with JUnit. They use the same implementation of 
> DoFn (class LongTearDownFn) and expect that the teardown method runs for at 
> least SLEEP_TIME ms. When run as a JUnit test this is not the case (see the 
> output log).
> - run with main()
> https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/main/java/TearDown.java
> - run with junit
> https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/test/java/TearDownTest.java



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-3582) BeamRecord should be called BeamRow instead

2018-01-31 Thread Reuven Lax (JIRA)
Reuven Lax created BEAM-3582:


 Summary: BeamRecord should be called BeamRow instead
 Key: BEAM-3582
 URL: https://issues.apache.org/jira/browse/BEAM-3582
 Project: Beam
  Issue Type: Bug
  Components: dsl-sql
Reporter: Reuven Lax
Assignee: Reuven Lax


All elements in Beam are referred to as "records," so the current class name is 
confusing. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3503) PubsubIO - DynamicDestinations

2018-01-19 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333167#comment-16333167
 ] 

Reuven Lax commented on BEAM-3503:
--

If you're using the Dataflow streaming runner, then a Beam-only fix is 
insufficient as Dataflow uses a separate implementation of PubSubIO in the 
runner. You'll also need to file a bug with Google to support this feature in 
the Dataflow runner.

> PubsubIO - DynamicDestinations
> --
>
> Key: BEAM-3503
> URL: https://issues.apache.org/jira/browse/BEAM-3503
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions, sdk-java-gcp
>Affects Versions: 2.2.0
>Reporter: Nalseez Duke
>Assignee: Reuven Lax
>Priority: Minor
>
> PubsubIO does not support the "DynamicDestinations" notion that is currently 
> implemented for File-based I/O and BigQueryIO.
> It would be nice if PubsubIO could also support this functionality - the 
> ability to write to a Pub/Sub topic dynamically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3503) PubsubIO - DynamicDestinations

2018-01-19 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333024#comment-16333024
 ] 

Reuven Lax commented on BEAM-3503:
--

which runner are you interested in?

> PubsubIO - DynamicDestinations
> --
>
> Key: BEAM-3503
> URL: https://issues.apache.org/jira/browse/BEAM-3503
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions, sdk-java-gcp
>Affects Versions: 2.2.0
>Reporter: Nalseez Duke
>Assignee: Reuven Lax
>Priority: Minor
>
> PubsubIO does not support the "DynamicDestinations" notion that is currently 
> implemented for File-based I/O and BigQueryIO.
> It would be nice if PubsubIO could also support this functionality - the 
> ability to write to a Pub/Sub topic dynamically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-3276) Pydocs for 2.2.0 are down

2017-12-03 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-3276.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> Pydocs for 2.2.0 are down
> -
>
> Key: BEAM-3276
> URL: https://issues.apache.org/jira/browse/BEAM-3276
> Project: Beam
>  Issue Type: Bug
>  Components: website
>Reporter: Daniel Ho
>Assignee: Reuven Lax
> Fix For: 2.2.0
>
>
> The link in the main website 
> https://beam.apache.org/documentation/sdks/python/ 
> points to
> https://beam.apache.org/documentation/sdks/pydoc/2.2.0/
> which is 
> Not Found
> The requested URL /documentation/sdks/pydoc/2.2.0/ was not found on this 
> server.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2957) Fix flaky ElasticsearchIOTest.testSplit in beam-sdks-java-io-elasticsearch-tests-5

2017-12-02 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2957:
-
Fix Version/s: (was: 2.2.0)

> Fix flaky ElasticsearchIOTest.testSplit in 
> beam-sdks-java-io-elasticsearch-tests-5
> --
>
> Key: BEAM-2957
> URL: https://issues.apache.org/jira/browse/BEAM-2957
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Etienne Chauchot
>Assignee: Etienne Chauchot
>
> ElasticsearchIOTest.testSplit is flaky on number of empty splits.
> May be due to randomization in ES 5 test framework and empty ES slices.
> {code}
> [ERROR] testSplit(org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest)  
> Time elapsed: 4.976 s  <<< FAILURE!
> java.lang.AssertionError: Wrong number of empty splits expected:<74> but 
> was:<72>
>   at 
> __randomizedtesting.SeedInfo.seed([EA22384A083D1EE4:89750F250AB441D4]:0)
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testSplit(ElasticsearchIOTest.java:182)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1713)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:907)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:943)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:957)
>   at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
>   at org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:327)
>   at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
>   at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
>   at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
>   at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
>   at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
>   at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
>   at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
>   at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:916)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:802)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:852)
>   at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:863)
>   at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
>   at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
>   at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>   at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>   at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
>   at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
>   at 

[jira] [Updated] (BEAM-2870) BQ Partitioned Table Write Fails When Destination has Partition Decorator

2017-12-02 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2870:
-
Affects Version/s: (was: 2.2.0)
   2.3.0
Fix Version/s: (was: 2.2.0)

> BQ Partitioned Table Write Fails When Destination has Partition Decorator
> -
>
> Key: BEAM-2870
> URL: https://issues.apache.org/jira/browse/BEAM-2870
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
> Environment: Dataflow Runner, Streaming, 10 x (n1-highmem-8 & 500 GB 
> SSD)
>Reporter: Steven Jon Anderson
>Assignee: Reuven Lax
>  Labels: bigquery, dataflow, google, google-cloud-bigquery, 
> google-dataflow
> Fix For: 2.3.0
>
>
> Dataflow Job ID: 
> https://console.cloud.google.com/dataflow/job/2017-09-08_23_03_14-14637186041605198816
> Tagging [~reuvenlax] as I believe he built the time partitioning integration 
> that was merged into master.
> *Background*
> Our production pipeline ingests millions of events per day and routes events 
> into our clients' numerous tables. To keep costs down, all of our tables are 
> partitioned. However, this requires that we create the tables before we allow 
> events to process as creating partitioned tables isn't supported in 2.1.0. 
> We've been looking forward to [~reuvenlax]'s partition table write feature 
> ([#3663|https://github.com/apache/beam/pull/3663]) to get merged into master 
> for some time now as it'll allow us to launch our client platforms much, much 
> faster. Today we got around to testing the 2.2.0 nightly and discovered this 
> bug.
> *Issue*
> Our pipeline writes to tables with partition decorators. Writing to an 
> existing partitioned table with a decorator succeeds. Writing to a 
> partitioned table that doesn't exist, without a decorator, also succeeds. 
> *However, writing to a partitioned table that doesn't exist, using a 
> decorator, fails*. 
> *Example Implementation*
> {code:java}
> BigQueryIO.writeTableRows()
>   .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
>   .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
>   .withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
>   .to(new DynamicDestinations<TableRow, String>() {
> @Override
> public String getDestination(ValueInSingleWindow<TableRow> element) {
>   return "PROJECT_ID:DATASET_ID.TABLE_ID$20170902";
> }
> @Override
> public TableDestination getTable(String destination) {
>   TimePartitioning DAY_PARTITION = new TimePartitioning().setType("DAY");
>   return new TableDestination(destination, null, DAY_PARTITION);
> }
> @Override
> public TableSchema getSchema(String destination) {
>   return TABLE_SCHEMA;
> }
>   })
> {code}
> *Relevant Logs & Errors in StackDriver*
> {code:none}
> 23:06:26.790 
> Trying to create BigQuery table: PROJECT_ID:DATASET_ID.TABLE_ID$20170902
> 23:06:26.873 
> Invalid table ID \"TABLE_ID$20170902\". Table IDs must be alphanumeric (plus 
> underscores) and must be at most 1024 characters long. Also, Table decorators 
> cannot be used.
> {code}
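Until creation of decorated destinations is supported, the failure mode above suggests a workaround: strip the partition decorator before the table-creation call, since decorators are only valid when writing data, not in {{tables.insert}}. A minimal sketch of that idea (the class and helper names are hypothetical, not Beam or BigQuery API):

```java
public class PartitionDecoratorSketch {

  // BigQuery rejects ids like "TABLE_ID$20170902" in tables.insert, so a
  // create path must target the base table id without the "$..." suffix.
  public static String stripPartitionDecorator(String tableId) {
    int dollar = tableId.indexOf('$');
    return dollar < 0 ? tableId : tableId.substring(0, dollar);
  }
}
```

The decorated id would then still be used for the actual write, while the undecorated id is used only when creating the (DAY-partitioned) table.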



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-3243) multiple anonymous DoFn lead to conflicting names

2017-11-26 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-3243:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> multiple anonymous DoFn lead to conflicting names
> -
>
> Key: BEAM-3243
> URL: https://issues.apache.org/jira/browse/BEAM-3243
> Project: Beam
>  Issue Type: Task
>  Components: sdk-java-core
>Reporter: Romain Manni-Bucau
>Assignee: Romain Manni-Bucau
> Fix For: 2.3.0
>
>






[jira] [Commented] (BEAM-3200) Streaming Pipeline throws RuntimeException when using DynamicDestinations and Method.FILE_LOADS

2017-11-21 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260429#comment-16260429
 ] 

Reuven Lax commented on BEAM-3200:
--

Do the tables all belong to the same dataset?




> Streaming Pipeline throws RuntimeException when using DynamicDestinations and 
> Method.FILE_LOADS
> ---
>
> Key: BEAM-3200
> URL: https://issues.apache.org/jira/browse/BEAM-3200
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.2.0
>Reporter: AJ
>Assignee: Reuven Lax
>Priority: Critical
>
> I am trying to use Method.FILE_LOADS for loading data into BQ in my streaming 
> pipeline using RC3 release of 2.2.0. I am writing to around 500 tables using 
> DynamicDestinations and I am also using 
> withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED). Everything works 
> fine the first time BigQuery load jobs get triggered. But on subsequent 
> triggers the pipeline throws a RuntimeException about a table not being 
> found, even though I created the pipeline with 
> CreateDisposition.CREATE_IF_NEEDED. The exact 
> exception is:
> {code}
> java.lang.RuntimeException: Failed to create load job with id prefix 
> 717aed9ed1ef4aa7a616e1132f8b7f6d_a0928cae3d670b32f01ab2d9fe5cc0ee_1_1,
>  reached max retries: 3, last failed load job: {
>   "configuration" : {
> "load" : {
>   "createDisposition" : "CREATE_NEVER",
>   "destinationTable" : {
> "datasetId" : ...,
> "projectId" : ...,
> "tableId" : 
>   },
> "errors" : [ {
>   "message" : "Not found: Table ,
>   "reason" : "notFound"
> } ],
> {code}
> My theory is that all the subsequent load jobs get triggered using the 
> CREATE_NEVER disposition, and 
> this might be due to 
> https://github.com/apache/beam/blob/release-2.2.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L140
> When using DynamicDestinations all the destination tables might not be known 
> during the first trigger and hence the pipeline's create disposition should 
> be respected.
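If the reporter's theory holds, the fix is to track "already created" per destination rather than downgrading the disposition globally after the first trigger, so a table first seen on a later trigger still honors CREATE_IF_NEEDED. A sketch of that idea (names and types are illustrative, not the actual WriteTables code):

```java
import java.util.HashSet;
import java.util.Set;

public class CreateDispositionSketch {

  public enum Disposition { CREATE_IF_NEEDED, CREATE_NEVER }

  // Destinations this writer has already created/verified once.
  private final Set<String> createdTables = new HashSet<>();

  // Only downgrade to CREATE_NEVER for destinations we have seen before;
  // a brand-new destination keeps the user's chosen disposition.
  public Disposition dispositionFor(String destination, Disposition userChoice) {
    return createdTables.add(destination) ? userChoice : Disposition.CREATE_NEVER;
  }
}
```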





[jira] [Commented] (BEAM-3200) Streaming Pipeline throws RuntimeException when using DynamicDestinations and Method.FILE_LOADS

2017-11-18 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258287#comment-16258287
 ] 

Reuven Lax commented on BEAM-3200:
--

I do see a potential bug that can trigger if any of the loads are large (over 
11 TB, or more than 10,000 files). Is it possible you are hitting this?

> Streaming Pipeline throws RuntimeException when using DynamicDestinations and 
> Method.FILE_LOADS
> ---
>
> Key: BEAM-3200
> URL: https://issues.apache.org/jira/browse/BEAM-3200
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.2.0
>Reporter: AJ
>Assignee: Reuven Lax
>Priority: Critical
>
> I am trying to use Method.FILE_LOADS for loading data into BQ in my streaming 
> pipeline using RC3 release of 2.2.0. I am writing to around 500 tables using 
> DynamicDestinations and I am also using 
> withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED). Everything works 
> fine the first time BigQuery load jobs get triggered. But on subsequent 
> triggers the pipeline throws a RuntimeException about a table not being 
> found, even though I created the pipeline with 
> CreateDisposition.CREATE_IF_NEEDED. The exact 
> exception is:
> {code}
> java.lang.RuntimeException: Failed to create load job with id prefix 
> 717aed9ed1ef4aa7a616e1132f8b7f6d_a0928cae3d670b32f01ab2d9fe5cc0ee_1_1,
>  reached max retries: 3, last failed load job: {
>   "configuration" : {
> "load" : {
>   "createDisposition" : "CREATE_NEVER",
>   "destinationTable" : {
> "datasetId" : ...,
> "projectId" : ...,
> "tableId" : 
>   },
> "errors" : [ {
>   "message" : "Not found: Table ,
>   "reason" : "notFound"
> } ],
> {code}
> My theory is that all the subsequent load jobs get triggered using the 
> CREATE_NEVER disposition, and 
> this might be due to 
> https://github.com/apache/beam/blob/release-2.2.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L140
> When using DynamicDestinations all the destination tables might not be known 
> during the first trigger and hence the pipeline's create disposition should 
> be respected.





[jira] [Commented] (BEAM-3200) Streaming Pipeline throws RuntimeException when using DynamicDestinations and Method.FILE_LOADS

2017-11-18 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258285#comment-16258285
 ] 

Reuven Lax commented on BEAM-3200:
--

Yes; however, the root cause of this bug is likely more complicated than you 
described, as from the code it should be working. I will first need to 
reproduce this myself to debug.

> Streaming Pipeline throws RuntimeException when using DynamicDestinations and 
> Method.FILE_LOADS
> ---
>
> Key: BEAM-3200
> URL: https://issues.apache.org/jira/browse/BEAM-3200
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.2.0
>Reporter: AJ
>Assignee: Reuven Lax
>Priority: Critical
>
> I am trying to use Method.FILE_LOADS for loading data into BQ in my streaming 
> pipeline using RC3 release of 2.2.0. I am writing to around 500 tables using 
> DynamicDestinations and I am also using 
> withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED). Everything works 
> fine the first time BigQuery load jobs get triggered. But on subsequent 
> triggers the pipeline throws a RuntimeException about a table not being 
> found, even though I created the pipeline with 
> CreateDisposition.CREATE_IF_NEEDED. The exact 
> exception is:
> {code}
> java.lang.RuntimeException: Failed to create load job with id prefix 
> 717aed9ed1ef4aa7a616e1132f8b7f6d_a0928cae3d670b32f01ab2d9fe5cc0ee_1_1,
>  reached max retries: 3, last failed load job: {
>   "configuration" : {
> "load" : {
>   "createDisposition" : "CREATE_NEVER",
>   "destinationTable" : {
> "datasetId" : ...,
> "projectId" : ...,
> "tableId" : 
>   },
> "errors" : [ {
>   "message" : "Not found: Table ,
>   "reason" : "notFound"
> } ],
> {code}
> My theory is that all the subsequent load jobs get triggered using the 
> CREATE_NEVER disposition, and 
> this might be due to 
> https://github.com/apache/beam/blob/release-2.2.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L140
> When using DynamicDestinations all the destination tables might not be known 
> during the first trigger and hence the pipeline's create disposition should 
> be respected.





[jira] [Commented] (BEAM-3200) Streaming Pipeline throws RuntimeException when using DynamicDestinations and Method.FILE_LOADS

2017-11-17 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257862#comment-16257862
 ] 

Reuven Lax commented on BEAM-3200:
--

We shuffle with the destination as the key before calling WriteTables. This 
means that each destination should have its own independent trigger index, as 
triggers are per key.
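Since triggers fire per key after a keyed shuffle, a firing for one destination never advances another destination's pane index. A toy illustration of that invariant (hypothetical, not Beam internals):

```java
import java.util.HashMap;
import java.util.Map;

public class PerKeyTriggerSketch {

  // Each destination key carries its own 0-based trigger/pane index,
  // as with a keyed GroupByKey: keys advance independently.
  private final Map<String, Integer> triggerIndex = new HashMap<>();

  // Record one firing for this destination and return its pane index.
  public int fire(String destination) {
    return triggerIndex.merge(destination, 1, Integer::sum) - 1;
  }
}
```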

> Streaming Pipeline throws RuntimeException when using DynamicDestinations and 
> Method.FILE_LOADS
> ---
>
> Key: BEAM-3200
> URL: https://issues.apache.org/jira/browse/BEAM-3200
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.2.0
>Reporter: AJ
>Assignee: Chamikara Jayalath
>Priority: Critical
>
> I am trying to use Method.FILE_LOADS for loading data into BQ in my streaming 
> pipeline using RC3 release of 2.2.0. I am writing to around 500 tables using 
> DynamicDestinations and I am also using 
> withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED). Everything works 
> fine the first time BigQuery load jobs get triggered. But on subsequent 
> triggers the pipeline throws a RuntimeException about a table not being 
> found, even though I created the pipeline with 
> CreateDisposition.CREATE_IF_NEEDED. The exact 
> exception is:
> {code}
> java.lang.RuntimeException: Failed to create load job with id prefix 
> 717aed9ed1ef4aa7a616e1132f8b7f6d_a0928cae3d670b32f01ab2d9fe5cc0ee_1_1,
>  reached max retries: 3, last failed load job: {
>   "configuration" : {
> "load" : {
>   "createDisposition" : "CREATE_NEVER",
>   "destinationTable" : {
> "datasetId" : ...,
> "projectId" : ...,
> "tableId" : 
>   },
> "errors" : [ {
>   "message" : "Not found: Table ,
>   "reason" : "notFound"
> } ],
> {code}
> My theory is that all the subsequent load jobs get triggered using the 
> CREATE_NEVER disposition, and 
> this might be due to 
> https://github.com/apache/beam/blob/release-2.2.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L140
> When using DynamicDestinations all the destination tables might not be known 
> during the first trigger and hence the pipeline's create disposition should 
> be respected.





[jira] [Closed] (BEAM-3169) WriteFiles data loss with some triggers

2017-11-16 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax closed BEAM-3169.

Resolution: Fixed

> WriteFiles data loss with some triggers
> ---
>
> Key: BEAM-3169
> URL: https://issues.apache.org/jira/browse/BEAM-3169
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>Priority: Critical
> Fix For: 2.2.0
>
>
> https://stackoverflow.com/questions/47113773/dataflow-2-1-0-streaming-application-is-not-cleaning-temp-folders/47142671?noredirect=1#comment81401472_47142671
> Details in comments





[jira] [Commented] (BEAM-3172) Side input reader exceptions are not properly handled when no elements are emitted

2017-11-13 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250661#comment-16250661
 ] 

Reuven Lax commented on BEAM-3172:
--

Does this need to be merged into the release branch? We're about to cut RC4.

> Side input reader exceptions are not properly handled when no elements are 
> emitted
> --
>
> Key: BEAM-3172
> URL: https://issues.apache.org/jira/browse/BEAM-3172
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 0.5.0, 0.6.0, 2.0.0, 2.1.0
>Reporter: Charles Chen
>Assignee: Charles Chen
>Priority: Blocker
> Fix For: 2.2.0
>
>
> In current versions of the Beam Python SDK, side input reader exceptions are 
> dropped when no elements are emitted. We should fix this behavior.





[jira] [Commented] (BEAM-3064) Update dataflow runner containers for the release branch

2017-10-30 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226128#comment-16226128
 ] 

Reuven Lax commented on BEAM-3064:
--

Since I'm about to cut RC2, should we wait for that?

> Update dataflow runner containers for the release branch
> 
>
> Key: BEAM-3064
> URL: https://issues.apache.org/jira/browse/BEAM-3064
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.2.0
>Reporter: Ahmet Altay
>Assignee: Valentyn Tymofieiev
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Blocked by:
> https://github.com/apache/beam/pull/3970
> https://github.com/apache/beam/pull/3941 - cp into release branch.
> cc: [~reuvenlax] [~robertwb] [~tvalentyn]





[jira] [Commented] (BEAM-2979) Race condition between KafkaIO.UnboundedKafkaReader.getWatermark() and KafkaIO.UnboundedKafkaReader.advance()

2017-10-30 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225730#comment-16225730
 ] 

Reuven Lax commented on BEAM-2979:
--

Do we need this for 2.2.0?

> Race condition between KafkaIO.UnboundedKafkaReader.getWatermark() and 
> KafkaIO.UnboundedKafkaReader.advance()
> -
>
> Key: BEAM-2979
> URL: https://issues.apache.org/jira/browse/BEAM-2979
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Wesley Tanaka
>Assignee: Raghu Angadi
>Priority: Blocker
> Fix For: 2.2.0
>
>
> getWatermark() looks like this:
> {noformat}
> @Override
> public Instant getWatermark() {
>   if (curRecord == null) {
> LOG.debug("{}: getWatermark() : no records have been read yet.", 
> name);
> return initialWatermark;
>   }
>   return source.spec.getWatermarkFn() != null
>   ? source.spec.getWatermarkFn().apply(curRecord) : curTimestamp;
> }
> {noformat}
> advance() has code in it that looks like this:
> {noformat}
>   curRecord = null; // user coders below might throw.
>   // apply user deserializers.
>   // TODO: write records that can't be deserialized to a 
> "dead-letter" additional output.
>   KafkaRecord<K, V> record = new KafkaRecord<>(
>   rawRecord.topic(),
>   rawRecord.partition(),
>   rawRecord.offset(),
>   consumerSpEL.getRecordTimestamp(rawRecord),
>   keyDeserializerInstance.deserialize(rawRecord.topic(), 
> rawRecord.key()),
>   valueDeserializerInstance.deserialize(rawRecord.topic(), 
> rawRecord.value()));
>   curTimestamp = (source.spec.getTimestampFn() == null)
>   ? Instant.now() : source.spec.getTimestampFn().apply(record);
>   curRecord = record;
> {noformat}
> There's a race condition between these two blocks of code which is exposed at 
> the very least in the FlinkRunner, which calls getWatermark() periodically 
> from a timer.
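One way to close this kind of race, sketched below with simplified types (not the actual KafkaIO code): publish the record and its timestamp together through a single volatile reference, so getWatermark() on the timer thread never observes the transient null that advance() writes before reassigning curRecord.

```java
public class WatermarkRaceSketch {

  // Immutable pair published atomically; replaces the separate
  // curRecord / curTimestamp fields that could be seen half-updated.
  static final class Snapshot {
    final String record;
    final long timestampMillis;
    Snapshot(String record, long timestampMillis) {
      this.record = record;
      this.timestampMillis = timestampMillis;
    }
  }

  private volatile Snapshot current; // null until the first record is read

  public void advance(String rawRecord) {
    // Build the snapshot fully before publishing; never null out the field.
    current = new Snapshot(rawRecord, System.currentTimeMillis());
  }

  public long getWatermarkMillis(long initialWatermark) {
    Snapshot s = current; // single volatile read, safe from a timer thread
    return s == null ? initialWatermark : s.timestampMillis;
  }
}
```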
> The symptom of the race condition is a stack trace that looks like this (SDK 
> 2.0.0):
> {noformat}
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:910)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:853)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:853)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: TimerException{java.lang.NullPointerException}
>   at 
> org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:220)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at org.apache.beam.sdk.io.kafka.KafkaIO$Read$1.apply(KafkaIO.java:568)
>   at org.apache.beam.sdk.io.kafka.KafkaIO$Read$1.apply(KafkaIO.java:565)
>   at 
> org.apache.beam.sdk.io.kafka.KafkaIO$UnboundedKafkaReader.getWatermark(KafkaIO.java:1210)
>   at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.onProcessingTime(UnboundedSourceWrapper.java:431)
>   at 
> 

[jira] [Commented] (BEAM-3107) Python Fnapi based workloads failing

2017-10-27 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223211#comment-16223211
 ] 

Reuven Lax commented on BEAM-3107:
--

What needs to be done here?

> Python Fnapi based workloads failing
> 
>
> Key: BEAM-3107
> URL: https://issues.apache.org/jira/browse/BEAM-3107
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Affects Versions: 2.2.0
>Reporter: Ahmet Altay
>Assignee: Valentyn Tymofieiev
>  Labels: portability
> Fix For: 2.2.0
>
>
> Python post commits are failing because the runner harness is not compatible 
> with the sdk harness.
> We need a new runner harness compatible with: 
> https://github.com/apache/beam/commit/80c6f4ec0c2a3cc3a441289a9cc8ff53cb70f863





[jira] [Commented] (BEAM-3109) Add an element batching transform

2017-10-27 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222617#comment-16222617
 ] 

Reuven Lax commented on BEAM-3109:
--

Please submit a cherrypick PR to the 2.2.0 release branch.

> Add an element batching transform
> -
>
> Key: BEAM-3109
> URL: https://issues.apache.org/jira/browse/BEAM-3109
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Robert Bradshaw
> Fix For: 2.2.0
>
>
> Merge https://github.com/apache/beam/pull/3971 to the release branch





[jira] [Updated] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts

2017-10-25 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2271:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> Release guide or pom.xml needs update to avoid releasing Python binary 
> artifacts
> 
>
> Key: BEAM-2271
> URL: https://issues.apache.org/jira/browse/BEAM-2271
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Daniel Halperin
>Assignee: Sourabh Bajaj
> Fix For: 2.3.0
>
>
> The following directories (and children) were discovered in 2.0.0-RC2 and 
> were present in 0.6.0.
> {code}
> sdks/python: build   dist.eggs   nose-1.3.7-py2.7.egg  (and child 
> contents)
> {code}
> Ideally, these artifacts, which are created during setup and testing, would 
> get created in the {{sdks/python/target/}} subfolder where they will 
> automatically get ignored. More info below.
> For 2.0.0, we will manually remove these files from the source release RC3+. 
> This should be fixed before the next release.
> Here is a list of other paths that get excluded, should they be useful.
> {code}
> 
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*]
> 
> 
>  
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?]
> 
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser]
> 
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties]
>   
> {code}
> This list is stored inside of this jar, which you can find by tracking 
> maven-assembly-plugin from the root apache pom: 
> https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6
> http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml





[jira] [Commented] (BEAM-3029) BigTable integration tests failing on Dataflow: UserAgent must not be empty

2017-10-15 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205375#comment-16205375
 ] 

Reuven Lax commented on BEAM-3029:
--

[~chamikara] any idea what's causing this?

> BigTable integration tests failing on Dataflow: UserAgent must not be empty
> ---
>
> Key: BEAM-3029
> URL: https://issues.apache.org/jira/browse/BEAM-3029
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Kenneth Knowles
>Assignee: Chamikara Jayalath
>Priority: Blocker
> Fix For: 2.2.0
>
>
> https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/4963/org.apache.beam$beam-runners-google-cloud-dataflow-java/testReport/junit/org.apache.beam.sdk.io.gcp.bigtable/BigtableReadIT/testE2EBigtableRead/
> {code}
> java.lang.IllegalArgumentException: UserAgent must not be empty or null
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> com.google.cloud.bigtable.grpc.BigtableSession.<init>(BigtableSession.java:233)
>   at 
> org.apache.beam.sdk.io.gcp.bigtable.BigtableServiceImpl.tableExists(BigtableServiceImpl.java:77)
>   at 
> org.apache.beam.sdk.io.gcp.bigtable.BigtableIO$Read.validate(BigtableIO.java:351)
> {code}





[jira] [Assigned] (BEAM-3029) BigTable integration tests failing on Dataflow: UserAgent must not be empty

2017-10-15 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax reassigned BEAM-3029:


Assignee: Chamikara Jayalath  (was: Daniel Oliveira)

> BigTable integration tests failing on Dataflow: UserAgent must not be empty
> ---
>
> Key: BEAM-3029
> URL: https://issues.apache.org/jira/browse/BEAM-3029
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Kenneth Knowles
>Assignee: Chamikara Jayalath
>Priority: Blocker
> Fix For: 2.2.0
>
>
> https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/4963/org.apache.beam$beam-runners-google-cloud-dataflow-java/testReport/junit/org.apache.beam.sdk.io.gcp.bigtable/BigtableReadIT/testE2EBigtableRead/
> {code}
> java.lang.IllegalArgumentException: UserAgent must not be empty or null
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>   at 
> com.google.cloud.bigtable.grpc.BigtableSession.<init>(BigtableSession.java:233)
>   at 
> org.apache.beam.sdk.io.gcp.bigtable.BigtableServiceImpl.tableExists(BigtableServiceImpl.java:77)
>   at 
> org.apache.beam.sdk.io.gcp.bigtable.BigtableIO$Read.validate(BigtableIO.java:351)
> {code}





[jira] [Assigned] (BEAM-3039) DatastoreIO.Write fails multiple mutations of same entity

2017-10-14 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax reassigned BEAM-3039:


Assignee: Chamikara Jayalath  (was: Reuven Lax)

> DatastoreIO.Write fails multiple mutations of same entity
> -
>
> Key: BEAM-3039
> URL: https://issues.apache.org/jira/browse/BEAM-3039
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Affects Versions: 2.1.0
>Reporter: Alexander Hoem Rosbach
>Assignee: Chamikara Jayalath
>Priority: Minor
>
> When streaming messages from a source that doesn't guarantee 
> once-only-delivery, but has at-least-once-delivery, then the 
> DatastoreIO.Write will throw an exception which leads to Dataflow retrying 
> the same commit multiple times before giving up. This leads to a significant 
> bottleneck in the pipeline, with the end-result that the data is dropped. 
> This should be handled better.
> There are a number of ways to fix this. One of them could be to drop any 
> duplicate mutations within one batch. Non-duplicates should also be handled 
> in some way. Perhaps use a non-transactional commit, or make sure the 
> mutations are committed in different commits.
> {code}
> com.google.datastore.v1.client.DatastoreException: A non-transactional commit 
> may not contain multiple mutations affecting the same entity., 
> code=INVALID_ARGUMENT
> 
> com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
> 
> com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
> com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
> com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
> 
> org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
> 
> org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:1253)
>  
> {code}
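The first suggestion above, dropping duplicate mutations within a batch, can be sketched as a last-write-wins dedupe keyed by entity key before the commit call. The types below are simplified stand-ins, not the Datastore client API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupeMutationsSketch {

  // Simplified stand-in for a Datastore mutation: entity key plus payload.
  public static final class Mutation {
    public final String entityKey;
    public final String payload;
    public Mutation(String entityKey, String payload) {
      this.entityKey = entityKey;
      this.payload = payload;
    }
  }

  // Keep only the last mutation per entity key, preserving first-seen order,
  // so one non-transactional commit never touches the same entity twice.
  public static List<Mutation> dedupe(List<Mutation> batch) {
    Map<String, Mutation> byKey = new LinkedHashMap<>();
    for (Mutation m : batch) {
      byKey.put(m.entityKey, m); // last write wins
    }
    return new ArrayList<>(byKey.values());
  }
}
```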





[jira] [Updated] (BEAM-2994) Refactor TikaIO

2017-10-08 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2994:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> Refactor TikaIO
> ---
>
> Key: BEAM-2994
> URL: https://issues.apache.org/jira/browse/BEAM-2994
> Project: Beam
>  Issue Type: Task
>  Components: sdk-java-extensions
>Affects Versions: 2.2.0
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
> Fix For: 2.3.0
>
>
> TikaIO is currently implemented as a BoundedSource and asynchronous 
> BoundedReader returning individual document's text chunks as Strings, 
> eventually passed unordered (and not linked to the original documents) to the 
> pipeline functions.
> It was decided in a recent beam-dev thread that TikaIO should initially 
> support the cases where a single composite bean per file, capturing the 
> file content, location (or name) and metadata, flows to the pipeline, 
> thus avoiding the need to implement TikaIO as a BoundedSource/Reader.
> Enhancing  TikaIO to support the streaming of the content into the pipelines 
> may be considered in the next phase, based on the specific use-cases... 





[jira] [Resolved] (BEAM-2954) update shade configurations in extension/sql

2017-10-07 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-2954.
--
Resolution: Fixed

> update shade configurations in extension/sql
> 
>
> Key: BEAM-2954
> URL: https://issues.apache.org/jira/browse/BEAM-2954
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>  Labels: 2.2.0, shade, sql
> Fix For: 2.2.0
>
>
> {{guava}} is shaded by default in beam modules, while {{calcite}} (a 
> dependency of SQL) includes {{guava}} unshaded. The error below is thrown 
> when calling {{calcite}} methods with guava classes.
> {code}
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.calcite.rel.core.Aggregate.getGroupSets()Lorg/apache/beam/sdks/java/extensions/sql/repackaged/com/google/common/collect/ImmutableList;
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.updateWindowTrigger(BeamAggregationRule.java:139)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.rule.BeamAggregationRule.onMatch(BeamAggregationRule.java:73)
>   at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
>   at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:650)
>   at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:368)
>   at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:313)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:149)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.validateAndConvert(BeamQueryPlanner.java:140)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.convertToBeamRel(BeamQueryPlanner.java:128)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:113)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlCli.compilePipeline(BeamSqlCli.java:62)
> {code}





[jira] [Commented] (BEAM-3011) Pin Runner harness container image in Python SDK

2017-10-07 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195614#comment-16195614
 ] 

Reuven Lax commented on BEAM-3011:
--

Any update on this issue? This is blocking 2.2.0 and has been unchanged for 
nearly a week.

> Pin Runner harness container image in Python SDK
> 
>
> Key: BEAM-3011
> URL: https://issues.apache.org/jira/browse/BEAM-3011
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Blocker
> Fix For: 2.2.0
>
>






[jira] [Resolved] (BEAM-2992) Remove codepaths for reading unsplit BigQuery sources

2017-09-27 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-2992.
--
Resolution: Fixed

> Remove codepaths for reading unsplit BigQuery sources
> -
>
> Key: BEAM-2992
> URL: https://issues.apache.org/jira/browse/BEAM-2992
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-gcp
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>Priority: Minor
> Fix For: 2.2.0
>
>






[jira] [Closed] (BEAM-2858) temp file garbage collection in BigQuery sink should be in a separate DoFn

2017-09-26 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax closed BEAM-2858.

Resolution: Fixed

> temp file garbage collection in BigQuery sink should be in a separate DoFn
> --
>
> Key: BEAM-2858
> URL: https://issues.apache.org/jira/browse/BEAM-2858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.1.0
>Reporter: Reuven Lax
>Assignee: Reuven Lax
> Fix For: 2.2.0
>
> Attachments: delete_file_diff.txt
>
>
> Currently the WriteTables transform deletes the set of input files as soon as 
> the load() job completes. However, this is incorrect: if the task fails 
> partway through deleting files (e.g. if the worker crashes), the task will be 
> retried. If the write disposition is WRITE_TRUNCATE, this can corrupt the 
> output.
> The resulting behavior depends on what BQ does when one of the input files is 
> missing (because we had previously deleted it). In the best case, BQ fails 
> the load; the step then keeps failing until the runner finally fails the 
> entire job. If, however, BQ ignores the missing file, the load will overwrite 
> the previously-written table with the smaller set of files and the job will 
> succeed. This is the worst-case scenario, as it results in data loss.
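A sketch of the direction named in the issue title (assumed shape, not the committed fix): perform deletion in a DoFn downstream of the load step, and make it idempotent so retries are safe. `CleanupTempFilesFn` is a hypothetical name.

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MoveOptions.StandardMoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical cleanup step: runs only after WriteTables has fully succeeded,
// so a retried load job still finds all of its input files.
class CleanupTempFilesFn extends DoFn<List<String>, Void> {
  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    List<ResourceId> files =
        c.element().stream()
            .map(path -> FileSystems.matchNewResource(path, /* isDirectory= */ false))
            .collect(Collectors.toList());
    // IGNORE_MISSING_FILES makes the deletion idempotent across retries.
    FileSystems.delete(files, StandardMoveOptions.IGNORE_MISSING_FILES);
  }
}
```

Because deletion happens in its own fused stage after the load succeeds, a crash mid-deletion retries only the cleanup, never the WRITE_TRUNCATE load.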





[jira] [Updated] (BEAM-2377) Cross compile flink runner to scala 2.11

2017-09-22 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2377:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> Cross compile flink runner to scala 2.11
> 
>
> Key: BEAM-2377
> URL: https://issues.apache.org/jira/browse/BEAM-2377
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Ole Langbehn
>Assignee: Aljoscha Krettek
> Fix For: 2.3.0
>
>
> The flink runner is compiled for flink built against scala 2.10. flink cross 
> compiles its scala artifacts against 2.10 and 2.11.
> In order to make it possible to use beam with the flink runner in scala 2.11 
> projects, it would be nice if you could publish the flink runner for 2.11 
> next to 2.10.





[jira] [Commented] (BEAM-2298) Java WordCount doesn't work in Window OS for glob expressions or file: prefixed paths

2017-09-22 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176954#comment-16176954
 ] 

Reuven Lax commented on BEAM-2298:
--

I am resolving this for now. Please reopen if you believe the issue still 
exists.

> Java WordCount doesn't work in Window OS for glob expressions or file: 
> prefixed paths
> -
>
> Key: BEAM-2298
> URL: https://issues.apache.org/jira/browse/BEAM-2298
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Pei He
>Assignee: Flavio Fiszman
> Fix For: 2.2.0
>
>
> I am not able to build beam repo in Windows OS, so I copied the jar file from 
> my Mac.
> WordCount failed with the following cmd:
> java -cp beam-examples-java-2.0.0-jar-with-dependencies.jar org.apache.beam.examples.WordCount --inputFile=input.txt --output=counts
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes
> INFO: Filepattern input.txt matched 1 files with total size 0
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.FileBasedSource expandFilePattern
> INFO: Matched 1 files for pattern input.txt
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.FileBasedSource split
> INFO: Splitting filepattern input.txt into bundles of size 0 took 0 ms and produced 1 files and 0 bundles
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.WriteFiles$2 processElement
> INFO: Finalizing write operation TextWriteOperation{tempDirectory=C:\Users\Pei\Desktop\.temp-beam-2017-05-135_13-09-48-1\, windowedWrites=false}.
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.WriteFiles$2 processElement
> INFO: Creating 1 empty output shards in addition to 0 written for a total of 1.
> Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException: Unable to find registrar for c
> at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
> at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
> at org.apache.beam.examples.WordCount.main(WordCount.java:184)
> Caused by: java.lang.IllegalStateException: Unable to find registrar for c
> at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
> at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)
> at org.apache.beam.sdk.io.FileSystems.matchResources(FileSystems.java:174)
> at org.apache.beam.sdk.io.FileSystems.filterMissingFiles(FileSystems.java:367)
> at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:251)
> at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:641)
> at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:529)
> at org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)
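The root cause is visible with plain `java.net.URI`: a Windows drive letter parses as a URI scheme, which after lower-casing yields the "Unable to find registrar for c" lookup failure. A minimal demonstration (illustrative; not Beam's exact parsing code):

```java
import java.net.URI;

public class SchemeDemo {
  public static void main(String[] args) {
    // "C:/Users/..." is a syntactically valid URI whose scheme is the drive letter,
    // so a filesystem-registrar lookup by scheme resolves to "c" instead of "file".
    URI uri = URI.create("C:/Users/Pei/Desktop/counts");
    System.out.println(uri.getScheme()); // C
  }
}
```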





[jira] [Resolved] (BEAM-2298) Java WordCount doesn't work in Window OS for glob expressions or file: prefixed paths

2017-09-22 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-2298.
--
Resolution: Fixed

> Java WordCount doesn't work in Window OS for glob expressions or file: 
> prefixed paths
> -
>
> Key: BEAM-2298
> URL: https://issues.apache.org/jira/browse/BEAM-2298
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Pei He
>Assignee: Flavio Fiszman
> Fix For: 2.2.0
>
>
> I am not able to build beam repo in Windows OS, so I copied the jar file from 
> my Mac.
> WordCount failed with the following cmd:
> java -cp beam-examples-java-2.0.0-jar-with-dependencies.jar org.apache.beam.examples.WordCount --inputFile=input.txt --output=counts
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes
> INFO: Filepattern input.txt matched 1 files with total size 0
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.FileBasedSource expandFilePattern
> INFO: Matched 1 files for pattern input.txt
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.FileBasedSource split
> INFO: Splitting filepattern input.txt into bundles of size 0 took 0 ms and produced 1 files and 0 bundles
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.WriteFiles$2 processElement
> INFO: Finalizing write operation TextWriteOperation{tempDirectory=C:\Users\Pei\Desktop\.temp-beam-2017-05-135_13-09-48-1\, windowedWrites=false}.
> May 15, 2017 6:09:48 AM org.apache.beam.sdk.io.WriteFiles$2 processElement
> INFO: Creating 1 empty output shards in addition to 0 written for a total of 1.
> Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException: Unable to find registrar for c
> at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
> at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
> at org.apache.beam.examples.WordCount.main(WordCount.java:184)
> Caused by: java.lang.IllegalStateException: Unable to find registrar for c
> at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
> at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)
> at org.apache.beam.sdk.io.FileSystems.matchResources(FileSystems.java:174)
> at org.apache.beam.sdk.io.FileSystems.filterMissingFiles(FileSystems.java:367)
> at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:251)
> at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:641)
> at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:529)
> at org.apache.beam.sdk.io.WriteFiles$2.processElement(WriteFiles.java:592)





[jira] [Commented] (BEAM-2984) Job submission too large with embedded Beam protos

2017-09-22 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176953#comment-16176953
 ] 

Reuven Lax commented on BEAM-2984:
--

Is this a 2.2.0 blocker?

> Job submission too large with embedded Beam protos
> --
>
> Key: BEAM-2984
> URL: https://issues.apache.org/jira/browse/BEAM-2984
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Empirically, naively putting context around the {{DoFnInfo}} could cause a 
> blowup of 40%, which is too much and might cause jobs that were well under 
> API size limits to start to fail.
> There's a certain amount of wiggle room since it is hard to control the 
> submission size anyhow, but 40% is way too much.





[jira] [Resolved] (BEAM-2956) DataflowRunner incorrectly reports the user agent for the Dataflow distribution

2017-09-21 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-2956.
--
Resolution: Fixed

> DataflowRunner incorrectly reports the user agent for the Dataflow 
> distribution
> ---
>
> Key: BEAM-2956
> URL: https://issues.apache.org/jira/browse/BEAM-2956
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Luke Cwik
>Assignee: Luke Cwik
> Fix For: 2.2.0
>
>
> The DataflowRunner when distributed with the Dataflow SDK distribution may 
> incorrectly submit a user agent and properties from the Apache Beam 
> distribution.
> This occurs when the Apache Beam jars appear on the classpath before the 
> Dataflow SDK distribution. The fix is to not have two files at the same path 
> but to use two different paths, where the lack of the second path means that 
> we are using the Apache Beam distribution and its existence implies we are 
> using the Dataflow distribution.





[jira] [Closed] (BEAM-2834) NullPointerException @ BigQueryServicesImpl.java:759

2017-09-21 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax closed BEAM-2834.

Resolution: Fixed

> NullPointerException @ BigQueryServicesImpl.java:759
> 
>
> Key: BEAM-2834
> URL: https://issues.apache.org/jira/browse/BEAM-2834
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.1.0
>Reporter: Andy Barron
>Assignee: Reuven Lax
> Fix For: 2.2.0
>
>
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:759)
>   at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:809)
>   at 
> org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:126)
>   at 
> org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:96)
> {code}
> Going through the stack trace, the likely culprit is a null {{retryPolicy}} 
> in {{StreamingWriteFn}}.
> For context, this error showed up about 70 times between 1 am and 1 pm 
> Pacific time (2017-08-31) on a Dataflow streaming job.
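Until the default is fixed, one defensive workaround is to always set the failed-insert retry policy explicitly when constructing the write, so {{StreamingWriteFn}} never observes a null policy. A sketch against Beam's public BigQueryIO builder (configuration fragment, not the committed fix):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;

// Sketch: configure an explicit retry policy on the streaming write so the
// retryPolicy field used by StreamingWriteFn cannot be null.
BigQueryIO.Write<TableRow> write =
    BigQueryIO.writeTableRows()
        .to("PROJECT_ID:DATASET_ID.TABLE_ID")
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors());
```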





[jira] [Closed] (BEAM-2870) BQ Partitioned Table Write Fails When Destination has Partition Decorator

2017-09-21 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax closed BEAM-2870.

Resolution: Fixed

> BQ Partitioned Table Write Fails When Destination has Partition Decorator
> -
>
> Key: BEAM-2870
> URL: https://issues.apache.org/jira/browse/BEAM-2870
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.2.0
> Environment: Dataflow Runner, Streaming, 10 x (n1-highmem-8 & 500gb 
> SDD)
>Reporter: Steven Jon Anderson
>Assignee: Reuven Lax
>  Labels: bigquery, dataflow, google, google-cloud-bigquery, 
> google-dataflow
> Fix For: 2.2.0
>
>
> Dataflow Job ID: 
> https://console.cloud.google.com/dataflow/job/2017-09-08_23_03_14-14637186041605198816
> Tagging [~reuvenlax] as I believe he built the time partitioning integration 
> that was merged into master.
> *Background*
> Our production pipeline ingests millions of events per day and routes events 
> into our clients' numerous tables. To keep costs down, all of our tables are 
> partitioned. However, this requires that we create the tables before we allow 
> events to process as creating partitioned tables isn't supported in 2.1.0. 
> We've been looking forward to [~reuvenlax]'s partition table write feature 
> ([#3663|https://github.com/apache/beam/pull/3663]) to get merged into master 
> for some time now as it'll allow us to launch our client platforms much, much 
> faster. Today we got around to testing the 2.2.0 nightly and discovered this 
> bug.
> *Issue*
> Our pipeline writes to a table with a decorator. Writing to an existing 
> partitioned table with a decorator succeeds. Writing to a partitioned table 
> that doesn't exist, without a decorator, also succeeds. *However, writing to 
> a partitioned table that doesn't exist, with a decorator, fails*.
> *Example Implementation*
> {code:java}
> BigQueryIO.writeTableRows()
>   .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
>   .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
>   .withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
>   .to(new DynamicDestinations() {
> @Override
> public String getDestination(ValueInSingleWindow element) {
>   return "PROJECT_ID:DATASET_ID.TABLE_ID$20170902";
> }
> @Override
> public TableDestination getTable(String destination) {
>   TimePartitioning DAY_PARTITION = new TimePartitioning().setType("DAY");
>   return new TableDestination(destination, null, DAY_PARTITION);
> }
> @Override
> public TableSchema getSchema(String destination) {
>   return TABLE_SCHEMA;
> }
>   })
> {code}
> *Relevant Logs & Errors in StackDriver*
> {code:none}
> 23:06:26.790 
> Trying to create BigQuery table: PROJECT_ID:DATASET_ID.TABLE_ID$20170902
> 23:06:26.873 
> Invalid table ID \"TABLE_ID$20170902\". Table IDs must be alphanumeric (plus 
> underscores) and must be at most 1024 characters long. Also, Table decorators 
> cannot be used.
> {code}
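The error suggests the table-creation path passes the decorated ID straight to the BigQuery API, which rejects `$` in table IDs. A client-side workaround sketch (hypothetical; `baseTableId` is an illustrative helper, not part of Beam): strip the partition decorator from the ID used for table creation while keeping the decorated ID for the actual write.

```java
public class PartitionDecorator {
  /** Strip a BigQuery partition decorator (e.g. "$20170902") from a table ID. */
  static String baseTableId(String tableId) {
    int dollar = tableId.indexOf('$');
    return dollar < 0 ? tableId : tableId.substring(0, dollar);
  }

  public static void main(String[] args) {
    // The create call would use the base ID; the write keeps the decorator.
    System.out.println(baseTableId("TABLE_ID$20170902")); // TABLE_ID
    System.out.println(baseTableId("TABLE_ID"));          // TABLE_ID
  }
}
```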





[jira] [Updated] (BEAM-2345) Version configuration of plugins / dependencies in root pom.xml is inconsistent

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2345:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> Version configuration of plugins / dependencies in root pom.xml is 
> inconsistent
> ---
>
> Key: BEAM-2345
> URL: https://issues.apache.org/jira/browse/BEAM-2345
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Jason Kuster
>Assignee: Jason Kuster
>Priority: Minor
> Fix For: 2.3.0
>
>
> Versioning in the root pom.xml is in some places controlled by the properties 
> section and in others set inline. Move all versioning of plugins / 
> dependencies to the properties section.





[jira] [Commented] (BEAM-2956) DataflowRunner incorrectly reports the user agent for the Dataflow distribution

2017-09-19 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172626#comment-16172626
 ] 

Reuven Lax commented on BEAM-2956:
--

Can this be closed?

> DataflowRunner incorrectly reports the user agent for the Dataflow 
> distribution
> ---
>
> Key: BEAM-2956
> URL: https://issues.apache.org/jira/browse/BEAM-2956
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Luke Cwik
>Assignee: Luke Cwik
> Fix For: 2.2.0
>
>
> The DataflowRunner when distributed with the Dataflow SDK distribution may 
> incorrectly submit a user agent and properties from the Apache Beam 
> distribution.
> This occurs when the Apache Beam jars appear on the classpath before the 
> Dataflow SDK distribution. The fix is to not have two files at the same path 
> but to use two different paths, where the lack of the second path means that 
> we are using the Apache Beam distribution and its existence implies we are 
> using the Dataflow distribution.





[jira] [Updated] (BEAM-2604) Delegate beam metrics to runners

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2604:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> Delegate beam metrics to runners
> 
>
> Key: BEAM-2604
> URL: https://issues.apache.org/jira/browse/BEAM-2604
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-flink, runner-spark
>Reporter: Cody
>Assignee: Aljoscha Krettek
> Fix For: 2.3.0
>
>
> Delegate beam metrics to runners to avoid forwarding updates, i.e., extract 
> updates from beam metrics and commit updates in runners.
> For Flink/Spark runners, we can reference metrics within the runner's metrics 
> system in beam pipelines and update them directly without forwarding.





[jira] [Updated] (BEAM-2603) Add Meter in beam metrics

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2603:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> Add Meter in beam metrics
> -
>
> Key: BEAM-2603
> URL: https://issues.apache.org/jira/browse/BEAM-2603
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-core, sdk-java-core
>Reporter: Cody
>Assignee: Cody
> Fix For: 2.3.0
>
>
> 1. Add Meter interface and implementation
> 2. Add MeterData, MeterResult. Include MeterData in metric updates, and 
> MeterResult in metric query results.
> 3. Add corresponding changes regarding MeterResult and MeterData.
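For orientation, a Meter (in the Dropwizard-metrics sense) counts events and exposes a rate. A minimal self-contained sketch of the concept (illustrative only; not the proposed Beam API):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative Meter sketch: counts events and exposes a mean rate. */
public class SimpleMeter {
  private final AtomicLong count = new AtomicLong();
  private final long startMillis;

  public SimpleMeter(long nowMillis) {
    this.startMillis = nowMillis;
  }

  public void mark() {
    count.incrementAndGet();
  }

  public void mark(long n) {
    count.addAndGet(n);
  }

  /** Mean rate in events per second since construction. */
  public double meanRate(long nowMillis) {
    long elapsedMillis = Math.max(1, nowMillis - startMillis);
    return count.get() * 1000.0 / elapsedMillis;
  }

  public static void main(String[] args) {
    SimpleMeter meter = new SimpleMeter(0);
    meter.mark(10); // ten events in the first second
    System.out.println(meter.meanRate(1_000)); // 10.0
  }
}
```

A production Meter would also track decaying windowed rates (1/5/15 min), which is what the MeterData/MeterResult plumbing above would carry through metric updates.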





[jira] [Commented] (BEAM-2576) Move non-core transform payloads out of Runner API proto

2017-09-19 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172624#comment-16172624
 ] 

Reuven Lax commented on BEAM-2576:
--

Moving to 2.3.0

> Move non-core transform payloads out of Runner API proto
> 
>
> Key: BEAM-2576
> URL: https://issues.apache.org/jira/browse/BEAM-2576
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: portability
> Fix For: 2.3.0
>
>
> The presence of e.g. WriteFilesPayload in beam_runner_api.proto makes it 
> appear as though this is a core part of the model. While it is a very 
> important transform, this is actually just a payload for a composite, like 
> any other, and should not be treated so specially.





[jira] [Updated] (BEAM-2576) Move non-core transform payloads out of Runner API proto

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2576:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> Move non-core transform payloads out of Runner API proto
> 
>
> Key: BEAM-2576
> URL: https://issues.apache.org/jira/browse/BEAM-2576
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: portability
> Fix For: 2.3.0
>
>
> The presence of e.g. WriteFilesPayload in beam_runner_api.proto makes it 
> appear as though this is a core part of the model. While it is a very 
> important transform, this is actually just a payload for a composite, like 
> any other, and should not be treated so specially.





[jira] [Commented] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts

2017-09-19 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172623#comment-16172623
 ] 

Reuven Lax commented on BEAM-2271:
--

Can this bug be closed now?

> Release guide or pom.xml needs update to avoid releasing Python binary 
> artifacts
> 
>
> Key: BEAM-2271
> URL: https://issues.apache.org/jira/browse/BEAM-2271
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Daniel Halperin
>Assignee: Sourabh Bajaj
> Fix For: 2.2.0
>
>
> The following directories (and children) were discovered in 2.0.0-RC2 and 
> were present in 0.6.0.
> {code}
> sdks/python: build   dist.eggs   nose-1.3.7-py2.7.egg  (and child contents)
> {code}
> Ideally, these artifacts, which are created during setup and testing, would 
> get created in the {{sdks/python/target/}} subfolder where they will 
> automatically get ignored. More info below.
> For 2.0.0, we will manually remove these files from the source release RC3+. 
> This should be fixed before the next release.
> Here is a list of other paths that get excluded, should they be useful.
> {code}
> 
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*]
> 
> 
>  
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?]
> 
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser]
> 
> 
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup]
> 
> %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties]
>   
> {code}
> This list is stored inside of this jar, which you can find by tracking 
> maven-assembly-plugin from the root apache pom: 
> https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6
> http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml





[jira] [Commented] (BEAM-1868) CreateStreamTest testMultiOutputParDo is flaky on the Spark runner

2017-09-19 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172620#comment-16172620
 ] 

Reuven Lax commented on BEAM-1868:
--

This still hasn't been worked on AFAICT, so bumping to 2.3.0.

> CreateStreamTest testMultiOutputParDo is flaky on the Spark runner
> --
>
> Key: BEAM-1868
> URL: https://issues.apache.org/jira/browse/BEAM-1868
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Thomas Groh
>  Labels: flake
> Fix For: 2.3.0
>
>
> Example excerpt from a Jenkins failure:
> {code}
> Expected: iterable over [<1>, <2>, <3>] in any order
>  but: No item matches: <1>, <2>, <3> in []
>   at 
> org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:170)
>   at org.apache.beam.sdk.testing.PAssert.that(PAssert.java:366)
>   at org.apache.beam.sdk.testing.PAssert.that(PAssert.java:358)
>   at 
> org.apache.beam.runners.spark.translation.streaming.CreateStreamTest.testMultiOutputParDo(CreateStreamTest.java:387)
> {code}





[jira] [Updated] (BEAM-1868) CreateStreamTest testMultiOutputParDo is flaky on the Spark runner

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-1868:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> CreateStreamTest testMultiOutputParDo is flaky on the Spark runner
> --
>
> Key: BEAM-1868
> URL: https://issues.apache.org/jira/browse/BEAM-1868
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Thomas Groh
>  Labels: flake
> Fix For: 2.3.0
>
>
> Example excerpt from a Jenkins failure:
> {code}
> Expected: iterable over [<1>, <2>, <3>] in any order
>  but: No item matches: <1>, <2>, <3> in []
>   at 
> org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:170)
>   at org.apache.beam.sdk.testing.PAssert.that(PAssert.java:366)
>   at org.apache.beam.sdk.testing.PAssert.that(PAssert.java:358)
>   at 
> org.apache.beam.runners.spark.translation.streaming.CreateStreamTest.testMultiOutputParDo(CreateStreamTest.java:387)
> {code}





[jira] [Resolved] (BEAM-2829) Add ability to set job labels in DataflowPipelineOptions

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax resolved BEAM-2829.
--
Resolution: Fixed

> Add ability to set job labels in DataflowPipelineOptions
> 
>
> Key: BEAM-2829
> URL: https://issues.apache.org/jira/browse/BEAM-2829
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-dataflow
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Zongwei Zhou
>Assignee: Zongwei Zhou
>Priority: Minor
> Fix For: 2.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Enable setting job labels via --labels in DataflowPipelineOptions (the earlier 
> Dataflow SDK 1.x supported this).
> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java
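For illustration, a command-line flag like --labels=env=prod,team=data has to be parsed into a key/value map before the labels can be attached to the Dataflow job. The sketch below is a hypothetical, self-contained parser (the class name LabelParser and the comma/equals format are assumptions for illustration, not the actual DataflowPipelineOptions implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelParser {
    // Parses "k1=v1,k2=v2" into an insertion-ordered map.
    // Entries without a key or without an '=' are rejected.
    static Map<String, String> parseLabels(String spec) {
        Map<String, String> labels = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length != 2 || kv[0].isEmpty()) {
                throw new IllegalArgumentException("Bad label entry: " + pair);
            }
            labels.put(kv[0], kv[1]);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(parseLabels("env=prod,team=data"));
    }
}
```

Splitting with a limit of 2 keeps any '=' characters inside the value intact, so a label value such as "a=b" survives parsing.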



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2829) Add ability to set job labels in DataflowPipelineOptions

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax closed BEAM-2829.


> Add ability to set job labels in DataflowPipelineOptions
> 
>
> Key: BEAM-2829
> URL: https://issues.apache.org/jira/browse/BEAM-2829
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-dataflow
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Zongwei Zhou
>Assignee: Zongwei Zhou
>Priority: Minor
> Fix For: 2.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Enable setting job labels via --labels in DataflowPipelineOptions (the earlier 
> Dataflow SDK 1.x supported this).
> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2273) mvn clean doesn't fully clean up archetypes.

2017-09-19 Thread Reuven Lax (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reuven Lax updated BEAM-2273:
-
Fix Version/s: (was: 2.2.0)
   2.3.0

> mvn clean doesn't fully clean up archetypes.
> 
>
> Key: BEAM-2273
> URL: https://issues.apache.org/jira/browse/BEAM-2273
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Jason Kuster
>Assignee: Jason Kuster
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

