Re: [RFC][Design] Watermark Estimation in the Go Sdk

2022-03-31 Thread Robert Burke
I like the approach, and the doc! I've added some questions/go specific
nits and questions.

I'm lifting one question here for information gathering:

Do streaming SDFs in Java/Python require event times be emitted with every
element, or is it defaulted to past End of GlobalWindow?

On Thu, Mar 31, 2022, 9:19 AM Jack McCluskey  wrote:

> Looks good, Danny!
>
> On Wed, Mar 30, 2022 at 8:22 PM Robert Bradshaw 
> wrote:
>
>> Thanks. I didn't look much at the implementation, but the API makes
>> sense to me.
>>
>> On Wed, Mar 30, 2022 at 10:23 AM Danny McCormick
>>  wrote:
>> >
>> > Hey everyone, I put together a design doc for adding Watermark
>> Estimation in the Go Sdk and would appreciate any thoughts you have!
>> >
>> > Watermark estimation allows a DoFn to estimate the lower bound of data
>> yet to be emitted so that future steps can perform aggregation without
>> waiting for the DoFn to complete. Right now, both Java and Python support
>> watermark estimation by allowing users to specify a custom watermark
>> estimator that is invoked when an element/restriction pair is split or
>> checkpointed to produce a watermark estimate. They also allow users to
>> specify custom watermark state that is persisted across splits and
>> checkpoints. This document proposes introducing similar DoFn lifecycle
>> methods for creating and managing watermark estimators and state in the Go
>> Sdk.
>> >
>> > Please share any feedback here:
>> https://docs.google.com/document/d/1DqCYJ-J1YGNelCRIcN5v6BQsZxJB2l5uWBmN4ti--Ew/edit?usp=sharing
>> >
>> > Thanks,
>> > Danny
>>
>


Beam custom coder

2022-03-31 Thread Fu Yi (BT - AI, Analytics & Data Domain)
Hello guys,

I am trying to find a example how to define deterministic coder for namedtuple 
but fail to find any. In this official doc 
https://beam.apache.org/documentation/sdks/python-type-safety/#kinds-of-type-hints,
 it says code example is included for the PlayerCoder, but there is actually no 
code in that page. Is this something under development?

Can anyone point me in the right direction?

Thanks,

Best regards,
Yi


The information contained in this e-mail message may be privileged, 
confidential, and protected from disclosure. Any unauthorized use, printing, 
copying, disclosure or dissemination of this communication may be subject to 
legal restriction or sanction. If you think that you have received this e-mail 
message in error, please reply to the sender and delete this message from your 
computer.


General Information


Flaky test issue report (52)

2022-03-31 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-14216: Multiple XVR Suites 
having similar flakes simultaneously (created 2022-03-31)
https://issues.apache.org/jira/browse/BEAM-14172: beam_PreCommit_PythonDocs 
failing (jinja2) (created 2022-03-24)
https://issues.apache.org/jira/browse/BEAM-13952: Dataflow streaming tests 
failing new AfterSynchronizedProcessingTime test (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13859: Test flake: 
test_split_half_sdf (created 2022-02-09)
https://issues.apache.org/jira/browse/BEAM-13850: 
beam_PostCommit_Python_Examples_Dataflow failing (created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13822: GBK and CoGBK streaming 
Java load tests failing (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13810: Flaky tests: Gradle build 
daemon disappeared unexpectedly (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13809: beam_PostCommit_XVR_Flink 
flaky: Connection refused (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13797: Flakes: Failed to load 
cache entry (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13783: 
apache_beam.transforms.combinefn_lifecycle_test.LocalCombineFnLifecycleTest.test_combine
 is flaky (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13741: 
:sdks:java:extensions:sql:hcatalog:compileJava failing in 
beam_Release_NightlySnapshot  (created 2022-01-25)
https://issues.apache.org/jira/browse/BEAM-13708: flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored (created 2022-01-20)
https://issues.apache.org/jira/browse/BEAM-13575: Flink 
testParDoRequiresStableInput flaky (created 2021-12-28)
https://issues.apache.org/jira/browse/BEAM-13519: Java precommit flaky 
(timing out) (created 2021-12-22)
https://issues.apache.org/jira/browse/BEAM-13500: NPE in Flink Portable 
ValidatesRunner streaming suite (created 2021-12-21)
https://issues.apache.org/jira/browse/BEAM-13453: Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use 
(created 2021-12-13)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13367: 
[beam_PostCommit_Python36] [ 
apache_beam.io.gcp.experimental.spannerio_read_it_test] Failure summary 
(created 2021-12-01)
https://issues.apache.org/jira/browse/BEAM-13312: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
 is flaky in Java Spark ValidatesRunner suite  (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13311: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite. (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13237: 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2 (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13234: Flake in 
StreamingWordCountIT.test_streaming_wordcount_it (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13025: pubsublite.ReadWriteIT 
flaky in beam_PostCommit_Java_DataflowV2   (created 2021-10-08)
https://issues.apache.org/jira/browse/BEAM-12928: beam_PostCommit_Python36 
- CrossLanguageSpannerIOTest - flakey failing (created 2021-09-21)
https://issues.apache.org/jira/browse/BEAM-12859: 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12858: 
org.apache.beam.sdk.io.gcp.datastore.RampupThrottlingFnTest.testRampupThrottler 
is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12809: 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky (created 2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12794: 
PortableRunnerTestWithExternalEnv.test_pardo_timers flaky (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12793: 
beam_PostRelease_NightlySnapshot failed (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12673: 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey (created 2021-07-28)
https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in 

P1 issues report (70)

2022-03-31 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-14191: CrossLanguageJdbcIOTest 
broken with "Cannot load JDBC driver class 'com.mysql.cj.jdbc.Driver'" (created 
2022-03-28)
https://issues.apache.org/jira/browse/BEAM-14138: Python PostCommit BQ test 
failures due to NOT_FOUND for Dataset (created 2022-03-21)
https://issues.apache.org/jira/browse/BEAM-14135: BigQuery Storage API 
insert with writeResult retry and write to error table (created 2022-03-20)
https://issues.apache.org/jira/browse/BEAM-14126: Python 3.10 Support 
(created 2022-03-18)
https://issues.apache.org/jira/browse/BEAM-14064: ElasticSearchIO#Write 
buffering and outputting across windows (created 2022-03-07)
https://issues.apache.org/jira/browse/BEAM-14017: 
beam_PreCommit_CommunityMetrics_Cron is failing. (created 2022-03-01)
https://issues.apache.org/jira/browse/BEAM-13953: Document BigQueryIO 
Storage Write API methods (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13952: Dataflow streaming tests 
failing new AfterSynchronizedProcessingTime test (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13950: PVR_Spark2_Streaming 
perma-red (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13920: Beam x-lang Dataflow 
tests failing due to _InactiveRpcError (created 2022-02-10)
https://issues.apache.org/jira/browse/BEAM-13852: 
KafkaIO.read.withDynamicRead() doesn't pick up new TopicPartitions (created 
2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13850: 
beam_PostCommit_Python_Examples_Dataflow failing (created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13822: GBK and CoGBK streaming 
Java load tests failing (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13805: Simplify version override 
for Dev versions of the Go SDK. (created 2022-02-02)
https://issues.apache.org/jira/browse/BEAM-13798: Upgrade Kubernetes 
Clusters (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13747: Add integration testing 
for BQ Storage API  write modes (created 2022-01-26)
https://issues.apache.org/jira/browse/BEAM-13741: 
:sdks:java:extensions:sql:hcatalog:compileJava failing in 
beam_Release_NightlySnapshot  (created 2022-01-25)
https://issues.apache.org/jira/browse/BEAM-13715: Kafka commit offset drop 
data on failure for runners that have non-checkpointing shuffle (created 
2022-01-21)
https://issues.apache.org/jira/browse/BEAM-13582: Beam website precommit 
mentions broken links, but passes. (created 2021-12-30)
https://issues.apache.org/jira/browse/BEAM-13487: WriteToBigQuery Dynamic 
table destinations returns wrong tableId (created 2021-12-17)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13164: Race between member 
variable being accessed due to leaking uninitialized state via 
OutboundObserverFactory (created 2021-11-01)
https://issues.apache.org/jira/browse/BEAM-13132: WriteToBigQuery submits a 
duplicate BQ load job if a 503 error code is returned from googleapi (created 
2021-10-27)
https://issues.apache.org/jira/browse/BEAM-13087: 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible 
(created 2021-10-20)
https://issues.apache.org/jira/browse/BEAM-13078: Python DirectRunner does 
not emit data at GC time (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13076: Python AfterAny, AfterAll 
do not follow spec (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13010: Delete orphaned files 
(created 2021-10-06)
https://issues.apache.org/jira/browse/BEAM-12995: Consumer group with 
random prefix (created 2021-10-04)
https://issues.apache.org/jira/browse/BEAM-12959: Dataflow error in 
CombinePerKey operation (created 2021-09-26)
https://issues.apache.org/jira/browse/BEAM-12867: Either Create or 
DirectRunner fails to produce all elements to the following transform (created 
2021-09-09)
https://issues.apache.org/jira/browse/BEAM-12843: (Broken Pipe induced) 
Bricked Dataflow Pipeline  (created 2021-09-06)
https://issues.apache.org/jira/browse/BEAM-12807: Java creates an incorrect 
pipeline proto when core-construction-java jar is not in the CLASSPATH (created 
2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12799: "Java IO IT Tests" - 
missing data in grafana (created 2021-08-25)
ht

Re: [PROPOSAL] Stop Spark2 support in Spark Runner

2022-03-31 Thread Alexey Romanenko


> On 31 Mar 2022, at 18:02, Robert Bradshaw  wrote:
> 
> Generally makes sense to me, though I'm curious what the maintenance
> burden is *high or low) in keeping it around.

Well, we need to provide two versions of spark runner artifacts, job-servers 
and docker images, to test them separately (different Jenkins jobs). We also 
have two different code paths for the cases where API is not compatible between 
Spark2 and Spark3.  

> We should probably
> deprecate it for a period of time before removing support.

Agree and I’d suggest even ask users on user@/twitter before. 


Actually, I see some problem with naming. By default, we used to call “Spark 
runner” as a runner that works with Spark2 (for example, the artifacts [1][2]). 
When Spark3 support was added, all its Beam artifacts and related names reflect 
its version [3][4]. So, it’s not clear how it will be better to deal with this, 
especially, taking into account, that new Spark version (4, 5, etc) will be 
available sooner or later. Perhaps, to avoid a confusion in the future, we need 
to follow the same naming pattern.

—
Alexey

[1] https://search.maven.org/artifact/org.apache.beam/beam-runners-spark
[2] 
https://search.maven.org/artifact/org.apache.beam/beam-runners-spark-job-server
[3] https://search.maven.org/artifact/org.apache.beam/beam-runners-spark-3
[4] 
https://search.maven.org/artifact/org.apache.beam/beam-runners-spark-3-job-server

> 
> On Thu, Mar 31, 2022 at 8:52 AM Alexey Romanenko
>  wrote:
>> 
>> Hi everyone,
>> 
>> For the moment, Beam Spark Runner supports two versions of Spark - 2.x and 
>> 3.x.
>> 
>> Taking into account the several things that:
>> - almost all cloud providers already mostly moved to Spark 3.x as a main 
>> supported version;
>> - the latest Spark 2.x release (Spark 2.4.8, maintenance release) was done 
>> almost a year ago;
>> - Spark 3 is considered as a mainstream Spark version for development and 
>> bug fixing;
>> - better to avoid the burden of maintenance (there are some 
>> incompatibilities between Spark 2 and 3) of two versions;
>> 
>> I’d suggest to stop support Spark 2 for the Spark Runner in the one of the 
>> next Beam releases.
>> 
>> What are your thoughts on this? Are there any principal objections or 
>> reasons for not doing this that I probably missed?
>> 
>> —
>> Alexey
>> 
>> 



[CdapIO] CDAP connector updates and code reviews

2022-03-31 Thread Elizaveta Lomteva
Hi, community!


Our team is working on the new CdapIO connector implementation and we prepared 
initial PRs for components of the CdapIO package. We are reviewing the design 
document [1] with members of CDAP, and working towards resolving comments and 
having CDAP plugin artifacts published to Maven, so we could include them as 
dependencies (this delays finalizing some PRs like [2]).


We prepared PRs for CDAP context classes and plugin wrapper classes. While they 
depend on [2], wanted to ask if someone would be open to help with starting on 
code reviews?


PRs ready for code review:

[3] CDAP context classes for CDAP plugins 
https://github.com/apache/beam/pull/17104

[https://opengraph.githubassets.com/9260a94844067566a77c159de11543f854fc4c3b1a735395873f5460c9beaeb2/apache/beam/pull/17104]

[BEAM-14081] [CdapIO] Add context classes for CDAP plugins by Krasavinigor · 
Pull Request #17104 · apache/beam · 
GitHub
github.com
WARNING! This PR depends on PR #17051 Changes: Implements classes for 
BatchSourceContext, BatchSinkContext and StreamingSourceContext interfaces. 
Implements BatchContextImpl as part of has-a references for classes. Implements 
tests. We move all common logic to BatchContextImpl which implements the CDAP 
interface BatchContext. Each class extends BatchContextImpl and can override it 
if needed and contains its own logic. After creating the plugin config and an 
instance of [PluginName]BatchSink/BatchSource/StreamingSource we pass our 
context through the prepareRun method for in instance. Thank you for your 
contribution! Follow this checklist to help us incorporate your contribution 
quickly and easily: Choose reviewer(s) and mention them in a comment (R: 
@username). Format the pull request title like [BEAM-XXX] Fixes bug in 
ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue. Update



[4] CDAP plugin wrapper classes https://github.com/apache/beam/pull/17150

[https://opengraph.githubassets.com/d3edcbc6791ade4994b3cfab1d0b97494257c961fa514a97f0491537311ba7b1/apache/beam/pull/17150]

[BEAM-14053] [CdapIO] Add wrapper class for CDAP plugin by ktttnv · Pull 
Request #17150 · apache/beam · GitHub
github.com
Changes: Added wrapper classes for source and sink CDAP plugins Added builder 
for plugin wrapper class Added possibility to provide Hadoop configuration 
Warning: This PR contains changes from PR "[BEAM-14048] [CdapIO] Add 
ConfigWrapper for building CDAP PluginConfigs" and should be merged after it 
Thank you for your contribution! Follow this checklist to help us incorporate 
your contribution quickly and easily: Choose reviewer(s) and mention them in a 
comment (R: @username). Format the pull request title like [BEAM-XXX] Fixes bug 
in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue. Update CHANGES.md with noteworthy changes. If this contribution is 
large, please file an Apache Individual Contributor License Agreement. See the 
Contributor Guide for more tips on how to make review process smoother. To 
check the build health, please visit https://github.com/apache/beam




We would appreciate it very much if you could review ready PRs and leave 
comments.


P.S. CommunityMetrics PR check is finishing with an error every time. We don't 
clearly understand the issue with this check. Any guidance on that would be 
helpful?


Thank you for your attention to it,

Elizaveta


[1] TDD 
https://docs.google.com/document/d/1T-bhd0Qk7DBePIfgHEPagYiA1oLP4z5kYEd0S1SOGxQ/edit?usp=sharing

[2] CDAP plugin's configuration wrapper 
https://github.com/apache/beam/pull/17051, 
 
https://issues.apache.org/jira/browse/BEAM-14048

[https://opengraph.githubassets.com/2568e6c2ec87aea825c75e3f7adf9808ee50f571bd44d034da842d1663066153/apache/beam/pull/17051]

[BEAM-14048] [CdapIO] Add ConfigWrapper for building CDAP PluginConfigs by 
Amar3tto · Pull Request #17051 · apache/beam · 
GitHub
github.com
CDAP plugins use their own configuration classes (ex. SalesforceSourceConfig, 
SalesforceStreamingSourceConfig) to set the required parameters to prepare and 
run pipeline in Source or Sink classes. All these config classes are extended 
from the base io.cdap.cdap.api.plugin.PluginConfig class. As a first step in 
integration of CDAP plugins and Apache Beam, there should be implemented a 
mechanism to create an instance of the specific PluginConfig implementation. 
For this purpose, suggested solution provides a ConfigWrapper class. Apache 
Beam user creates an instance of this class by passing the

Re: [PROPOSAL] Stop Spark2 support in Spark Runner

2022-03-31 Thread Jean-Baptiste Onofré
+1 for me to drop Spark 2.x support.

Users who want to still use Spark 2.x can be back on a previous Beam release.

Regards
JB

On Thu, Mar 31, 2022 at 5:51 PM Alexey Romanenko
 wrote:
>
> Hi everyone,
>
> For the moment, Beam Spark Runner supports two versions of Spark - 2.x and 
> 3.x.
>
> Taking into account the several things that:
> - almost all cloud providers already mostly moved to Spark 3.x as a main 
> supported version;
> - the latest Spark 2.x release (Spark 2.4.8, maintenance release) was done 
> almost a year ago;
> - Spark 3 is considered as a mainstream Spark version for development and bug 
> fixing;
> - better to avoid the burden of maintenance (there are some incompatibilities 
> between Spark 2 and 3) of two versions;
>
> I’d suggest to stop support Spark 2 for the Spark Runner in the one of the 
> next Beam releases.
>
> What are your thoughts on this? Are there any principal objections or reasons 
> for not doing this that I probably missed?
>
> —
> Alexey
>
>


[PROPOSAL] Stop Spark2 support in Spark Runner

2022-03-31 Thread Alexey Romanenko
Hi everyone,

For the moment, Beam Spark Runner supports two versions of Spark - 2.x and 3.x. 

Taking into account the several things that:
- almost all cloud providers already mostly moved to Spark 3.x as a main 
supported version;
- the latest Spark 2.x release (Spark 2.4.8, maintenance release) was done 
almost a year ago;
- Spark 3 is considered as a mainstream Spark version for development and bug 
fixing;
- better to avoid the burden of maintenance (there are some incompatibilities 
between Spark 2 and 3) of two versions; 

I’d suggest to stop support Spark 2 for the Spark Runner in the one of the next 
Beam releases. 

What are your thoughts on this? Are there any principal objections or reasons 
for not doing this that I probably missed?

—
Alexey