Re: Beam job details not available on Spark History Server

2022-02-24 Thread Ismaël Mejía
Hello Jozef, this change was not introduced in the PR you referenced;
that PR was just a refactor.
The conflicting change was added in [1] via [2], starting in Beam 2.29.0.

It is not clear to me why this was done, but maybe Kyle Weaver or
someone else has better context.

Let's continue the discussion on the issue. To me the right solution
is to remove the extra event logging listener, but I might be missing
something. In any case, the runner does not seem to be the best place
to deal with this logic; it seems error-prone.

[1] https://github.com/apache/beam/pull/13743/
[2] https://github.com/apache/beam/commit/291ced166af

On Wed, Feb 23, 2022 at 3:41 PM Jozef Vilcek  wrote:
>
> I would like to discuss a problem I am facing upgrading Beam 2.24.0 -> 2.33.0.
>
> Running Beam batch jobs on SparkRunner with Spark 2.4.4 stopped showing me 
> job details on Spark History Server. The problem is that there are 2 event 
> logging listeners running and they step on each other. More details in [1]. 
> One is run by Spark itself; the other is started by Beam, which was added by 
> PR [2].
>
> My first question is towards understanding why Spark's event logging 
> listener is started manually within Beam next to the one started by the 
> Spark context internally?
>
>
> [1] https://issues.apache.org/jira/browse/BEAM-13981
> [2] https://github.com/apache/beam/pull/14409
>
>


Re: [Question] MacOS self-hosted servers - BEAM-12812

2022-02-24 Thread Daniela Martín
Hi,

Thank you very much for your comments and suggestions; we totally agree with
you both. We will discuss it with the rest of the team and let you know the
resolution.

Thanks.


Regards,

On Wed, Feb 23, 2022 at 8:05 PM Ahmet Altay  wrote:

> Hi Daniela,
>
> My suggestion would be to rely on GitHub-provided runners and avoid
> self-hosted runners for macOS builds. We would like to use built-in support
> on these platforms (GH, GCP, etc.) and not build our own systems because of
> the complications you are mentioning. I do not think adding AWS-based
> self-hosted runners would be worth the complexity just for this either.
>
> > Also, we would like to know if there is any information regarding Google
> Cloud Platform having a macOS image anytime soon for running it in VM or
> containers, as we think this may be the best approach for this task.
>
> People working at Google won't be able to answer this question. There is
> no public information about this. I agree that it would be the best
> approach but it is not clear if/when it will be available. I would not
> recommend waiting for it.
>
> Ahmet
>
>
> On Wed, Feb 23, 2022 at 5:46 PM Danny McCormick 
> wrote:
>
>> Unfortunately, Apple is pretty hardcore about licensing such that AWS,
>> Mac Stadium, or buying/hosting dedicated macs are pretty much the only good
>> options AFAIK. That was a pain in the butt for the multiple CI systems I've
>> worked on in the past.
>>
>> > Is this approach completely allowed according to Apple’s license?
>>
>> Almost definitely not - Apple's OS licensing requires every OS to be run
>> on "Apple-branded" hardware - e.g. from the Catalina license (
>> https://www.apple.com/legal/sla/docs/macOSCatalina.pdf)
>>
>> "The grants set forth in this License do not permit you to, and you
>> agree not to, install, use or run the Apple Software on any
>> non-Apple-branded computer, or to enable others to do so."
>>
>> The same presumably applies to the Hackintosh approach.
>>
>> Disclaimer - I'm not a lawyer, but I have had lawyers say my team
>> couldn't do something like this in the past 
>>
>> Thanks,
>> Danny
>>
>> On Wed, Feb 23, 2022 at 5:44 PM Daniela Martín <
>> daniela.mar...@wizeline.com> wrote:
>>
>>> Hi everyone,
>>>
>>> We are currently working on *BEAM-12812 Run GitHub Actions on GCP
>>> self-hosted workers* [1] task, and we would like to know your thoughts
>>> regarding the Mac OS runners.
>>>
>>> Some context of the task
>>>
>>> The current GitHub Actions workflows are being tested on multiple
>>> operating systems, such as Ubuntu, Windows and macOS. The way to migrate
>>> these runners from GitHub-hosted to GCP is by implementing self-hosted
>>> runners, so we have started implementing them in both Ubuntu and Windows
>>> environments, using Google Kubernetes Engine and Google Compute Engine
>>> VM instances, respectively.
>>>
>>> Findings
>>>
>>> In addition, we are researching the best way to implement the macOS
>>> self-hosted runners, and we have identified the following approaches:
>>>
>>> - Cloud Virtual Machines Support
>>> - Mac OS X in Docker
>>> - Hackintosh
>>>
>>>
>>> Cloud VM Support
>>>
>>> We have found that there are other Cloud Providers, such as AWS [2],
>>> that allow us to host Mac OS instances in our own dedicated hosts using
>>> official Apple hardware. However, we don’t have any Mac OS image available
>>> in Google Cloud Platform [3] yet.
>>> Mac OS X in Docker
>>>
>>> A Docker image, docker-osx [4], is available on Docker Hub for running
>>> Mac OS X in a Docker container.
>>>
>>> Pros:
>>>
>>> - macOS Monterey VM on Linux
>>> - Near-native performance
>>> - Multiple versions of macOS: High Sierra, Mojave, Catalina, Big Sur
>>>   and Monterey
>>> - Multiple kinds of images depending on the use case
>>> - Runs on top of QEMU + KVM
>>> - Supports Kubernetes
>>>
>>> Cons:
>>>
>>> - Is this approach completely allowed according to Apple's license?
>>> - Unverified Docker Hub publisher
>>> - Hardware virtualization must be enabled in BIOS
>>> - Approx. 20 GB disk space required for minimum installation
>>>
>>>
>>> Hackintosh
>>>
>>> A way to get macOS running on hardware that is not authorized by Apple.
>>> The creation and configuration of the equipment (using GNU/Linux as a base)
>>> can become very complicated, resulting in a malfunctioning OS if the
>>> required settings are not properly implemented or the hardware is not
>>> suitable.
>>>
>>>
>>> In conclusion, we found that there are some ways to run macOS in
>>> self-hosted runners; however, this could conflict with Apple's terms and
>>> licenses and should be investigated in depth before any implementation.
>>>
>>> The question here would be: does anyone know of another approach that
>>> would let us run macOS in the cloud while following Apple's licenses?
>>>
>>> Also, we would 

Re: [Question] Dataproc 1.5 - Flink version conflict

2022-02-24 Thread Andoni Guzman Becerra
Yes, I tried with Dataproc 2.0 and Flink 1.12. Cluster creation was fine,
but the Groovy test fails at startup. This is the error:

> Task :sdks:python:apache_beam:testing:load_tests:run
16:44:49  INFO:apache_beam.testing.load_tests.load_test_metrics_utils:Missing InfluxDB options. Metrics will not be published to InfluxDB
16:44:50  WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
16:44:50  INFO:root:Default Python SDK image for environment is apache/beam_python3.7_sdk:2.38.0.dev
16:44:50  INFO:root:Using provided Python SDK container image: gcr.io/apache-beam-testing/beam_portability/beam_python3.7_sdk:latest
16:44:50  INFO:root:Python SDK container image set to "gcr.io/apache-beam-testing/beam_portability/beam_python3.7_sdk:latest" for Docker environment
16:44:50  WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--fanout=4', '--top_count=20', '--use_stateful_load_generator']
16:44:51  INFO:apache_beam.runners.portability.portable_runner:Job state changed to STOPPED
16:44:51  INFO:apache_beam.runners.portability.portable_runner:Job state changed to STARTING
16:44:51  INFO:apache_beam.runners.portability.portable_runner:Job state changed to RUNNING
16:45:14  ERROR:root:java.lang.NoClassDefFoundError: Could not initialize class org.apache.beam.runners.core.construction.SerializablePipelineOptions
16:45:15  INFO:apache_beam.runners.portability.portable_runner:Job state changed to FAILED
16:45:15  Traceback (most recent call last):
16:45:15    File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
16:45:15      "__main__", mod_spec)
16:45:15    File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
16:45:15      exec(code, run_globals)
16:45:15    File "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_Combine_Flink_Streaming_PR/src/sdks/python/apache_beam/testing/load_tests/combine_test.py", line 129, in <module>
16:45:15      CombineTest().run()
16:45:15    File "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_Combine_Flink_Streaming_PR/src/sdks/python/apache_beam/testing/load_tests/load_test.py", line 151, in run
16:45:15      self.result.wait_until_finish(duration=self.timeout_ms)
16:45:15    File "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_Combine_Flink_Streaming_PR/src/sdks/python/apache_beam/runners/portability/portable_runner.py", line 600, in wait_until_finish
16:45:15      raise self._runtime_exception
16:45:15  RuntimeError: Pipeline load-tests-python-flink-streaming-combine-4-0223222352_89cc297d-d8b8-45cc-8ebc-97f5a50f43a8 failed in state FAILED: java.lang.NoClassDefFoundError: Could not initialize class org.apache.beam.runners.core.construction.SerializablePipelineOptions
16:45:15
16:45:15  > Task :sdks:python:apache_beam:testing:load_tests:run FAILED



I'm trying to find the reason for this error. Could it be the version? Some
dependency? At this moment I haven't found much info about it.
This could help:
https://stackoverflow.com/questions/62438382/run-beam-pipeline-with-flink-yarn-session-on-emr

Any help would be appreciated, thanks a lot.

On Mon, Feb 14, 2022 at 2:42 PM Kyle Weaver  wrote:

> Can we use Dataproc 2.0, which supports Flink 1.12? (Since Flink 1.12 is
> still supported by Beam)
>
> On Mon, Feb 14, 2022 at 11:20 AM Andoni Guzman Becerra <
> andoni.guz...@wizeline.com> wrote:
>
>> Hi All, I'm working on re-enabling some tests like
>> LoadTests_Combine_Flink_Python.groovy and fixing some VMs leaked by those
>> tests. https://issues.apache.org/jira/browse/BEAM-12898
>> The version of Dataproc used before was 1.2 and now it's 1.5.
>> The problem is that the Dataproc 1.5 Flink version is 1.9, while we
>> actually use Flink 1.13, causing a mismatch and errors running the tests.
>> In Dataproc 1.2 an init script was passed with all the info related to the
>> Flink version, but now the optional components flag only tells Dataproc
>> which components to install.
>>
>> This was the way to create a cluster in dataproc 1.2
>>
>>  gcloud dataproc clusters create $CLUSTER_NAME --region=global
>> --num-workers=$num_dataproc_workers --initialization-actions
>> $DOCKER_INIT,$BEAM_INIT,$FLINK_INIT --metadata "${metadata}",
>> --image-version=$image_version --zone=$GCLOUD_ZONE --quiet
>>
>> And this is the way to do it in dataproc 1.5
>>
>> gcloud dataproc clusters create $CLUSTER_NAME --region=global
>> --num-workers=$num_dataproc_workers --metadata "${metadata}",
>> --image-version=$image_version --zone=$GCLOUD_ZONE
>> --optional-components=FLINK,DOCKER --quiet
>>
>> Is there a way to force the Flink version in Dataproc? I tried to use
>> the FLINK_INIT initialization action, but it didn't work.
>>
>> Any help would be appreciated.
>>
>> Thank you!
>>
>> Andoni Guzman | WIZELINE
>>
>> Software Engineer II
>>
>> andoni.guz...@wizeline.com
>>
>> Amado Nervo 2200, Esfera P6, Col. Ciudad del Sol, 45050 Zapopan, Jal.
>>
>>
>>
>>
>>
>>
>>
>>

Support for beam-plugins in Python Runner v2

2022-02-24 Thread Rahul Iyer
Good Morning/Afternoon/Evening folks,

The current support for beam-plugins is experimental, and we would like to
have it as a first-class member of the Beam library for Python Runner v2.
This helps us load plugins into the runtime before starting the SdkHarness.
https://github.com/apache/beam/pull/16920 is a PR I created for this
purpose. I wanted to gather some thoughts around the approach here and have
it standardized. The current implementation of beam plugins allows users to
extend a class from BeamPlugin, and it gets automatically populated in the
--beam_plugin PipelineOption, e.g. FileSystem.
This creates the pipeline option as:

--beam_plugin=[
  'apache_beam.io.aws.s3filesystem.S3FileSystem',
  'apache_beam.io.filesystem.FileSystem',
  'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
  'apache_beam.io.localfilesystem.LocalFileSystem',
  'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
  'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
]
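The automatic population described above can be sketched with plain Python
subclass discovery. This is a simplified, illustrative mimic of the idea (not
the actual apache_beam.utils.plugin code); the class and plugin names below
are hypothetical:

```python
# Simplified mimic of BeamPlugin subclass discovery (illustrative only,
# not the actual apache_beam implementation): every class that extends
# the base is found via __subclasses__() and reported by its fully
# qualified path, which is what ends up in --beam_plugin.
class BeamPlugin:
    @classmethod
    def get_all_subclass_paths(cls):
        def walk(klass):
            for sub in klass.__subclasses__():
                yield f"{sub.__module__}.{sub.__name__}"
                yield from walk(sub)  # plugins may subclass plugins
        return sorted(walk(cls))

class MyFileSystem(BeamPlugin):  # hypothetical user plugin
    pass

class MyS3FileSystem(MyFileSystem):  # discovered transitively too
    pass

# Prints the fully qualified paths of both plugin classes.
print(BeamPlugin.get_all_subclass_paths())
```

This is why merely importing a module that defines a BeamPlugin subclass is
enough for it to show up in the option list.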

Another way is to provide a module via the --beam_plugin PipelineOption,
e.g.:

--beam_plugin='twitter.beam.rule_the_world'

The current implementation in the PR supports both these approaches, but I
would like to settle on a standardized way forward and have it documented.
Would love to hear your thoughts about this.

Thanks & Regards,
Rahul Iyer


Flaky test issue report (51)

2022-02-24 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-13952: Dataflow streaming tests 
failing new AfterSynchronizedProcessingTime test (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13859: Test flake: 
test_split_half_sdf (created 2022-02-09)
https://issues.apache.org/jira/browse/BEAM-13858: Failure of 
:sdks:go:examples:wordCount in check "Mac run local environment shell script" 
(created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13850: 
beam_PostCommit_Python_Examples_Dataflow failing (created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13822: GBK and CoGBK streaming 
Java load tests failing (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13810: Flaky tests: Gradle build 
daemon disappeared unexpectedly (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13797: Flakes: Failed to load 
cache entry (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13783: 
apache_beam.transforms.combinefn_lifecycle_test.LocalCombineFnLifecycleTest.test_combine
 is flaky (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13741: 
:sdks:java:extensions:sql:hcatalog:compileJava failing in 
beam_Release_NightlySnapshot  (created 2022-01-25)
https://issues.apache.org/jira/browse/BEAM-13708: flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored (created 2022-01-20)
https://issues.apache.org/jira/browse/BEAM-13693: 
beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming timing out at 9 hours 
(created 2022-01-19)
https://issues.apache.org/jira/browse/BEAM-13575: Flink 
testParDoRequiresStableInput flaky (created 2021-12-28)
https://issues.apache.org/jira/browse/BEAM-13519: Java precommit flaky 
(timing out) (created 2021-12-22)
https://issues.apache.org/jira/browse/BEAM-13500: NPE in Flink Portable 
ValidatesRunner streaming suite (created 2021-12-21)
https://issues.apache.org/jira/browse/BEAM-13453: Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use 
(created 2021-12-13)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13367: 
[beam_PostCommit_Python36] [ 
apache_beam.io.gcp.experimental.spannerio_read_it_test] Failure summary 
(created 2021-12-01)
https://issues.apache.org/jira/browse/BEAM-13312: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
 is flaky in Java Spark ValidatesRunner suite  (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13311: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite. (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13234: Flake in 
StreamingWordCountIT.test_streaming_wordcount_it (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13025: pubsublite.ReadWriteIT 
flaky in beam_PostCommit_Java_DataflowV2   (created 2021-10-08)
https://issues.apache.org/jira/browse/BEAM-12928: beam_PostCommit_Python36 
- CrossLanguageSpannerIOTest - flakey failing (created 2021-09-21)
https://issues.apache.org/jira/browse/BEAM-12859: 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12858: 
org.apache.beam.sdk.io.gcp.datastore.RampupThrottlingFnTest.testRampupThrottler 
is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12809: 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky (created 2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12794: 
PortableRunnerTestWithExternalEnv.test_pardo_timers flaky (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12793: 
beam_PostRelease_NightlySnapshot failed (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12673: 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey (created 2021-07-28)
https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in PipelineOptionsTest.test_display_data (created 2021-06-18)
https://issues.apache.org/jira/browse/BEAM-12322: Python precommit flaky: 
Failed to read inputs in the data plane (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12320: 

P1 issues report (74)

2022-02-24 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-13995: Apache beam is having 
vulnerable dependencies - Tensorflow, httplib2, pandas and numpy (created 
2022-02-24)
https://issues.apache.org/jira/browse/BEAM-13990: BigQueryIO cannot write 
to DATE and TIMESTAMP columns when using Storage Write API  (created 2022-02-23)
https://issues.apache.org/jira/browse/BEAM-13980: S3 Client broken: missing 
get_object_metadata() (created 2022-02-22)
https://issues.apache.org/jira/browse/BEAM-13959: Unable to write to 
BigQuery tables with column named 'f' (created 2022-02-16)
https://issues.apache.org/jira/browse/BEAM-13952: Dataflow streaming tests 
failing new AfterSynchronizedProcessingTime test (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13950: PVR_Spark2_Streaming 
perma-red (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13920: Beam x-lang Dataflow 
tests failing due to _InactiveRpcError (created 2022-02-10)
https://issues.apache.org/jira/browse/BEAM-13858: Failure of 
:sdks:go:examples:wordCount in check "Mac run local environment shell script" 
(created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13850: 
beam_PostCommit_Python_Examples_Dataflow failing (created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13830: XVR Direct/Spark/Flink 
tests are timing out (created 2022-02-04)
https://issues.apache.org/jira/browse/BEAM-13822: GBK and CoGBK streaming 
Java load tests failing (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13809: beam_PostCommit_XVR_Flink 
flaky: Connection refused (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13805: Simplify version override 
for Dev versions of the Go SDK. (created 2022-02-02)
https://issues.apache.org/jira/browse/BEAM-13798: Upgrade Kubernetes 
Clusters (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13769: 
beam_PreCommit_Python_Cron failing on test_create_uses_coder_for_pickling 
(created 2022-01-28)
https://issues.apache.org/jira/browse/BEAM-13763: Rotate credentials for 
'io-datastores' Kubernetes cluster (created 2022-01-28)
https://issues.apache.org/jira/browse/BEAM-13741: 
:sdks:java:extensions:sql:hcatalog:compileJava failing in 
beam_Release_NightlySnapshot  (created 2022-01-25)
https://issues.apache.org/jira/browse/BEAM-13715: Kafka commit offset drop 
data on failure for runners that have non-checkpointing shuffle (created 
2022-01-21)
https://issues.apache.org/jira/browse/BEAM-13693: 
beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming timing out at 9 hours 
(created 2022-01-19)
https://issues.apache.org/jira/browse/BEAM-13582: Beam website precommit 
mentions broken links, but passes. (created 2021-12-30)
https://issues.apache.org/jira/browse/BEAM-13579: Cannot run 
python_xlang_kafka_taxi_dataflow validation script on 2.35.0 (created 
2021-12-29)
https://issues.apache.org/jira/browse/BEAM-13487: WriteToBigQuery Dynamic 
table destinations returns wrong tableId (created 2021-12-17)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13237: 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2 (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13164: Race between member 
variable being accessed due to leaking uninitialized state via 
OutboundObserverFactory (created 2021-11-01)
https://issues.apache.org/jira/browse/BEAM-13132: WriteToBigQuery submits a 
duplicate BQ load job if a 503 error code is returned from googleapi (created 
2021-10-27)
https://issues.apache.org/jira/browse/BEAM-13087: 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible 
(created 2021-10-20)
https://issues.apache.org/jira/browse/BEAM-13078: Python DirectRunner does 
not emit data at GC time (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13076: Python AfterAny, AfterAll 
do not follow spec (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13010: Delete orphaned files 
(created 2021-10-06)
https://issues.apache.org/jira/browse/BEAM-12995: Consumer group with 
random prefix (created 2021-10-04)
https://issues.apache.org/jira/browse/BEAM-12959: Dataflow error in 
CombinePerKey operation (created 2021-09-26)

Re: [RFC][Design] Bundle Finalization in the Go Sdk

2022-02-24 Thread Robert Burke
Thanks Danny!

I did a first pass of comments, but I do like the approach. It needs some
justification for why this path should be chosen over alternative
implementations.


On Thu, Feb 24, 2022, 7:42 AM Danny McCormick 
wrote:

> Hey everyone, I put together a design doc for adding Bundle Finalization
> in the Go Sdk and would appreciate any thoughts you have!
>
> Bundle finalization enables a DoFn to perform side effects after a runner
> has acknowledged that it has durably persisted the output. Right now, Java
> and Python support bundle finalization by allowing a user to register a
> callback function which is invoked when the runner acknowledges that it has
> persisted all output, but Go does not have any such support. The doc proposes
> an approach for adding bundle finalization in the Go Sdk by allowing users
> to register callbacks during their normal DoFn execution. It generally
> mirrors the behavior adopted by other languages and allows users to
> dynamically specify their callbacks at runtime.
>
> Please share any feedback or support here:
> https://docs.google.com/document/d/1dLylt36oFhsWfyBaqPayYXqYHCICNrSZ6jmr51eqZ4k/edit?usp=sharing
>
> Thanks,
> Danny
>
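The callback-registration contract described above can be sketched
generically. This is plain Python with invented names, not the actual Go or
Python SDK API; it only illustrates the ordering guarantee (callbacks fire
after the runner's durability acknowledgement):

```python
# Illustrative sketch of the bundle-finalization contract (names are
# invented, not the real SDK API): a DoFn registers callbacks while
# processing a bundle, and the harness invokes them only after the
# runner acknowledges the bundle's output as durably persisted.
class BundleFinalizer:
    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        # Called from user code during processing; defers the side effect.
        self._callbacks.append(callback)

    def on_bundle_acknowledged(self):
        # Called by the SDK harness once the runner's ack arrives.
        for cb in self._callbacks:
            cb()
        self._callbacks.clear()

def process_bundle(elements, finalizer):
    # Stand-in for a DoFn that wants an after-persist side effect,
    # e.g. committing offsets back to an external source.
    out = [e * 2 for e in elements]
    finalizer.register(lambda: log.append("acked %d elements" % len(out)))
    return out

log = []
fin = BundleFinalizer()
result = process_bundle([1, 2, 3], fin)
assert log == []            # nothing happens before the runner's ack
fin.on_bundle_acknowledged()
assert log == ["acked 3 elements"]
```

The design question in the doc is essentially how users get a handle on the
`register` call during normal DoFn execution.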


Re: [Proposal] => JMSIO dynamic topic publishing

2022-02-24 Thread Jean-Baptiste Onofré
All JMS related properties can be null (@Nullable).

Good point about breaking change.

The part of your proposal I'm not a big fan of is this getDynamic()
method. Maybe we can mimic what I did in JdbcIO with a fn we can
inject to define the destination.

But OK, I would be happy to review a PR (I can comment on the PR there).

Regards
JB
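The "inject a fn to define the destination" idea (mimicking JdbcIO) can be
sketched generically. JmsIO itself is Java, so this is plain Python with
illustrative names only, not its API:

```python
# Illustrative-only sketch (plain Python; JmsIO is Java and this is not
# its API) of the fn-injection pattern: the write step is configured
# with a user function that derives the destination topic from each
# element, instead of a getDynamic() hook on the record itself.
def write_dynamic(elements, topic_fn, publish):
    """Publish each element to the topic chosen by topic_fn.

    topic_fn: element -> topic name (user-supplied, evaluated per element)
    publish:  (topic, element) -> None (stands in for a JMS producer)
    """
    for element in elements:
        publish(topic_fn(element), element)

# Usage: route enriched events to per-tenant topics decided late in the
# pipeline, without carrying a full record type around.
sent = []
write_dynamic(
    [{"tenant": "a", "v": 1}, {"tenant": "b", "v": 2}],
    topic_fn=lambda e: "events." + e["tenant"],
    publish=lambda topic, e: sent.append((topic, e["v"])),
)
# sent == [("events.a", 1), ("events.b", 2)]
```

The trade-off under discussion is whether this per-element function lives on
the write transform's configuration or as a field on the record type.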

On Thu, Feb 24, 2022 at 5:16 PM BALLADA Vincent
 wrote:
>
> Hi Jean-Baptiste,
>
>
>
> Thank you for your feedback, it is interesting.
>
>
>
> In your proposal, the write transform would take a PCollection<JmsRecord>.
>
> JmsRecord has a lot of properties that are related to the read operation 
> (jmsRedelivered, correlationId, etc.).
>
> As a result, I think it won't be easy to create it with all relevant 
> properties at a late stage of the pipeline, before the write operation.
>
> It would also introduce a breaking change, as the current implementation 
> takes a PCollection as input.
>
> Plus, the new standard for IO implementations seems to be parameterized 
> read/write (see the annotated document).
>
>
>
> Thanks,
>
>
>
> Regards
>
>
>
> Vincent BALLADA
>
>
>
> De : Jean-Baptiste Onofré 
> Date : jeudi, 24 février 2022 à 16:32
> À : dev@beam.apache.org 
> Objet : Re: [Proposal] => JMSIO dynamic topic publishing
>
> [EXTERNAL EMAIL : Be CAUTIOUS, especially with links and attachments]
>
> Hi Vincent,
>
> It's Jean-Baptiste, not Jean-François, but it works as well ;)
>
> I got your point; however, I think we can achieve the same with
> JmsRecord. You can always use a DoFn at any part of your pipeline
> (after reading from Kafka, Redis, whatever) where you can create the
> JmsRecord. If JmsRecord contains a destination property, it can
> be populated dynamically with the JmsDestination.
>
> So basically, it means that we can add JmsDestinationType on the
> JmsRecord and use it in the sink.
>
> Something like:
>
> pipeline
>   .apply(...) // returns PCollection<JmsRecord>; the JmsRecord can be
>               // created/populated with destination and destination type
>   .apply(JmsIO.write().withConnectionFactory(myConnectionFactory))
>
> I think it's less invasive.
>
> It's basically the same that you propose, but reusing JmsRecord and
> avoiding getDynamic() hook.
>
> Again, I'm not against with your proposal, but we can have something
> more concise.
>
> Just my €0.01 ;)
>
> Regards
> JB
>
> On Thu, Feb 24, 2022 at 3:42 PM BALLADA Vincent
>  wrote:
> >
> > Hi Jean François
> >
> >
> >
> > Pleased to hear from the author of JmsIO.
> >
> > Many thanks for your suggestion.
> >
> > JmsRecord is used at read time; in most use cases we will use a mapper to 
> > provide an object that will be used in several transforms in the pipeline.
> >
> > The destination won't necessarily be included in the read JmsRecord, as for 
> > instance in many pipelines we do data enrichment (from Redis, a database, 
> > etc.), and the dynamic part of the topic could be extracted at any stage of 
> > the pipeline.
> >
> > So for me it would be more flexible not to be tied to JmsRecord.
> >
> >
> >
> > Regards
> >
> >
> >
> > Vincent BALLADA
> >
> >
> >
> > De : Jean-Baptiste Onofré 
> > Date : samedi, 19 février 2022 à 06:40
> > À : dev@beam.apache.org 
> > Objet : Re: [Proposal] => JMSIO dynamic topic publishing
> >
> >
> > Hi Vincent,
> >
> > It looks interesting. Another possible approach is to have some
> > implicit (instead of being explicit) but defining the destination on
> > the JmsRecord. If the JmsRecord contains the destination (that could
> > be "dynamic"), we use it, overriding the destination provided on the
> > IO configuration.
> > Thoughts ?
> >
> > Regards
> > JB
> >
> > NB: I'm the original author of JmsIO ;)
> >
> > On Fri, Feb 18, 2022 at 7:00 PM BALLADA Vincent
> >  wrote:
> > >
> > > Hi all
> > >
> > >
> > >
> > > Here is a proposal to implement the ability to publish on dynamic topics 
> > > with JMSIO:
> > >
> > > https://docs.google.com/document/d/1IY4_e5g1g71XvTLL4slHRyVfX7ByiwjD_de3WGsBQXg/edit?usp=sharing
> > >
> > >
> > >
> > > There is also a JIRA issue:
> > >
> > > https://issues.apache.org/jira/browse/BEAM-13608
> > >
> > >
> > >
> > > Best regards
> > >
> > >
> > >
> > > Vincent BALLADA
> > >
> > >
> > > Confidential C
> > >
> > > -- Disclaimer 
> > > This message and any attachments constitute private and confidential 
> > > correspondence intended exclusively for the recipient named above. If 
> > > you are not the intended recipient of this message, or a person able to 
> > > deliver it to them, you are hereby notified that any disclosure, 
> > > distribution or copying of this transmission is strictly prohibited. If 
> > > you have received this message in error, please inform the sender by 
> > > telephone or return this message to them, then delete it immediately 
> > > from your system.
> > >
> > > 

Re: [Proposal] => JMSIO dynamic topic publishing

2022-02-24 Thread BALLADA Vincent
Hi Jean-Baptiste,

Thank you for your feedback, it is interesting.

In your proposal, the write transform would take a PCollection<JmsRecord>.
JmsRecord has a lot of properties that are related to the read operation 
(jmsRedelivered, correlationId, etc.).
As a result, I think it won't be easy to create it with all relevant properties 
at a late stage of the pipeline, before the write operation.
It would also introduce a breaking change, as the current implementation takes a 
PCollection as input.
Plus, the new standard for IO implementations seems to be parameterized 
read/write (see the annotated document).

Thanks,

Regards

Vincent BALLADA

De : Jean-Baptiste Onofré 
Date : jeudi, 24 février 2022 à 16:32
À : dev@beam.apache.org 
Objet : Re: [Proposal] => JMSIO dynamic topic publishing

Hi Vincent,

It's Jean-Baptiste, not Jean-François, but it works as well ;)

I got your point; however, I think we can achieve the same with
JmsRecord. You can always use a DoFn at any part of your pipeline
(after reading from Kafka, Redis, whatever) where you can create the
JmsRecord. If JmsRecord contains a destination property, it can
be populated dynamically with the JmsDestination.

So basically, it means that we can add JmsDestinationType on the
JmsRecord and use it in the sink.

Something like:

pipeline
  .apply(...) // returns PCollection<JmsRecord>; the JmsRecord can be
              // created/populated with destination and destination type
  .apply(JmsIO.write().withConnectionFactory(myConnectionFactory))

I think it's less invasive.

It's basically the same that you propose, but reusing JmsRecord and
avoiding getDynamic() hook.

Again, I'm not against with your proposal, but we can have something
more concise.

Just my €0.01 ;)

Regards
JB

On Thu, Feb 24, 2022 at 3:42 PM BALLADA Vincent
 wrote:
>
> Hi Jean François
>
>
>
> Pleased to hear from the author of JmsIO.
>
> Many thanks for your suggestion.
>
> JmsRecord is used at read time; in most use cases we will use a mapper to 
> provide an object that will be used in several transforms in the pipeline.
>
> The destination won't necessarily be included in the read JmsRecord, as for 
> instance in many pipelines we do data enrichment (from Redis, a database, 
> etc.), and the dynamic part of the topic could be extracted at any stage of 
> the pipeline.
>
> So for me it would be more flexible not to be tied to JmsRecord.
>
>
>
> Regards
>
>
>
> Vincent BALLADA
>
>
>
> De : Jean-Baptiste Onofré 
> Date : samedi, 19 février 2022 à 06:40
> À : dev@beam.apache.org 
> Objet : Re: [Proposal] => JMSIO dynamic topic publishing
>
>
> Hi Vincent,
>
> It looks interesting. Another possible approach is to have some
> implicit (instead of being explicit) but defining the destination on
> the JmsRecord. If the JmsRecord contains the destination (that could
> be "dynamic"), we use it, overriding the destination provided on the
> IO configuration.
> Thoughts ?
>
> Regards
> JB
>
> NB: I'm the original author of JmsIO ;)
>
> On Fri, Feb 18, 2022 at 7:00 PM BALLADA Vincent
>  wrote:
> >
> > Hi all
> >
> >
> >
> > Here is a proposal to implement the ability to publish on dynamic topics 
> > with JMSIO:
> >
> > https://docs.google.com/document/d/1IY4_e5g1g71XvTLL4slHRyVfX7ByiwjD_de3WGsBQXg/edit?usp=sharing
> >
> >
> >
> > There is also a JIRA issue:
> >
> > https://issues.apache.org/jira/browse/BEAM-13608
> >
> >
> >
> > Best regards
> >
> >
> >
> > Vincent BALLADA
> >
> >
> > Confidential C
> >
> > -- Disclaimer 
> > Ce message ainsi que les eventuelles pieces jointes constituent une 
> > correspondance privee et confidentielle a l'attention exclusive du 
> > destinataire designe ci-dessus. Si vous n'etes pas le destinataire du 
> > present message ou une personne susceptible de pouvoir le lui delivrer, il 
> > vous est signifie que toute divulgation, distribution ou copie de cette 
> > transmission est strictement interdite. Si vous avez recu ce message par 
> > erreur, nous vous remercions d'en informer l'expediteur par telephone ou de 
> > lui retourner le present message, puis d'effacer immediatement ce message 
> > de votre systeme.
> >
> > *** This e-mail and any attachments is a confidential correspondence 
> > intended only for use of the individual or entity named above. If you are 
> > not the intended recipient or the agent responsible for delivering the 
> > message to the intended recipient, you are hereby notified that any 
> > disclosure, distribution or copying of this communication is strictly 
> > prohibited. If you have received this communication in error, please notify 
> > the sender by phone or by replying this message, and then delete this 
> > message from your system.
>
> 
>
>
> Confidential C
>
> -- Disclaimer 
> Ce message ainsi que les eventuelles pieces jointes constituent une 
> 

Re: [Proposal] => JMSIO dynamic topic publishing

2022-02-24 Thread Jean-Baptiste Onofré
Hi Vincent,

It's Jean-Baptiste, not Jean-François, but it works as well ;)

I got your point, however, I think we can achieve the same with
JmsRecord. You can always use a DoFn at any part of your pipeline
(after reading from kafka, redis, whatever), where you can create
JmsRecord. If JmsRecord contains a destination property, it can
be populated dynamically as a JmsDestination.

So basically, it means that we can add JmsDestinationType on the
JmsRecord and use it in the sink.

Something like:

pipeline
  .apply(...) // returns PCollection<JmsRecord>; the JmsRecord can be
              // created/populated with destination and destination type
  .apply(JmsIO.write().withConnectionFactory(myConnectionFactory))

I think it's less invasive.

It's basically the same that you propose, but reusing JmsRecord and
avoiding getDynamic() hook.

Again, I'm not against your proposal, but we can have something
more concise.

Just my €0.01 ;)

Regards
JB
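
JB's record-carries-the-destination idea can be sketched with simplified stand-in types. This is a minimal illustration only — the class names below are invented for the sketch and are not the real Beam JmsIO or javax.jms classes:

```java
import java.util.Optional;

// Stand-in for a JmsRecord that optionally carries its own destination.
final class SketchJmsRecord {
    final String payload;
    final String destination; // null means "use the sink's default"

    SketchJmsRecord(String payload, String destination) {
        this.payload = payload;
        this.destination = destination;
    }
}

// Stand-in for the sink side: a record-level destination, when present,
// overrides the destination configured on the IO.
final class SketchJmsSink {
    private final String defaultDestination;

    SketchJmsSink(String defaultDestination) {
        this.defaultDestination = defaultDestination;
    }

    String resolveDestination(SketchJmsRecord record) {
        return Optional.ofNullable(record.destination).orElse(defaultDestination);
    }
}
```

With this shape, any upstream DoFn can set the destination per element, and the sink needs no extra hook — which is the "less invasive" point JB makes above.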

On Thu, Feb 24, 2022 at 3:42 PM BALLADA Vincent
 wrote:
>
> Hi Jean François
>
>
>
> Pleased to hear from the author of JmsIO.
>
> Many thanks for your suggestion.
>
> JmsRecord is used at read time; in most use cases we will use a mapper to 
> provide an object that will be used in several transforms in the pipeline.
>
> The destination won’t necessarily be included in the read JmsRecord, as for 
> instance in many pipelines we do data enrichment (from redis, database, etc…) 
> and the dynamic part of the topic would be extracted at any stage of the 
> pipeline.
>
> So for me it would be more flexible not to be tied to JmsRecord.
>
>
>
> Regards
>
>
>
> Vincent BALLADA
>
>
>
> From: Jean-Baptiste Onofré 
> Date: Saturday, 19 February 2022 at 06:40
> To: dev@beam.apache.org 
> Subject: Re: [Proposal] => JMSIO dynamic topic publishing
>
> [EXTERNAL EMAIL : Be CAUTIOUS, especially with links and attachments]
>
> Hi Vincent,
>
> It looks interesting. Another possible approach is to have some
> implicit (instead of being explicit) but defining the destination on
> the JmsRecord. If the JmsRecord contains the destination (that could
> be "dynamic"), we use it, overriding the destination provided on the
> IO configuration.
> Thoughts ?
>
> Regards
> JB
>
> NB: I'm the original author of JmsIO ;)
>
> On Fri, Feb 18, 2022 at 7:00 PM BALLADA Vincent
>  wrote:
> >
> > Hi all
> >
> >
> >
> > Here is a proposal to implement the ability to publish on dynamic topics 
> > with JMSIO:
> >
> > https://docs.google.com/document/d/1IY4_e5g1g71XvTLL4slHRyVfX7ByiwjD_de3WGsBQXg/edit?usp=sharing
> >
> >
> >
> > There is also a JIRA issue:
> >
> > https://issues.apache.org/jira/browse/BEAM-13608
> >
> >
> >
> > Best regards
> >
> >
> >
> > Vincent BALLADA
> >
> >
>

Re: [Proposal] => JMSIO dynamic topic publishing

2022-02-24 Thread BALLADA Vincent
Hi Jean François

Pleased to hear from the author of JmsIO.
Many thanks for your suggestion.
JmsRecord is used at read time; in most use cases we will use a mapper to 
provide an object that will be used in several transforms in the pipeline.
The destination won’t necessarily be included in the read JmsRecord, as for 
instance in many pipelines we do data enrichment (from redis, database, etc…) 
and the dynamic part of the topic would be extracted at any stage of the 
pipeline.
So for me it would be more flexible not to be tied to JmsRecord.

Regards

Vincent BALLADA

From: Jean-Baptiste Onofré 
Date: Saturday, 19 February 2022 at 06:40
To: dev@beam.apache.org 
Subject: Re: [Proposal] => JMSIO dynamic topic publishing
[EXTERNAL EMAIL : Be CAUTIOUS, especially with links and attachments]

Hi Vincent,

It looks interesting. Another possible approach is to have some
implicit (instead of being explicit) but defining the destination on
the JmsRecord. If the JmsRecord contains the destination (that could
be "dynamic"), we use it, overriding the destination provided on the
IO configuration.
Thoughts ?

Regards
JB

NB: I'm the original author of JmsIO ;)

On Fri, Feb 18, 2022 at 7:00 PM BALLADA Vincent
 wrote:
>
> Hi all
>
>
>
> Here is a proposal to implement the ability to publish on dynamic topics with 
> JMSIO:
>
> https://docs.google.com/document/d/1IY4_e5g1g71XvTLL4slHRyVfX7ByiwjD_de3WGsBQXg/edit?usp=sharing
>
>
>
> There is also a JIRA issue:
>
> https://issues.apache.org/jira/browse/BEAM-13608
>
>
>
> Best regards
>
>
>
> Vincent BALLADA
>
>




-- Disclaimer  
Ce message ainsi que les eventuelles pieces jointes constituent une 
correspondance privee et confidentielle a l'attention exclusive du destinataire 
designe ci-dessus. Si vous n'etes pas le destinataire du present message ou une 
personne susceptible de pouvoir le lui delivrer, il vous est signifie que toute 
divulgation, distribution ou copie de cette transmission est strictement 
interdite. Si vous avez recu ce message par erreur, nous vous remercions d'en 
informer l'expediteur par telephone ou de lui retourner le present message, 
puis d'effacer immediatement ce message de votre systeme.

*** This e-mail and any attachments is a confidential correspondence intended 
only for use of the individual or entity named above. If you are not the 
intended recipient or the agent responsible for delivering the message to the 
intended recipient, you are hereby notified that any disclosure, distribution 
or copying of this communication is strictly prohibited. If you have received 
this communication in error, please notify the sender by phone or by replying 
this message, and then delete this message from your system.