So I wasn't paying attention to the message after trying to upload the Python docker image (a permissions issue), which meant that when I ran the Dataflow job it couldn't find the docker image. Digging into the Stackdriver logs showed this; the job was just stuck in a loop waiting for the image. Oops, I need to learn to read!

Rectified that, however I've realised I hadn't changed the _from_protobuf function, which is handled differently, so I will take a look at dealing with that.
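For context, the shape of the change is roughly the following. This is only an illustrative sketch rather than the SDK's actual _from_protobuf code; it assumes the protobuf-style PubsubMessage from google-cloud-pubsub (data, attributes and message_id fields), and the helper name here is made up for the example:

# Illustrative only: not the actual Beam SDK code.
from google.cloud import pubsub_v1

def _attributes_with_message_id(proto_msg):
    # Copy the protobuf map field into a plain dict so it can be extended.
    attributes = dict(proto_msg.attributes)
    # Fold the server-assigned message_id into the attributes, as the
    # 2.13 documentation for PubsubMessage.attributes suggests.
    if proto_msg.message_id:
        attributes['message_id'] = proto_msg.message_id
    return attributes

if __name__ == '__main__':
    msg = pubsub_v1.types.PubsubMessage(
        data=b'hello', attributes={'key': 'value'}, message_id='12345')
    print(_attributes_with_message_id(msg))  # {'key': 'value', 'message_id': '12345'}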
-----Original Message-----
From: Ahmet Altay <[email protected]>
Reply-To: [email protected]
To: dev <[email protected]>, Mark Liu <[email protected]>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property
Date: Fri, 02 Aug 2019 18:55:57 -0700

On Wed, Jul 31, 2019 at 4:19 AM Matthew Darwin <[email protected]> wrote:

Hi Ahmet/Udi,

There are a couple of additional tests that failed following my change: apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_strip_success and apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_attributes_success, as the added message_id in the attributes is not expected. I can fix these quite easily, but I'm not clear on what these tests are for, or what the pubsub_matcher is actually supposed to be testing, so my arbitrary fix of adding the expectation of message_id into the attributes property seems odd.

If I remember correctly, the pubsub matcher was a utility of the test framework. Its purpose is to make it easier to write tests that use pubsub as an output and verify the outputs of those tests. /cc +Mark Liu to correct me on this.

In terms of testing with the GCP Dataflow runner, I'm struggling a little. I've followed the guidance on the Python Tips page. I've built the tarball:

python setup.py sdist

Then run the following:

# Build portable worker
./gradlew :runners:google-cloud-dataflow-java:worker:build -x spotlessJava -x rat -x test
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar

# Build portable Python SDK harness and publish it to GCP
./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p sdks/python/container docker
gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest

Then, using my simple test pipeline to insert into a BigQuery table (a stripped-down sketch of the pipeline follows the command), I run the following:

python ~/github/Beam-Testing/PythonTestMessageId.py --runner DataflowRunner --project [MY-Project] --input_subscription projects/[MY-Project]/subscriptions/test-apache-beam.subscription --output_table [MY-Project]:test.dataflowrunner --temp_location gs://[MY-Project]/tmp --worker_harness_container_image gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --dataflow_worker_jar '../../runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.15.0-SNAPSHOT.jar' --debug
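For reference, the pipeline is essentially the following. This is only a stripped-down, illustrative sketch rather than the actual PythonTestMessageId.py; the subscription and table names simply mirror the placeholders in the command above:

# Stripped-down stand-in for my test pipeline (not the actual PythonTestMessageId.py):
# read from Pub/Sub with attributes, pull message_id out of the attributes map,
# and write one row per message to BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def run(argv=None):
    options = PipelineOptions(argv)
    options.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(
             subscription='projects/MY-PROJECT/subscriptions/test-apache-beam.subscription',
             with_attributes=True)
         | 'ToRow' >> beam.Map(lambda msg: {
             'message_id': msg.attributes.get('message_id', ''),
             'payload': msg.data.decode('utf-8')})
         | 'Write' >> beam.io.WriteToBigQuery(
             'MY-PROJECT:test.dataflowrunner',
             schema='message_id:STRING,payload:STRING'))

if __name__ == '__main__':
    run()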
It is hard for us to debug this without knowing what happened with the Dataflow jobs. Another way to verify this could be running one of the existing tests on your PR. You can try [1] by commenting "Run Python Dataflow ValidatesRunner" on your PR.

[1] https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PostCommit_Python_ValidatesRunner_Dataflow.groovy

The job starts ok, but when I publish messages, nothing is being read from the queue. I'm assuming this might be to do with the changes I've made, and that whilst it works with the direct runner this is not the case with the Dataflow runner; but I'm not receiving any errors in the debug stream on my console, only a series of info messages around the worker configuration. This may well be me doing stuff wrong, so apologies if I'm being thick here! Have I missed some steps in the build?

Regards

Matthew

-----Original Message-----
From: Ahmet Altay <[email protected]>
Reply-To: [email protected]
To: dev <[email protected]>, Udi Meiri <[email protected]>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property
Date: Mon, 29 Jul 2019 09:50:14 -0700

Hi Matthew,

This looks like a reasonable approach. There is a difference in how the direct runner reads from Pub/Sub compared to other runners. As you convert this to a PR, please pay attention to the difference and add tests for both cases.

On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin <[email protected]> wrote:

Hi All,

This is my first attempt at a change for Beam, for https://issues.apache.org/jira/browse/BEAM-7819. This parses the message_id when reading from the PubSub protobuf and adds it to the attributes property, as suggested by the existing documentation - https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes

My code is currently checked into my fork: https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id

Prior to creating a pull request, I just wanted to check that the approach is sensible, and then to go through the pull request process to ensure it goes smoothly. I do wonder whether, long term, despite the documentation indicating this would be part of the attributes property, it might be more sensible to fully mimic the protobuf for the PubSub message (a rough sketch of what I mean is at the bottom of this mail).

I agree that making it similar to the protobuf sounds like a more sensible approach. Take a look at how it is handled in Java. There is also value in consistency across languages. The final decision could be made in the PR.

Ahmet

/cc +Udi Meiri

Kind regards

Matthew
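For illustration only, here is a rough, hypothetical sketch of what "fully mimicking the protobuf" could look like on the Python side, i.e. exposing message_id (and publish_time) as first-class fields rather than folding message_id into attributes. This is not the SDK's actual PubsubMessage class, just the shape of the idea:

# Hypothetical message type that mirrors the Pub/Sub protobuf fields;
# not the actual apache_beam.io.gcp.pubsub.PubsubMessage implementation.
from google.cloud import pubsub_v1

class PubsubMessageLike(object):

    def __init__(self, data, attributes, message_id=None, publish_time=None):
        self.data = data                  # payload bytes
        self.attributes = attributes      # plain dict of string -> string
        self.message_id = message_id      # server-assigned id, not an attribute
        self.publish_time = publish_time  # server-assigned publish timestamp

    @staticmethod
    def from_proto(proto_msg):
        # proto_msg is assumed to be a google.cloud.pubsub_v1.types.PubsubMessage.
        return PubsubMessageLike(
            data=proto_msg.data,
            attributes=dict(proto_msg.attributes),
            message_id=proto_msg.message_id,
            publish_time=proto_msg.publish_time)

if __name__ == '__main__':
    proto = pubsub_v1.types.PubsubMessage(
        data=b'hello', attributes={'key': 'value'}, message_id='12345')
    msg = PubsubMessageLike.from_proto(proto)
    print(msg.message_id, msg.attributes)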
