So I wasn't paying attention to the message after trying to upload the Python docker image (a permissions issue), which meant that when I ran the Dataflow job it couldn't find the docker image. Digging into the Stackdriver logs showed this; the job was just stuck in a loop waiting for the image. Oops, I need to learn to read!

Rectified that, however I've realised I hadn't changed the _from_protobuf function, which is handled differently, so I will take a look at dealing with that.
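For context, the shape of the change is roughly the following. This is only an illustrative sketch rather than the SDK's actual _from_protobuf code; it assumes the protobuf-style PubsubMessage from google-cloud-pubsub (data, attributes and message_id fields), and the helper name here is made up for the example:

# Illustrative only: not the actual Beam SDK code.
from google.cloud import pubsub_v1

def _attributes_with_message_id(proto_msg):
    # Copy the protobuf map field into a plain dict so it can be extended.
    attributes = dict(proto_msg.attributes)
    # Fold the server-assigned message_id into the attributes, as the
    # 2.13 documentation for PubsubMessage.attributes suggests.
    if proto_msg.message_id:
        attributes['message_id'] = proto_msg.message_id
    return attributes

if __name__ == '__main__':
    msg = pubsub_v1.types.PubsubMessage(
        data=b'hello', attributes={'key': 'value'}, message_id='12345')
    print(_attributes_with_message_id(msg))  # {'key': 'value', 'message_id': '12345'}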
-----Original Message-----
From: Ahmet Altay <[email protected]>
Reply-To: [email protected]
To: dev <[email protected]>, Mark Liu <[email protected]>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property
Date: Fri, 02 Aug 2019 18:55:57 -0700

On Wed, Jul 31, 2019 at 4:19 AM Matthew Darwin <[email protected]> wrote:

Hi Ahmet/Udi,

There are a couple of additional tests that failed following my change: apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_strip_success and apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_attributes_success, as the added message_id in the attributes is not expected. I can fix these quite easily, but I'm not clear on what these tests are for, or what the pubsub_matcher is actually supposed to be testing, so my arbitrary fix of adding the expectation of message_id into the attributes property seems odd.

If I remember correctly, the pubsub matcher was a utility of the test framework. Its purpose is to make it easier to write tests that use pubsub as an output and verify the outputs of those tests. /cc +Mark Liu to correct me on this.

In terms of testing with the GCP Dataflow runner, I'm struggling a little. I've followed the guidance on the Python Tips page. I've built the tarball:

python setup.py sdist

Then run the following:

# Build portable worker
./gradlew :runners:google-cloud-dataflow-java:worker:build -x spotlessJava -x rat -x test
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar

# Build portable Python SDK harness and publish it to GCP
./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p sdks/python/container docker
gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest

Then, using my simple test pipeline to insert into a BigQuery table (a stripped-down sketch of the pipeline follows the command), I run the following:

python ~/github/Beam-Testing/PythonTestMessageId.py --runner DataflowRunner --project [MY-Project] --input_subscription projects/[MY-Project]/subscriptions/test-apache-beam.subscription --output_table [MY-Project]:test.dataflowrunner --temp_location gs://[MY-Project]/tmp --worker_harness_container_image gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --dataflow_worker_jar '../../runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.15.0-SNAPSHOT.jar' --debug
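For reference, the pipeline is essentially the following. This is only a stripped-down, illustrative sketch rather than the actual PythonTestMessageId.py; the subscription and table names simply mirror the placeholders in the command above:

# Stripped-down stand-in for my test pipeline (not the actual PythonTestMessageId.py):
# read from Pub/Sub with attributes, pull message_id out of the attributes map,
# and write one row per message to BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def run(argv=None):
    options = PipelineOptions(argv)
    options.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(
             subscription='projects/MY-PROJECT/subscriptions/test-apache-beam.subscription',
             with_attributes=True)
         | 'ToRow' >> beam.Map(lambda msg: {
             'message_id': msg.attributes.get('message_id', ''),
             'payload': msg.data.decode('utf-8')})
         | 'Write' >> beam.io.WriteToBigQuery(
             'MY-PROJECT:test.dataflowrunner',
             schema='message_id:STRING,payload:STRING'))

if __name__ == '__main__':
    run()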
It is hard for us to debug this without knowing what happened with the Dataflow jobs. Another way to verify this could be running one of the existing tests on your PR. You can try [1] by commenting "Run Python Dataflow ValidatesRunner" on your PR.

[1] https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PostCommit_Python_ValidatesRunner_Dataflow.groovy

The job starts ok, but when I publish messages, nothing is being read from the queue. I'm assuming this might be to do with the changes I've made, and that whilst it works with the direct runner this is not the case with the Dataflow runner; but I'm not receiving any errors in the debug stream on my console, only a series of info messages around the worker configuration. This may well be me doing stuff wrong, so apologies if I'm being thick here! Have I missed some steps in the build?

Regards

Matthew

-----Original Message-----
From: Ahmet Altay <[email protected]>
Reply-To: [email protected]
To: dev <[email protected]>, Udi Meiri <[email protected]>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property
Date: Mon, 29 Jul 2019 09:50:14 -0700

Hi Matthew,

This looks like a reasonable approach. There is a difference in how the direct runner reads from Pub/Sub compared to other runners. As you convert this to a PR, please pay attention to the difference and add tests for both cases.

On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin <[email protected]> wrote:

Hi All,

This is my first attempt at a change for Beam, for https://issues.apache.org/jira/browse/BEAM-7819. This parses the message_id when reading from the PubSub protobuf and adds it to the attributes property, as suggested by the existing documentation - https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes

My code is currently checked into my fork: https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id

Prior to creating a pull request, I just wanted to check that the approach is sensible, and then to go through the pull request process to ensure it goes smoothly. I do wonder whether, long term, despite the documentation indicating this would be part of the attributes property, it might be more sensible to fully mimic the protobuf for the PubSub message (a rough sketch of what I mean is at the bottom of this mail).

I agree that making it similar to the protobuf sounds like a more sensible approach. Take a look at how it is handled in Java. There is also value in consistency across languages. The final decision could be made in the PR.

Ahmet

/cc +Udi Meiri

Kind regards

Matthew
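For illustration only, here is a rough, hypothetical sketch of what "fully mimicking the protobuf" could look like on the Python side, i.e. exposing message_id (and publish_time) as first-class fields rather than folding message_id into attributes. This is not the SDK's actual PubsubMessage class, just the shape of the idea:

# Hypothetical message type that mirrors the Pub/Sub protobuf fields;
# not the actual apache_beam.io.gcp.pubsub.PubsubMessage implementation.
from google.cloud import pubsub_v1

class PubsubMessageLike(object):

    def __init__(self, data, attributes, message_id=None, publish_time=None):
        self.data = data                  # payload bytes
        self.attributes = attributes      # plain dict of string -> string
        self.message_id = message_id      # server-assigned id, not an attribute
        self.publish_time = publish_time  # server-assigned publish timestamp

    @staticmethod
    def from_proto(proto_msg):
        # proto_msg is assumed to be a google.cloud.pubsub_v1.types.PubsubMessage.
        return PubsubMessageLike(
            data=proto_msg.data,
            attributes=dict(proto_msg.attributes),
            message_id=proto_msg.message_id,
            publish_time=proto_msg.publish_time)

if __name__ == '__main__':
    proto = pubsub_v1.types.PubsubMessage(
        data=b'hello', attributes={'key': 'value'}, message_id='12345')
    msg = PubsubMessageLike.from_proto(proto)
    print(msg.message_id, msg.attributes)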
