Hi Ahmet/Udi,

There are a couple of additional tests that failed following my change; 
apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_strip_success
 and 
apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_attributes_success,
 as the added message_id in the attributes is not expected; I can fix these 
quite easily, but I'm not clear on what these tests are for, and what the 
pubsub_matcher is actually supposed to be testing for, so my aribitrary fix to 
add the expectation of message_id into the attributes property seems odd.

In terms of testing with gcp dataflow runner, I'm struggling a little. I've 
followed the guidance on the Python Tips page:-

I've built the tarball:-

python setup.py sdist

Then run the following:-

# Build portable worker
./gradlew :runners:google-cloud-dataflow-java:worker:build -x spotlessJava -x 
rat -x test
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar

# Build portable Pyhon SDK harness and publish it to GCP
./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p 
sdks/python/container docker
gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest

Then using my simple test pipeline to insert into a big query table run the 
following:-

python ~/github/Beam-Testing/PythonTestMessageId.py --runner DataflowRunner 
--project [MY-Project] --input_subscription 
projects/[MY-Project]/subscriptions/test-apache-beam.subscription 
--output_table [MY-Project]:test.dataflowrunner --temp_location 
gs://[MY-Project]/tmp --worker_harness_container_image 
gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api 
--sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --dataflow_worker_jar 
'../../runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.15.0-SNAPSHOT.jar'
 --debug

The job starts ok, but when I publish messages, nothing is being read from the 
queue. I'm assuming this might be to do with the changes made, and whilst it 
works with the directrunner this is not the case with the dataflow runner; but 
I'm not receiving any errors in the debug stream on my console, only a series 
of info around the worker configuration.

This may well be me doing stuff wrong, so apologies if I'm being thick here! 
Have I missed some steps in the build?

Regards

Matthew

-----Original Message-----
From: Ahmet Altay 
<[email protected]<mailto:ahmet%20altay%20%[email protected]%3e>>
Reply-To: [email protected]<mailto:[email protected]>
To: dev <[email protected]<mailto:dev%20%[email protected]%3e>>, Udi 
Meiri <[email protected]<mailto:udi%20meiri%20%[email protected]%3e>>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to 
the PubSubMessage attributes property
Date: Mon, 29 Jul 2019 09:50:14 -0700

This message originated from outside your organization
________________________________
Hi Matthew,

This looks like a reasonable approach. There is a difference how direct runner 
reads from pubsub compared to other runners. As you convert to a PR, please pay 
attention to the difference and add tests for both cases.

On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin 
<[email protected]<mailto:[email protected]>> 
wrote:

Hi All,


This is my first attempt at a change for Beam on 
https://issues.apache.org/jira/browse/BEAM-7819.  This parses the message_id 
when reading from the PubSub protobuf and adds to the message_id, as suggested 
by the existing documentation - 
https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes


My code is currently checked into my fork:- 
https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id


Prior to creating a pull request, I just wanted to check the approach is 
sensible and then to go through the process for pull request to ensure it goes 
smoothly.  I do wonder if long term despite the documentation indicated this 
would be part of the attributes property it might be more sensible to fully 
mimic the protobuf for the PubSub message.


I agree that making it similar to the protobuf sounds like a more sensible 
approach. Take a look at how it is handled in Java. There is also value in 
consistency across languages. Final decision could be made in the PR.

Ahmet

/cc +Udi Meiri<mailto:[email protected]>


Kind regards

Matthew

Disclaimer

The information contained in this communication from the sender is 
confidential. It is intended solely for use by the recipient and others 
authorized to receive it. If you are not the recipient, you are hereby notified 
that any disclosure, copying, distribution or taking action in relation of the 
contents of this information is strictly prohibited and may be unlawful.

This email has been scanned for viruses and malware, and may have been 
automatically archived by Mimecast Ltd, an innovator in Software as a Service 
(SaaS) for business. Providing a safer and more useful place for your human 
generated data. Specializing in; Security, archiving and compliance. To find 
out more visit the Mimecast website.

Reply via email to