On Wed, Jul 31, 2019 at 4:19 AM Matthew Darwin <
[email protected]> wrote:

> Hi Ahmet/Udi,
>
> There are a couple of additional tests that failed following my change;
> apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_strip_success
> and
> apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_attributes_success,
> as the added message_id in the attributes is not expected; I can fix these
> quite easily, but I'm not clear on what these tests are for, or what the
> pubsub_matcher is actually supposed to be testing, so my arbitrary fix of
> adding the expectation of message_id into the attributes property seems odd.
>

If I remember correctly, the pubsub matcher is a utility of the test
framework. Its purpose is to make it easier to write tests that use pubsub
as an output and to verify the outputs of those tests.
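For illustration, if the matcher now sees an auto-populated message_id in the received attributes, one way to adapt those tests would be to strip it before comparing against the expected attributes. This is only a sketch with made-up names, not the actual code in pubsub_matcher.py:

```python
# Hypothetical sketch -- the helper name and dicts below are illustrative,
# not the real Beam test code.

def strip_message_id(attributes):
    """Return a copy of an attributes dict without the auto-added message_id."""
    return {k: v for k, v in attributes.items() if k != 'message_id'}

# A received message now carries message_id in its attributes; the expected
# attributes written by the test author do not.
received = {'key1': 'value1', 'message_id': '1234567890'}
expected = {'key1': 'value1'}
assert strip_message_id(received) == expected
```

Whether stripping or expecting the field is the right fix depends on what the matcher is meant to assert, which is exactly the open question here.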

/cc +Mark Liu <[email protected]> to correct me on this.


>
> In terms of testing with gcp dataflow runner, I'm struggling a little.
> I've followed the guidance on the Python Tips page:-
>
> I've built the tarball:-
>
> python setup.py sdist
>
> Then run the following:-
>
> # Build portable worker
> ./gradlew :runners:google-cloud-dataflow-java:worker:build -x
> spotlessJava -x rat -x test
> ./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar
>
> # Build portable Python SDK harness and publish it to GCP
> ./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p
> sdks/python/container docker
> gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest
>
> Then using my simple test pipeline to insert into a big query table run
> the following:-
>
> python ~/github/Beam-Testing/PythonTestMessageId.py --runner
> DataflowRunner --project [MY-Project] --input_subscription
> projects/[MY-Project]/subscriptions/test-apache-beam.subscription
> --output_table [MY-Project]:test.dataflowrunner --temp_location
> gs://[MY-Project]/tmp --worker_harness_container_image
> gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api
> --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --dataflow_worker_jar
> '../../runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.15.0-SNAPSHOT.jar'
> --debug
>
>
It is hard for us to debug this without knowing what happened with the
Dataflow jobs. Another way to verify would be to run one of the existing
tests against your PR. You can try [1] by commenting "Run Python
Dataflow ValidatesRunner" on your PR.

[1]
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PostCommit_Python_ValidatesRunner_Dataflow.groovy


> The job starts OK, but when I publish messages, nothing is being read from
> the subscription. I'm assuming this might be related to my changes, since
> whilst it works with the DirectRunner this is not the case with the
> Dataflow runner; but I'm not receiving any errors in the debug stream on my
> console, only a series of info messages around the worker configuration.
>
> This may well be me doing stuff wrong, so apologies if I'm being thick
> here! Have I missed some steps in the build?
>
> Regards
>
> Matthew
>
> -----Original Message-----
> *From*: Ahmet Altay <[email protected]>
> *Reply-To*: [email protected]
> *To*: dev <[email protected]>, Udi Meiri <[email protected]>
> *Subject*: Re: [BEAM-7819] -python - parsing message_id from PubSub
> message to the PubSubMessage attributes property
> *Date*: Mon, 29 Jul 2019 09:50:14 -0700
>
> Hi Matthew,
>
> This looks like a reasonable approach. There is a difference in how the
> direct runner reads from pubsub compared to other runners. As you convert
> this to a PR, please pay attention to the difference and add tests for
> both cases.
>
> On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin <
> [email protected]> wrote:
>
> Hi All,
>
>
> This is my first attempt at a change for Beam on
> https://issues.apache.org/jira/browse/BEAM-7819.  This parses the
> message_id when reading from the PubSub protobuf and adds it to the
> attributes, as suggested by the existing documentation -
> https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes
>
>
> My code is currently checked into my fork:-
> https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id
>
>
> Prior to creating a pull request, I just wanted to check the approach is
> sensible and then to go through the process for pull request to ensure it
> goes smoothly.  I do wonder if, long term, despite the documentation
> indicating this would be part of the attributes property, it might be more
> sensible to fully mimic the protobuf for the PubSub message.
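The change being discussed can be sketched roughly as follows. This is an illustrative stand-in, assuming a protobuf-like message exposing data, attributes, and message_id fields; it is not the actual code in apache_beam/io/gcp/pubsub.py:

```python
# Illustrative sketch only; FakeProtoMessage stands in for the real
# Pub/Sub protobuf message type.

class FakeProtoMessage:
    """Minimal stand-in mimicking the Pub/Sub protobuf fields."""
    def __init__(self, data, attributes, message_id):
        self.data = data
        self.attributes = attributes
        self.message_id = message_id

def parse_message(proto):
    """Copy the protobuf's message_id into the attributes dict, per the
    approach discussed for BEAM-7819."""
    attributes = dict(proto.attributes)
    attributes['message_id'] = proto.message_id
    return proto.data, attributes

data, attrs = parse_message(FakeProtoMessage(b'payload', {'k': 'v'}, '1234567890'))
assert attrs == {'k': 'v', 'message_id': '1234567890'}
```

Fully mimicking the protobuf would instead surface message_id (and e.g. publish time) as top-level fields on PubsubMessage rather than folding them into attributes.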
>
>
> I agree that making it similar to the protobuf sounds like a more sensible
> approach. Take a look at how it is handled in Java. There is also value in
> consistency across languages. The final decision can be made in the PR.
>
> Ahmet
>
> /cc +Udi Meiri <[email protected]>
>
>
>
> Kind regards
>
> Matthew
>
>
