Hi folks,

Finding it a little hard to figure out the starting point. If someone from the community can pair with me on this for 30-45 mins, that would be great.
I have read the code in the pubsubio package and drilled down into the `beam.External` function as well, but couldn't make much sense of it.

Thanks,
Shivam Singhal

On Mon, 6 Feb 2023 at 22:45, Robert Burke <[email protected]> wrote:

> One solution would be to make the substitution only at pipeline submission time, when we know whether the job is being submitted to batch Dataflow or not, and either 1. fail with a clear error message, or 2. substitute the external transform with the batch publish DoFn.
>
> This is closer to how Java and Python implement their substitutions for portable implementations.
>
> This should be authored so the other runner adapters can reuse it when they don't support a "native" PubsubIO. This leads to predictable behavior between runners and for the implementation.
>
> This is the best approach, since the runner-based IO is preferable to trying to switch out portable DoFns. Arbitrary DoFns are a pain to single out and detect, while the external approach has a known URN to key off of.
>
> ------
>
> Another approach would be to change the existing pubsubio.Sink call so that it checks whether the input PCollection is unbounded or not.
>
> If it's unbounded, the current approach should be used.
>
> If it's bounded, then we're likely in a situation where the "runner-provided PubsubIO sink" isn't going to be available, so we should use a backup DoFn to publish in batch situations.
>
> This isn't a great solution because it leads to less predictable behavior. It also runs the risk of using the batch implementation instead of a tuned runner implementation in a streaming pipeline, which will likely have performance consequences.
>
> ----
>
> Personally, I feel Dataflow should support the sink properly in batch rather than differentiating between that and streaming, but I'm aware their implementation precludes that at the moment, as they also don't want to have multiple implementations to maintain.
>
> Robert Burke
> Beam Go Busybody
>
> On Mon, Feb 6, 2023, 8:44 AM Ritesh Ghorse via dev <[email protected]> wrote:
>
>> I've added a comment to the StackOverflow question <https://stackoverflow.com/questions/69651665/go-apache-beam-gcp-dataflow-could-not-find-the-sink-for-pubsub-check-that-th>. I guess the suggested approach there is the workaround for now, unfortunately. You have to create a new client in the setup function there and add a publish call <https://pkg.go.dev/cloud.google.com/go/pubsub#Topic.Publish> in `publishBatch()`.
>>
>> Also, @shivam, if you could pick up that issue, that would be great. Feel free to post questions/ask for help here.
>>
>> On Mon, Feb 6, 2023 at 7:13 AM Shivam Singhal <[email protected]> wrote:
>>
>>> I will pick the issue up once the maintainers have triaged it.
>>>
>>> On Mon, 6 Feb 2023 at 17:43, Shivam Singhal <[email protected]> wrote:
>>>
>>>> Not sure if there is any solution other than fixing the Go pubsubio package.
>>>>
>>>> On Mon, 6 Feb 2023 at 17:41, Ashok KS <[email protected]> wrote:
>>>>
>>>>> Yes, that is where I am getting stuck. I wrote the complete pipeline in Python, which reads from the BQ table and publishes the rows as Pub/Sub messages. I'm able to force it to run as a streaming application by passing --streaming=True. But for my project, they want it in Go, so I had to rewrite the complete logic in Go. I did the same, but got stuck at the last point of publishing to Pub/Sub.
>>>>>
>>>>> On Mon, Feb 6, 2023 at 11:07 PM Shivam Singhal <[email protected]> wrote:
>>>>>
>>>>>> It depends on the input source: it decides whether your pipeline is a streaming or a batch pipeline.
>>>>>>
>>>>>> Since you are querying over a BQ table, the input is finite and, as a result, your pipeline is a batch pipeline. I am not sure there is a straightforward way to convert this pipeline into a streaming pipeline.
>>>>>>
>>>>>> On Mon, 6 Feb 2023 at 17:32, Ashok KS <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Shivam,
>>>>>>>
>>>>>>> Thanks for that. How can I run the pipeline as a streaming pipeline? In Python I could just run the pipeline by passing --streaming=True on the command line, but I couldn't find anything similar in Go.
>>>>>>>
>>>>>>> Any pointers would be appreciated.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ashok
>>>>>>>
>>>>>>> On Mon, 6 Feb 2023 at 10:59 pm, Shivam Singhal <[email protected]> wrote:
>>>>>>>
>>>>>>>> The issue has not yet been verified by the maintainers, but I am pretty sure the pubsubio connector's Write method doesn't work in batch pipelines, because it's mentioned in the code comments. Check the issue below for the details:
>>>>>>>> https://github.com/apache/beam/issues/25326
>>>>>>>>
>>>>>>>> On Mon, 6 Feb 2023 at 17:26, Ashok KS <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Shivam,
>>>>>>>>>
>>>>>>>>> Thanks a lot for your response. Yes, it is a batch pipeline. My task is to read a BigQuery table, process the data, and publish the rows as Pub/Sub messages.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Ashok
>>>>>>>>>
>>>>>>>>> On Mon, 6 Feb 2023 at 10:52 pm, Shivam Singhal <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Ashok KS,
>>>>>>>>>>
>>>>>>>>>> Is this a batch pipeline?
>>>>>>>>>>
>>>>>>>>>> On Mon, 6 Feb 2023 at 09:27, Ashok KS <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Just sending a reminder in case anyone could help. I haven't received any response to my issue.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Ashok
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 3, 2023 at 12:23 AM Ashok KS <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm new to using Apache Beam with Go.
>>>>>>>>>>>>
>>>>>>>>>>>> When I try to publish a message to a topic with
>>>>>>>>>>>>
>>>>>>>>>>>> pubsubio.Write(scope, "project", "topic", ppMessages)
>>>>>>>>>>>>
>>>>>>>>>>>> I get the error message "Could not find the sink for pubsub. Check that the sink library specifies alwayslink = 1".
>>>>>>>>>>>>
>>>>>>>>>>>> I found a Stack Overflow post for the same issue, but it doesn't solve the problem:
>>>>>>>>>>>> https://stackoverflow.com/questions/69651665/go-apache-beam-gcp-dataflow-could-not-find-the-sink-for-pubsub-check-that-th
>>>>>>>>>>>>
>>>>>>>>>>>> Can someone please help?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Ashok
