OK. Will this work with something like Using Pub/Sub Lite with Apache Spark <https://cloud.google.com/pubsub/lite/docs/spark>?
Personally I have not used Pub/Sub with Spark (as opposed to Kafka with Spark on Dataproc). That said, I am sure there is a business value proposition to using Pub/Sub in the whole chain.

HTH

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Tue, 1 Feb 2022 at 18:01, Daniel Collins <[email protected]> wrote:

> Hello,
>
> I am trying to enable users to read data from Google Cloud Pub/Sub (CPS hereafter) using the structured streaming and SQL APIs.
>
> > Pub/Sub is a messaging queue best akin to RabbitMQ, distinct from Kafka
>
> This is not the best comparison: CPS has message storage and is most often used for long-lived streaming data fanout, in use cases identical to Apache Kafka's. Subscriptions and topics in CPS are also typically longer-lived and more heavyweight than in RabbitMQ.
>
> > will Pub/Sub still be required to pass the topics
>
> Yes, the user will still be required to configure a topic/subscription externally; the goal is for them to be able to manipulate their data using the DataFrame APIs.
>
> -Daniel
>
> On Tue, Feb 1, 2022 at 12:34 PM Mich Talebzadeh <[email protected]> wrote:
>
>> Hi,
>>
>> It is not very clear to me what you mean, and I quote: "possible to implement a structured streaming source for Pub/Sub".
>>
>> Pub/Sub is a messaging queue best akin to RabbitMQ, distinct from Kafka. It would be interesting to know what this proposal of yours is going to achieve. On the face of it, you are trying to make Pub/Sub behave like SSS. If that is the case, will Pub/Sub still be required to pass the topics?
>>
>> HTH
>>
>> On Tue, 1 Feb 2022 at 15:54, Daniel Collins <[email protected]> wrote:
>>
>>> Hello spark developers,
>>>
>>> I'm trying to figure out whether it would be possible to implement a structured streaming source for Google Pub/Sub <https://cloud.google.com/pubsub>; however, the continuous-reader programming model has some impedance with this system. Google Pub/Sub is not a partitioned system by nature and does not have an offset concept, instead using per-message acknowledgement IDs (ackIDs) to track subscriber progress. However, I think it should be possible to work around this impedance and implement a structured streaming source as follows:
>>>
>>> - Create some large fixed number (possibly configurable) of input partitions with no semantic meaning (Pub/Sub is not a partitioned system).
>>> - Create an Offset which encapsulates the set of read ackIDs to propagate from the ContinuousInputPartitionReader back to the ContinuousReader.
>>> - In the commit() method, acknowledge all the IDs that are in the set with that offset.
>>> - Return true from `needsReconfiguration` every minute or so to prevent ContinuousInputPartitionReaders from accumulating too many ackIDs in their local state.
>>>
>>> Does this sound like a feasible way to implement a source for this system? Can anyone anticipate any issues with this approach? Are there any other approaches which might solve this problem better?
>>>
>>> Looking forward to your responses!
>>>
>>> Thank you,
>>>
>>> Daniel
>>>
>>> P.S. If you've received this already, apologies for the spam; I can't see it in the archives and only recently successfully subscribed to the mailing list.
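[Editor's note: a minimal, Spark-free sketch of the ackID bookkeeping that Daniel's four bullet points describe. All class and function names below are illustrative stand-ins, not Spark's actual DataSource V2 interfaces or the Pub/Sub client API; the sketch only models the proposed flow of ackIDs from partition readers, through a set-valued "offset", to an acknowledge call at commit time.]

```python
# Illustrative model of the proposal: offsets are sets of ackIDs rather
# than numeric positions, and commit() acknowledges everything in the set.
# These are hypothetical stand-in classes, not Spark or Pub/Sub APIs.

class AckIdOffset:
    """An 'offset' that is just the set of ackIDs read so far."""
    def __init__(self, ack_ids=None):
        self.ack_ids = frozenset(ack_ids or ())

    def merge(self, other):
        # The driver would union per-partition offsets into one global offset.
        return AckIdOffset(self.ack_ids | other.ack_ids)


class PartitionReader:
    """Stands in for a ContinuousInputPartitionReader: buffers ackIDs locally."""
    def __init__(self):
        self._pending = set()

    def on_message(self, ack_id):
        self._pending.add(ack_id)

    def current_offset(self):
        return AckIdOffset(self._pending)

    def reconfigure(self):
        # Models the effect of returning true from needsReconfiguration:
        # local ackID state is dropped so it cannot grow without bound.
        self._pending.clear()


class FakeSubscriber:
    """Records acknowledge() calls so we can observe what commit() would do."""
    def __init__(self):
        self.acked = set()

    def acknowledge(self, ack_ids):
        self.acked |= set(ack_ids)


def commit(subscriber, offset):
    """Models ContinuousReader.commit(): acknowledge every ackID in the offset."""
    subscriber.acknowledge(offset.ack_ids)


# Two semantically meaningless partitions receive messages.
readers = [PartitionReader(), PartitionReader()]
readers[0].on_message("ack-1")
readers[0].on_message("ack-2")
readers[1].on_message("ack-3")

# The driver merges per-partition offsets into one global offset...
global_offset = AckIdOffset()
for r in readers:
    global_offset = global_offset.merge(r.current_offset())

# ...and commit() acknowledges every ackID captured in that offset.
subscriber = FakeSubscriber()
commit(subscriber, global_offset)
print(sorted(subscriber.acked))  # ['ack-1', 'ack-2', 'ack-3']
```

The periodic `reconfigure()` in this sketch is the point of the fourth bullet: because an ackID-set offset only shrinks when state is flushed, readers must be cycled regularly or their local sets (and the serialized offsets) grow without bound.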
