Hi community!

Our team is implementing the SparkReceiverIO connector [1]. This connector
uses a SparkReceiver [2] as a streaming source for receiving data from
third-party services, which are typically APIs. What is different in our use
case is that the source provides an interface and a format but does not
provide a Spark execution environment. As a result, the SparkReceiver source
has no data store of its own that we could use for checkpointing or for
handling worker failures.


We are considering options such as temporary files or external storage for
storing SparkReceiver pipeline checkpoints.
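
To make the temporary-file idea more concrete, here is a minimal sketch of
what such a checkpoint could look like. It is only an illustration, not part
of the connector: the class name FileOffsetCheckpoint and the assumption that
a checkpoint is a single long offset are hypothetical, and it uses only the
standard java.nio.file API rather than any Beam or Spark classes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.OptionalLong;

/** Hypothetical helper: keeps the last processed offset in a local file. */
class FileOffsetCheckpoint {
  private final Path file;

  FileOffsetCheckpoint(Path file) {
    this.file = file;
  }

  /**
   * Writes the offset to a sibling ".tmp" file and then renames it over the
   * checkpoint file, so a reader should not observe a half-written value.
   * Whether ATOMIC_MOVE replaces an existing target is implementation
   * specific; it may throw IOException on some file systems.
   */
  void save(long offset) throws IOException {
    Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
    Files.write(tmp, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
    Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
  }

  /** Returns the stored offset, or empty if nothing has been saved yet. */
  OptionalLong load() throws IOException {
    if (!Files.exists(file)) {
      return OptionalLong.empty();
    }
    String text =
        new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
    return OptionalLong.of(Long.parseLong(text));
  }
}
```

A reader would call load() on start-up to decide where to resume and save()
after each batch it has emitted. The sketch says nothing about whether a
worker-local file is still available after a worker is restarted or replaced,
which is the concern behind question 2 below.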


I’d like to ask the community some questions that could help us:

  1.  Are there any examples of sources similar to ours, that is, sources
without their own storage, for which the Apache Beam IO Read interface is
implemented?

  2.  Are there any restrictions on the use of temp files in the SDF context 
(Dataflow runner)?

  3.  What external storage do you think would be suitable for SparkReceiverIO?


Thanks in advance for any help!

Elizaveta



[1] [BEAM-14378] [CdapIO] SparkReceiverIO Read via SDF – 
https://github.com/apache/beam/pull/17828

[2] Spark Streaming Custom Receivers – 
https://spark.apache.org/docs/latest/streaming-custom-receivers.html

