Hi community! Our team is implementing the SparkReceiverIO connector [1]. The connector uses a SparkReceiver [2] as a streaming source that receives data from third-party services, which are typically APIs. What is different in our use case is that the source provides an interface and a data format but does not provide a Spark execution environment. As a result, the SparkReceiver source has no data store of its own that we could use for checkpointing or for handling worker failures.
We are thinking about some ideas like temporary files or external storage to store SparkReceiver pipeline checkpoints in our case. I'd like to ask the community some questions that could help us:

1. Are there any examples of sources similar to our case, that is, a source without its own storage, for which the Apache Beam IO Read interface is implemented?
2. Are there any restrictions on the use of temp files in the SDF context (Dataflow runner)?
3. What external storage do you think would be suitable for SparkReceiverIO?

Thanks in advance for any of your help!

Elizaveta

[1] [BEAM-14378] [CdapIO] SparkReceiverIO Read via SDF - https://github.com/apache/beam/pull/17828
[2] Spark Streaming Custom Receivers - https://spark.apache.org/docs/latest/streaming-custom-receivers.html