Have you considered using the AvroIO.ReadAll transform:
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/extensions/avro/io/AvroIO.ReadAll.html
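
As a rough sketch of the shape this could take (untested; the class name,
subscription name, and schema below are placeholders): each Pub/Sub message
carries the GCS path, and the Avro reading is done with the
FileIO.matchAll()/readMatches() + AvroIO.readFilesGenericRecords()
composition, which, if I recall correctly, is what the ReadAll javadoc
points to as its replacement:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.avro.io.AvroIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

public class PubsubDrivenAvroRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Placeholder schema -- substitute the actual schema of your Avro files.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

    // Assumes each Pub/Sub message body is a gs:// path (or glob) pointing
    // at an Avro file in the bucket; the subscription name is hypothetical.
    PCollection<String> filePatterns =
        p.apply("ReadFilePointers",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/avro-files"));

    // Expand each path into file metadata and read the records. Splitting
    // happens inside readFiles, so Dataflow parallelizes (and autoscales)
    // on the file contents rather than on the Pub/Sub messages themselves.
    PCollection<GenericRecord> records =
        filePatterns
            .apply(FileIO.matchAll())
            .apply(FileIO.readMatches())
            .apply(AvroIO.readFilesGenericRecords(schema));

    p.run();
  }
}

This keeps PubsubIO as the only unbounded source, so the message-side
checkpointing and watermarking are handled for you, and you avoid writing a
custom UnboundedSource that wraps a bounded one.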

On Sun, Oct 27, 2024 at 10:41 AM Francesco Scipioni
<scipion...@gmail.com> wrote:
>
> Hi everyone,
>
> I am currently working on implementing a custom I/O connector using Apache 
> Beam, specifically aiming to run it on Google Cloud Dataflow. Here’s a quick 
> overview of my workflow:
>
> I receive an event via Pub/Sub that includes a pointer to an Avro file 
> stored in a GCS bucket. My goal is to process these files in a streaming 
> Dataflow pipeline, with autoscaling driven by the GCS/Avro metadata rather 
> than by the Pub/Sub metadata.
>
> My current approach is to create an *Unbounded Pub/Sub Source* that wraps a 
> *Bounded GCS/Avro Source*. I intend to implement checkpointing using 
> metadata from both the Avro file and Pub/Sub.
>
> However, I'm running into difficulties implementing the checkpointing 
> mechanism, and I'm unsure of the best way to set up the project to get 
> started.
>
> Would anyone be able to provide guidance, suggestions, or examples? I’d 
> really appreciate any help you can offer.
>
> Thank you very much!
>
> Best regards,
> Francesco
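
If you do want to stick with a custom UnboundedSource wrapping the bounded
Avro read, the checkpointing piece is a CheckpointMark that records both
which Pub/Sub message (file) you are on and the position within that file.
A minimal sketch of the shape, with hypothetical names (worth noting that
Splittable DoFn is the generally recommended way to write new connectors
these days, rather than UnboundedSource):

import java.io.IOException;
import java.io.Serializable;
import org.apache.beam.sdk.io.UnboundedSource;

// Hypothetical checkpoint mark combining the Pub/Sub position (which file
// we are processing) with the offset inside the current Avro file.
public class AvroFileCheckpointMark
    implements UnboundedSource.CheckpointMark, Serializable {

  private final String currentFile; // gs:// path from the Pub/Sub message
  private final long recordsRead;   // how far into that file we have read

  public AvroFileCheckpointMark(String currentFile, long recordsRead) {
    this.currentFile = currentFile;
    this.recordsRead = recordsRead;
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    // Invoked once the runner has durably committed this checkpoint; this
    // is the right place to ack the Pub/Sub message for a file whose
    // records have all been emitted downstream.
  }
}

Your reader's getCheckpointMark() returns one of these, and the source
supplies a coder for it from getCheckpointMarkCoder(); for a Serializable
mark like this, SerializableCoder.of(AvroFileCheckpointMark.class) is
enough to get going.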
