Hello Beam Devs,

We are starting to explore Apache Beam and Google Cloud Dataflow, and it seems like a good fit for some of our data processing use cases. Some of my colleagues have worked with Apache Spark in the past, but the promise of not having to manage servers has us leaning towards Dataflow right now.
A lot of our raw data sits in S3 buckets as either a single JSON object or a JSON array of multiple objects. I see on the Beam site that a JSON source may be in the works, or at least is being discussed:
https://beam.apache.org/documentation/io/built-in/
https://issues.apache.org/jira/browse/BEAM-1581

I also see the docs recommend thinking hard before trying to write a new source, and being a newbie to this world, I might be missing a more straightforward solution. The pipeline I had in mind was:

read from S3 source -> convert to JSON objects -> (if arrays, then flatMap) -> filter -> groupBy -> collect

In the initial step, however, the TextIO source splits each file into lines (to parallelize the reading, I suppose) - this happens on files in GCS and on local disk as well. Is there a way to recombine the lines from a single file back into one string that can be JSON-parsed? That seems like a group operation in the pipeline, but I can't see TextIO sending the filename/line numbers to the downstream transform, which could otherwise group the data back together.

I could try to hack together a custom source for our use case, but thought I'd shoot you a note first (the wiki says I should :-) ). Let me know if you have any thoughts, and apologies for what might be a super noob question - after a day spent reading the Beam wiki, Googling, and searching Stack Overflow, I figured it was worth a shot.

Thanks,
Anant
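
P.S. In case it helps clarify what I mean, here is a rough sketch of the first two steps using the Java SDK. I'm assuming FileIO.match()/readMatches() can be pointed at the bucket to read each matched file whole (rather than line-by-line like TextIO), and I'm using Jackson to parse the result. The bucket path is a placeholder and I haven't run this against S3, so please read it as a sketch of intent rather than working code:

import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class WholeFileJsonSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Match files and read each one whole, instead of splitting into lines.
    PCollection<String> jsonDocs = p
        .apply(FileIO.match().filepattern("s3://my-bucket/raw/*.json")) // placeholder path
        .apply(FileIO.readMatches())
        .apply("ReadWholeFile", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void process(ProcessContext c) throws IOException {
            // One output element per file: the full contents as a single string.
            c.output(c.element().readFullyAsUTF8String());
          }
        }));

    // Parse each document; if it is a JSON array, emit each element (the "flatMap" step).
    PCollection<String> objects = jsonDocs
        .apply("ExplodeArrays", ParDo.of(new DoFn<String, String>() {
          private transient ObjectMapper mapper;

          @Setup
          public void setup() {
            mapper = new ObjectMapper();
          }

          @ProcessElement
          public void process(ProcessContext c) throws IOException {
            JsonNode root = mapper.readTree(c.element());
            if (root.isArray()) {
              for (JsonNode node : root) {
                c.output(node.toString());
              }
            } else {
              c.output(root.toString());
            }
          }
        }));

    // ...the filter / groupBy / collect steps would follow from here.
    p.run();
  }
}

One thing I'm unsure about with this approach: reading each file whole means a single large file can't be split across workers, which is presumably exactly why TextIO goes line-by-line in the first place. Happy to hear if there's a better pattern.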