Hello Beam Devs,

We are starting to explore Apache Beam and Google Cloud Dataflow, and it seems like a good fit for some of our data processing use cases. Some of my colleagues have worked with Apache Spark in the past, but the promise of not having to manage servers has us leaning towards Dataflow right now.
A lot of our raw data sits in S3 buckets as either a single JSON object or a JSON array of multiple objects. I see on the Beam site that a JSON source may be in the works, or at least is being discussed:
https://beam.apache.org/documentation/io/built-in/
https://issues.apache.org/jira/browse/BEAM-1581

I also see the docs recommend thinking hard before trying to write a new source, and being a newbie to this world, I might be missing a more straightforward solution. The pipeline I had in mind was:

read from S3 source -> convert to JSON objects -> (if arrays, then flatMap) -> filter -> groupBy -> collect

In the initial step, however, the TextIO source splits each file into lines (to parallelize the reading, I suppose) - this happens on files in GCS and on local disk as well. Is there a way to recombine the lines from a single file back into one string that can be JSON-parsed? That seems like a group operation in the pipeline, but I can't see TextIO sending the filename/line numbers to the downstream transform, which could otherwise group the data back together.

I could try to hack together a custom source for our use case, but thought I'd shoot you a note first (the wiki says I should :-) ). Let me know if you have any thoughts, and apologies for what might be a super noob question - after a day spent reading the Beam wiki, Googling, and searching Stack Overflow, I figured it was worth a shot.

Thanks,
Anant
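
P.S. In case it helps clarify what I mean, here is a rough sketch of the first two steps using the Java SDK. I'm assuming FileIO.match()/readMatches() can be pointed at the bucket to read each matched file whole (rather than line-by-line like TextIO), and I'm using Jackson to parse the result. The bucket path is a placeholder and I haven't run this against S3, so please read it as a sketch of intent rather than working code:

import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class WholeFileJsonSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Match files and read each one whole, instead of splitting into lines.
    PCollection<String> jsonDocs = p
        .apply(FileIO.match().filepattern("s3://my-bucket/raw/*.json")) // placeholder path
        .apply(FileIO.readMatches())
        .apply("ReadWholeFile", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void process(ProcessContext c) throws IOException {
            // One output element per file: the full contents as a single string.
            c.output(c.element().readFullyAsUTF8String());
          }
        }));

    // Parse each document; if it is a JSON array, emit each element (the "flatMap" step).
    PCollection<String> objects = jsonDocs
        .apply("ExplodeArrays", ParDo.of(new DoFn<String, String>() {
          private transient ObjectMapper mapper;

          @Setup
          public void setup() {
            mapper = new ObjectMapper();
          }

          @ProcessElement
          public void process(ProcessContext c) throws IOException {
            JsonNode root = mapper.readTree(c.element());
            if (root.isArray()) {
              for (JsonNode node : root) {
                c.output(node.toString());
              }
            } else {
              c.output(root.toString());
            }
          }
        }));

    // ...the filter / groupBy / collect steps would follow from here.
    p.run();
  }
}

One thing I'm unsure about with this approach: reading each file whole means a single large file can't be split across workers, which is presumably exactly why TextIO goes line-by-line in the first place. Happy to hear if there's a better pattern.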