Hi Anant,

did you take a look at the Jackson extension:
https://github.com/apache/beam/tree/master/sdks/java/extensions/jackson

Maybe it does what you want (converting JSON to objects).

Regards
JB

On 02/14/2018 03:50 AM, Anant Chaudhary wrote:
> Hello Beam Devs,
>
> We are starting to explore apache beam and google cloud dataflow. Seems like it
> can fit some of our data processing use cases pretty well. Some of my colleagues
> have worked with Apache Spark in the past, however the promise of not having to
> manage the servers has us inclining towards dataflow right now.
>
> A lot of the raw data that we have sits in S3 buckets as either single JSON
> object, or a JSON array of multiple objects. I see on the beam wiki that a JSON
> source may be in the works, or at least is being discussed.
>
> https://beam.apache.org/documentation/io/built-in/
> https://issues.apache.org/jira/browse/BEAM-1581
>
> I do also see the docs recommend thinking hard before trying to write a new
> source. Being a newbie to this world, I might be missing a more straightforward
> solution to the problem.
>
> The pipeline I had in mind was read from s3 source -> convert to json objects
> -> (if arrays, then flatMap) -> filter -> groupby -> collect
>
> In the initial step however the textIO source splits the file in to lines (in
> trying to speed up the reading I suppose) - happens on files in gs or local
> disk.
>
> Is there a way to recombine lines from a 'single file' back in to one string
> which can be JSON parsed? Seems like a group operation in the pipeline, cant see
> the textIO sending the filename/line numbers to the downstream transform, which
> could group the data back.
>
> I can try to hack a custom source for our use case, but thought I'll shoot you
> guys a note (wiki says I should :-)
>
> Let me know if you guys have thoughts, and apologize for what might be a super
> noob question. After spending a day reading beam wiki, googling and
> stackoverflow, I figured might be worth a shot.
>
> Thanks
> Anant

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
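
For illustration only, here is a minimal sketch of the parsing steps in the pipeline from the quoted message (convert to JSON objects, flatten arrays, filter, group). It is plain Python using only the standard library, not Beam code. It assumes each file's contents already arrive as one complete string rather than as separate lines. The function names, the sample data, and the `user` and `clicks` fields are all hypothetical.

```python
import json
from collections import defaultdict


def records_from_file(text):
    """Parse one whole file's text and return a list of JSON values.

    A top-level array yields its elements; a single object yields itself.
    """
    parsed = json.loads(text)
    if isinstance(parsed, list):
        return parsed
    return [parsed]


def group_by_key(records, key):
    """Group records (dicts) by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Hypothetical file contents: one file holds a single object, the other an
# array. Both span several lines, so line-by-line parsing would fail.
files = [
    '{"user": "a",\n "clicks": 3}',
    '[{"user": "b", "clicks": 0},\n {"user": "a", "clicks": 5}]',
]

# Flat-map step: every file contributes one or more records.
records = [r for text in files for r in records_from_file(text)]

# Filter step, then group step.
active = [r for r in records if r["clicks"] > 0]
grouped = group_by_key(active, "user")
```

In a Beam pipeline the same per-file function could serve as the body of a flat-map style transform, provided an earlier step delivers whole-file strings instead of individual lines.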
