Re: json source for a pipeline

Jean-Baptiste Onofré Tue, 13 Feb 2018 22:19:22 -0800

Hi Anant,

Did you take a look at the Jackson extension?


https://github.com/apache/beam/tree/master/sdks/java/extensions/jackson

Maybe it does what you want (converting JSON to objects).
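
For the "convert to JSON objects -> (if arrays, then flatMap)" part of the pipeline described below, the per-file logic can be sketched in plain Python as follows. This is only an illustration, not Beam or Jackson API: it assumes each file's full contents are already available as one string (how the file is read is outside this sketch), and the function name `records_from_json_text` is hypothetical.

```python
import json
from typing import Any, Iterator


def records_from_json_text(text: str) -> Iterator[Any]:
    """Yield one record per top-level JSON value in a whole-file string.

    Hypothetical helper for illustration; assumes `text` is the complete
    content of one file, not a single line of it.
    """
    parsed = json.loads(text)
    if isinstance(parsed, list):
        # Top-level array: emit each element separately (the flatMap step).
        yield from parsed
    else:
        # Single top-level object: emit it as the only record.
        yield parsed


# Two example "files": one single object, one array, both spanning lines.
single = '{"id": 1,\n "name": "a"}'
array = '[{"id": 2},\n {"id": 3}]'

records = [r for text in (single, array) for r in records_from_json_text(text)]
```

Because the parse works on the whole string, newlines inside a file do not matter, which is why a line-splitting source does not fit this input.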

Regards
JB

On 02/14/2018 03:50 AM, Anant Chaudhary wrote:
> Hello Beam Devs,
> 
> We are starting to explore Apache Beam and Google Cloud Dataflow. It seems like
> it can fit some of our data processing use cases pretty well. Some of my
> colleagues have worked with Apache Spark in the past; however, the promise of
> not having to manage the servers has us leaning towards Dataflow right now.
> 
> A lot of the raw data that we have sits in S3 buckets as either a single JSON
> object or a JSON array of multiple objects. I see on the Beam wiki that a JSON
> source may be in the works, or at least is being discussed.
> 
> https://beam.apache.org/documentation/io/built-in/
> https://issues.apache.org/jira/browse/BEAM-1581
> 
> I also see that the docs recommend thinking hard before trying to write a new
> source. Being a newbie to this world, I might be missing a more
> straightforward solution to the problem.
> 
> The pipeline I had in mind was: read from S3 source -> convert to JSON objects
> -> (if arrays, then flatMap) -> filter -> groupby -> collect
> 
> In the initial step, however, the TextIO source splits the file into lines
> (to speed up the reading, I suppose). This happens on files in GS or on local
> disk.
> 
> Is there a way to recombine lines from a 'single file' back into one string
> which can be JSON parsed? It seems like a group operation in the pipeline, but
> I can't see TextIO sending the filename/line numbers to the downstream
> transform, which could group the data back.
> 
> I can try to hack a custom source for our use case, but thought I'd shoot you
> guys a note (the wiki says I should :-)
> 
> Let me know if you guys have thoughts, and I apologize for what might be a
> super noob question. After spending a day reading the Beam wiki, googling, and
> searching Stack Overflow, I figured it might be worth a shot.
> 
> Thanks
> Anant

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
