Hi,

You can use the general-purpose FileIO. It is designed to cover file-based use cases that the format-specific IOs bundled with Beam (e.g. TextIO and AvroIO) do not explicitly support.
For example:

    p.apply(FileIO.match().filepattern("...")).apply(FileIO.readMatches())

will give you a PCollection<FileIO.ReadableFile>, where each element is one whole file rather than one line. You can then do any file processing you want using regular Java libraries: ReadableFile.open() returns a ReadableByteChannel, and Channels.newInputStream() converts it to an InputStream.

On Tue, Feb 13, 2018 at 9:42 PM Anant Chaudhary <an...@getrocket.com> wrote:

> Hello Beam Devs,
>
> We are starting to explore apache beam and google cloud dataflow. Seems
> like it can fit some of our data processing use cases pretty well. Some of
> my colleagues have worked with Apache Spark in the past, however the
> promise of not having to manage the servers has us inclining towards
> dataflow right now.
>
> A lot of the raw data that we have sits in S3 buckets as either single
> JSON object, or a JSON array of multiple objects. I see on the beam wiki
> that a JSON source may be in the works, or at least is being discussed.
>
> https://beam.apache.org/documentation/io/built-in/
> https://issues.apache.org/jira/browse/BEAM-1581
>
> I do also see the docs recommend thinking hard before trying to write a
> new source. Being a newbie to this world, I might be missing a more
> straightforward solution to the problem.
>
> The pipeline I had in mind was read from s3 source -> convert to json
> objects -> (if arrays, then flatMap) -> filter -> groupby -> collect
>
> In the initial step however the textIO source splits the file in to lines
> (in trying to speed up the reading I suppose) - happens on files in gs or
> local disk.
>
> Is there a way to recombine lines from a 'single file' back in to one
> string which can be JSON parsed? Seems like a group operation in the
> pipeline, cant see the textIO sending the filename/line numbers to the
> downstream transform, which could group the data back.
>
> I can try to hack a custom source for our use case, but thought I'll shoot
> you guys a note (wiki says I should :-)
>
> Let me know if you guys have thoughts, and apologize for what might be a
> super noob question. After spending a day reading beam wiki, googling and
> stackoverflow, I figured might be worth a shot.
>
> Thanks
> Anant
>
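
P.S. To make the channel-to-stream step from my reply concrete, here is a minimal sketch that uses only the JDK. The class name WholeFileReader and the helper readAll are hypothetical names for illustration, UTF-8 is an assumed encoding, and the in-memory channel in main merely stands in for what ReadableFile.open() would return inside a pipeline. The point is that the whole file comes back as one string, newlines included, so it can go straight to a JSON parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

public class WholeFileReader {

    /**
     * Drains the channel and returns its contents as a single string.
     * Assumption: the file is UTF-8 encoded text.
     */
    public static String readAll(ReadableByteChannel channel) throws IOException {
        // Closing the stream also closes the underlying channel.
        try (InputStream in = Channels.newInputStream(channel)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a multi-line JSON file; in a pipeline the channel
        // would come from ReadableFile.open() instead.
        byte[] bytes = "[{\"id\": 1},\n {\"id\": 2}]".getBytes(StandardCharsets.UTF_8);
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(bytes));
        System.out.println(readAll(channel));
    }
}
```

In a pipeline you would call something like readAll(file.open()) from a DoFn over the PCollection of ReadableFile elements, then parse the resulting string and, for JSON arrays, emit one output per array entry.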