You can use the general-purpose FileIO. It was designed to support pretty
much anything not explicitly supported by the IOs for concrete file formats
bundled with Beam, eg TextIO and AvroIO.

p.apply(FileIO.match().filepattern("...")).apply(FileIO.readMatches()) will
give you a PCollection<ReadableFile> that you can use to do any file
processing you want using regular Java libraries, e.g. ReadableFile.open()
gives a ReadableByteChannel (use Channels.newInputStream() to convert it to
an InputStream).

On Tue, Feb 13, 2018 at 9:42 PM Anant Chaudhary <an...@getrocket.com> wrote:

> Hello Beam Devs,
> We are starting to explore apache beam and google cloud dataflow. Seems
> like it can fit some of our data processing use cases pretty well. Some of
> my colleagues have worked with Apache Spark in the past, however the
> promise of not having to manage the servers has us inclining towards
> dataflow right now.
> A lot of the raw data that we have sits in S3 buckets as either single
> JSON object, or a JSON array of multiple objects. I see on the beam wiki
> that a JSON source may be in the works, or at least is being discussed.
> https://beam.apache.org/documentation/io/built-in/
> https://issues.apache.org/jira/browse/BEAM-1581
> I do also see the docs recommend thinking hard before trying to write a
> new source. Being a newbie to this world, I might be missing a more
> straightforward solution to the problem.
> The pipeline I had in mind was  read from s3 source -> convert to json
> objects -> (if arrays, then flatMap) -> filter -> groupby -> collect
> In the initial step however the textIO source splits the file in to lines
> (in trying to speed up the reading I suppose) - happens on files in gs or
> local disk.
> Is there a way to recombine lines from a 'single file' back in to one
> string which can be JSON parsed? Seems like a group operation in the
> pipeline, cant see the textIO sending the filename/line numbers to the
> downstream transform, which could group the data back.
> I can try to hack a custom source for our use case, but thought I'll shoot
> you guys a note (wiki says I should :-)
> Let me know if you guys have thoughts, and apologize for what might be a
> super noob question. After spending a day reading beam wiki, googling and
> stackoverflow, I figured might be worth a shot.
> Thanks
> Anant

Reply via email to