I haven't needed to do this with Beam before, but I've definitely had
similar needs in the past. Spark, for example, provides an input_file_name
function that can be applied to a dataframe to add the input file name as
an additional column. It's not clear to me how that's implemented, though.

Perhaps others have suggestions, but I'm not aware of a way to do this
conveniently in Beam today. To my knowledge, you would have to use
FileIO.match() and FileIO.readMatches() to get a collection of
ReadableFile. You'd then FlatMapElements (or ParDo) over those to pull out
the metadata and the bytes of each file, and you'd be responsible for
parsing those bytes into Avro records yourself. You'd then be able to
output something like a KV<String, T> that pairs the file name with each
parsed Avro record.
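Roughly, a sketch of that pattern might look like the following (using a
ParDo rather than FlatMapElements so we can throw IOException; the file
pattern and the use of GenericRecord are assumptions for illustration):

```java
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<KV<String, GenericRecord>> records =
    pipeline
        // Match files; the pattern here is just an example.
        .apply(FileIO.match().filepattern("gs://my-bucket/input/*.avro"))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(
            new DoFn<FileIO.ReadableFile, KV<String, GenericRecord>>() {
              @ProcessElement
              public void process(
                  @Element FileIO.ReadableFile file,
                  OutputReceiver<KV<String, GenericRecord>> out)
                  throws IOException {
                // The file name comes from the match metadata.
                String fileName =
                    file.getMetadata().resourceId().toString();
                // Read the whole file and parse it as an Avro
                // container file.
                byte[] bytes = file.readFullyAsBytes();
                try (DataFileReader<GenericRecord> reader =
                    new DataFileReader<>(
                        new SeekableByteArrayInput(bytes),
                        new GenericDatumReader<>())) {
                  for (GenericRecord record : reader) {
                    out.output(KV.of(fileName, record));
                  }
                }
              }
            }));
```

Note you'd likely also need to set a coder on the output (e.g.
KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(schema))), since
GenericRecord has no default coder.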

Seems like something worth providing better support for in Beam itself if
this indeed doesn't already exist.
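For the BigQuery half of the question below: once you have a
PCollection<KV<String, GenericRecord>> of file-name/record pairs, you
could route records to different tables using BigQueryIO's dynamic
destinations. A rough sketch, where tableNameFor(), toTableRow(), and the
schema are hypothetical helpers you'd supply:

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.KV;

import org.apache.avro.generic.GenericRecord;

records.apply(
    BigQueryIO.<KV<String, GenericRecord>>write()
        // Choose the destination table per element, based on the
        // file name carried in the key.
        .to(input -> {
          String fileName = input.getValue().getKey();
          // tableNameFor() is a hypothetical mapping from file name
          // to table name.
          return new TableDestination(
              "my_project:my_dataset." + tableNameFor(fileName), null);
        })
        // toTableRow() is a hypothetical GenericRecord -> TableRow
        // conversion.
        .withFormatFunction(kv -> toTableRow(kv.getValue()))
        .withSchema(tableSchema));
```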

On Thu, Feb 7, 2019 at 7:29 AM Chaim Turkel <ch...@behalf.com> wrote:

> Hi,
>   I am working on a pipeline that listens to a topic on pubsub to get
> files that have changes in the storage. Then i read avro files, and
> would like to write them to bigquery based on the file name (to
> different tables).
>   My problem is that the transformer that reads the avro does not give
> me back the files name (like a tuple or something like that). I seem
> to have this pattern come back a lot.
> Can you think of any solutions?
>
> Chaim
>
