Yes, this will change. Apache Beam has been working toward a general solution for making all IO connectors modular [1]. That would let you read from an arbitrary number of sources, chaining the output of one into the next.
1: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html

On Mon, Oct 29, 2018 at 9:57 AM Chaim Turkel <[email protected]> wrote:
> Both solutions mean that I cannot use the Beam IO classes that would
> handle the distribution for me; instead I would have to fetch the data
> myself in a ParDo method. Is this something that will change in the
> future? I understand that Spark has a push-down mechanism that passes
> the filter down to the next level of queries.
> chaim
>
> On Mon, Oct 22, 2018 at 4:02 PM Jeff Klukas <[email protected]> wrote:
> >
> > Chaim - If the full list of IDs fits comfortably in memory and the
> > Mongo collection is small enough that you can read the whole
> > collection, you may want to fetch the IDs into a Java collection
> > using the BigQuery API directly, then turn them into a Beam
> > PCollection using Create.of(collection_of_ids). You could then use
> > MongoDbIO.read() to read the entire collection, but throw out rows
> > based on the side input of IDs.
> >
> > If the list of IDs is particularly small, you could fetch the
> > collection into memory and parse it into a string filter that you
> > pass to MongoDbIO.read() to specify which documents to fetch,
> > avoiding the need for a side input.
> >
> > Otherwise, if it's a large number of IDs, you may need to use Beam's
> > BigQueryIO to create a PCollection of the IDs, and then pass that
> > into a ParDo with a custom DoFn that issues Mongo queries for a batch
> > of IDs. I'm not very familiar with the Mongo APIs, but you'd need to
> > give the DoFn a connection to Mongo that's serializable. You could
> > likely look at the implementation of MongoDbIO for inspiration there.
> >
> > On Sun, Oct 21, 2018 at 5:18 AM Chaim Turkel <[email protected]> wrote:
> >>
> >> hi,
> >> I have the following flow I need to implement.
> >> From BigQuery I run a query and get a list of IDs; then I need to
> >> load from Mongo all the documents based on these IDs and export them
> >> as an XML file.
> >> How do you suggest I go about doing this?
> >>
> >> chaim
> >>
> >> --
> >> Loans are funded by FinWise Bank, a Utah-chartered bank located in
> >> Sandy, Utah, member FDIC, Equal Opportunity Lender. Merchant Cash
> >> Advances are made by Behalf. For more information on ECOA, click
> >> here <https://www.behalf.com/legal/ecoa/>. For important information
> >> about opening a new account, review Patriot Act procedures here
> >> <https://www.behalf.com/legal/patriot/>. Visit Legal
> >> <https://www.behalf.com/legal/> to review our comprehensive program
> >> terms, conditions, and disclosures.
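Jeff's first approach (read the whole Mongo collection with MongoDbIO.read(), then throw out rows not in the side input of IDs) boils down to a set-membership filter inside a DoFn. A minimal sketch of that filtering step in plain Java, outside a running Beam pipeline, with plain Maps standing in for Mongo documents (the `_id` field name and String IDs are assumptions for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class SideInputFilter {
    // Given the set of IDs fetched from BigQuery (the "side input" in
    // the Beam version), keep only the documents whose _id is in it.
    // In a real pipeline this predicate would live inside a DoFn's
    // processElement, reading the ID set from a side input view.
    public static List<Map<String, Object>> filterByIds(
            List<Map<String, Object>> documents, Set<String> ids) {
        return documents.stream()
                .filter(doc -> ids.contains(doc.get("_id")))
                .collect(Collectors.toList());
    }
}
```

Note that this still reads (and then discards) the full collection, which is why it only makes sense when the collection is small enough to scan.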
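Jeff's second approach (for a small ID list, build a string filter and hand it to MongoDbIO.read() so only matching documents are fetched) could use a MongoDB `$in` query document. A hedged sketch of building that JSON filter string; it assumes the IDs are plain strings and that `_id` is the field to match on:

```java
import java.util.List;
import java.util.stream.Collectors;

public class InFilterBuilder {
    // Build a MongoDB $in filter as a JSON string, e.g.
    // {"_id": {"$in": ["a", "b"]}}. IDs are assumed to be plain
    // strings with no characters that need JSON escaping.
    public static String inFilter(String field, List<String> ids) {
        String quoted = ids.stream()
                .map(id -> "\"" + id + "\"")
                .collect(Collectors.joining(", "));
        return "{\"" + field + "\": {\"$in\": [" + quoted + "]}}";
    }
}
```

The resulting string would then be passed to MongoDbIO.read() via its filter option, so the query is pushed down to Mongo rather than filtered in the pipeline.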
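Jeff's third approach (a custom DoFn that issues one Mongo query per batch of IDs) needs the ID stream grouped into fixed-size batches first. A minimal, hypothetical batching helper that the DoFn could use, shown outside Beam (in a real pipeline the grouping would more likely be done with a Beam transform such as GroupIntoBatches):

```java
import java.util.ArrayList;
import java.util.List;

public class IdBatcher {
    // Split a large ID list into fixed-size batches so downstream code
    // can issue one Mongo $in query per batch rather than one per ID.
    // The last batch may be smaller than the requested size.
    public static List<List<String>> batches(List<String> ids, int size) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size) {
            int end = Math.min(i + size, ids.size());
            out.add(new ArrayList<>(ids.subList(i, end)));
        }
        return out;
    }
}
```

As Jeff notes, the harder part is giving the DoFn a serializable Mongo connection, typically by creating the client lazily in a setup method rather than holding it as a field; the MongoDbIO source is a good reference for that.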
