On Wed, Jul 10, 2019 at 9:56 AM Rui Wang <[email protected]> wrote:
> The second link points to the first join utility in Beam. The idea is
> similar: people can use the utility to do joins without writing their own.
> BeamSQL also uses it.
>
> The first link points to the Schema API. I actually thought the Schema API
> also used the join utility, but it turns out it doesn't (I am not sure what
> the reason is, though).
>
The Schema one is more general as well, in that it supports joining N inputs.
>
> Basically I think it's encouraged to reuse the same join utility if
> possible.
>
> -Rui
>
> On Wed, Jul 10, 2019 at 8:01 AM Shannon Duncan <[email protected]> wrote:
>
>> So it seems that the Java SDK has two different Join libraries?
>>
>> With Schema:
>> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>> And another one:
>> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java
>>
>> So how does it handle that?
>>
>> On Mon, Jul 8, 2019 at 12:39 PM Shannon Duncan <[email protected]> wrote:
>>
>>> Yeah, these are for local testing right now. I was hoping to gain insight
>>> on better naming.
>>>
>>> I was thinking of creating an "extras" module.
>>>
>>> On Mon, Jul 8, 2019, 12:28 PM Robin Qiu <[email protected]> wrote:
>>>
>>>> Hi Shannon,
>>>>
>>>> Thanks for sharing the repo! I took a quick look and I have a concern
>>>> with the naming of the transforms.
>>>>
>>>> Currently, Beam Java already has "Select" and "Join" transforms.
>>>> However, they work on schemas, a feature that is not yet implemented in
>>>> Beam Python. (See
>>>> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>>> )
>>>>
>>>> To maintain consistency between SDKs, I think it is good to avoid
>>>> having two different transforms with the same name but different functions.
>>>> So maybe you can consider renaming the transforms and/or putting them in an
>>>> extension Python module, instead of the main ones?
>>>>
>>>> Best,
>>>> Robin
>>>>
>>>> On Mon, Jul 8, 2019 at 9:19 AM Shannon Duncan <[email protected]> wrote:
>>>>
>>>>> As a follow up.
>>>>> Here is the repo that contains the utilities for now.
>>>>> https://github.com/shadowcodex/apache-beam-utilities. Will put
>>>>> together a proper PR as the code gets closer to production quality.
>>>>>
>>>>> - Shannon
>>>>>
>>>>> On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan <[email protected]> wrote:
>>>>>
>>>>>> Thanks Frederik,
>>>>>>
>>>>>> That's exactly where I was looking. I did get permission to open
>>>>>> source the utilities module. So I'm going to throw them up on my personal
>>>>>> GitHub soon and share with the email group for a look over.
>>>>>>
>>>>>> I'm going to work on the utilities there because it's a quick dev
>>>>>> environment, and then once they are ready for a proper PR I'll begin
>>>>>> working them into the actual SDK.
>>>>>>
>>>>>> I also joined the Slack #beam and #beam-python channels, as I was unsure
>>>>>> where most collaborators discussed items.
>>>>>>
>>>>>> - Shannon
>>>>>>
>>>>>> On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Shannon,
>>>>>>>
>>>>>>> This is probably a good starting point:
>>>>>>> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
>>>>>>>
>>>>>>> Frederik
>>>>>>>
>>>>>>> Frederik Bode
>>>>>>> ML6 Ghent
>>>>>>>
>>>>>>> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm sure I could use some of the existing aggregations as a guide
>>>>>>>> on how to make aggregations to fill the gap of missing ones, such as
>>>>>>>> creating Sum/Max/Min.
>>>>>>>>
>>>>>>>> GroupBy is really already handled with GroupByKey and CoGroupByKey,
>>>>>>>> unless you are thinking of a different type of GroupBy?
>>>>>>>>
>>>>>>>> - Shannon
>>>>>>>>
>>>>>>>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>>>>>>>
>>>>>>>>> -Rui
>>>>>>>>>
>>>>>>>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Valentyn,
>>>>>>>>>>
>>>>>>>>>> I'll outline the utilities and accept any suggestions to add /
>>>>>>>>>> modify. These are really just shortcut PTransforms that I am working
>>>>>>>>>> on to simplify creating pipelines.
>>>>>>>>>>
>>>>>>>>>> Currently the utilities contain the following PTransforms:
>>>>>>>>>>
>>>>>>>>>> - Inner Join
>>>>>>>>>> - Left Outer Join
>>>>>>>>>> - Right Outer Join
>>>>>>>>>> - Full Outer Join
>>>>>>>>>> - PrepareKey (for selecting items in a dictionary to act as a key
>>>>>>>>>>   for the joins)
>>>>>>>>>> - Select (a very simple filter that returns only the items you want
>>>>>>>>>>   from the dictionary; allows for defining a default nullValue)
>>>>>>>>>>
>>>>>>>>>> Currently these operations only work with dictionaries, but I'd
>>>>>>>>>> be interested to see how it would work for <K,V> tuples.
>>>>>>>>>>
>>>>>>>>>> I'm new to Python so they may not be optimized or the best way,
>>>>>>>>>> but from my understanding these seem to be the best way to do these
>>>>>>>>>> types of operations. Essentially I created a pipeline to be able to
>>>>>>>>>> convert a simple SQL query into a flow of these utilities.
>>>>>>>>>> Using PrepareKey to define your joining key, joining, and then
>>>>>>>>>> selecting from the join allows you to do a lot of powerful
>>>>>>>>>> manipulation in a simple / familiar way.
>>>>>>>>>>
>>>>>>>>>> If this is something that we'd like to add to the Beam SDK I
>>>>>>>>>> don't mind looking at the contributor license agreement, and
>>>>>>>>>> conversing more on how to get them in.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shannon
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Shannon,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for considering a contribution to the Beam Python SDK. With a
>>>>>>>>>>> direct contribution to the Beam SDK, your change will reach a larger
>>>>>>>>>>> audience of users, and you will not have to maintain a separate
>>>>>>>>>>> project and keep it up to date with new releases of Beam.
>>>>>>>>>>>
>>>>>>>>>>> I encourage you to take a look at
>>>>>>>>>>> https://beam.apache.org/contribute/ for general advice on how
>>>>>>>>>>> to get started. To echo some points mentioned in the guide:
>>>>>>>>>>>
>>>>>>>>>>> - If your change is large or it is your first change, it is a
>>>>>>>>>>>   good idea to discuss it on the dev@ mailing list.
>>>>>>>>>>> - For large changes, create a design doc (template, examples) and
>>>>>>>>>>>   email it to the dev@ mailing list.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Valentyn
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I have been writing a bunch of utilities for the Python SDK
>>>>>>>>>>>> such as joins, selections, composite transforms, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> I am working with my company to see if I can open source the
>>>>>>>>>>>> utilities. Would it be best to post them on a separate PyPI
>>>>>>>>>>>> project, or to PR them into the Beam SDK?
>>>>>>>>>>>> I assume if they let me open source it they will
>>>>>>>>>>>> want some attribution or something like that.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Shannon
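
For readers following the thread, the dictionary-based flow Shannon describes (PrepareKey, then a join, then Select) can be sketched roughly as below. This is a plain-Python illustration only, not Beam code and not the code from the apache-beam-utilities repo: the names prepare_key, inner_join, and select and the sample data are hypothetical. In an actual pipeline the grouping step would presumably be done with CoGroupByKey, as mentioned in the thread.

```python
from collections import defaultdict
from itertools import product


def prepare_key(row, key_fields):
    """Pair a dict row with a tuple key built from the chosen fields (hypothetical helper)."""
    return tuple(row[f] for f in key_fields), row


def inner_join(left, right):
    """Inner-join two iterables of (key, dict) pairs on their keys."""
    grouped = defaultdict(lambda: ([], []))
    for key, row in left:
        grouped[key][0].append(row)
    for key, row in right:
        grouped[key][1].append(row)
    # A key missing on either side yields an empty product, so it emits nothing.
    for lefts, rights in grouped.values():
        for l_row, r_row in product(lefts, rights):
            # On a field-name collision the right-hand value wins.
            yield {**l_row, **r_row}


def select(row, fields, null_value=None):
    """Keep only the requested fields, filling absent ones with null_value."""
    return {f: row.get(f, null_value) for f in fields}


# Made-up sample data for illustration.
users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "total": 30}, {"id": 1, "total": 12}, {"id": 3, "total": 5}]

keyed_users = [prepare_key(u, ["id"]) for u in users]
keyed_orders = [prepare_key(o, ["id"]) for o in orders]

joined = [
    select(r, ["name", "total", "city"], null_value="n/a")
    for r in inner_join(keyed_users, keyed_orders)
]
print(joined)
```

The left, right, and full outer variants would differ only in what is emitted when one side of a key group is empty.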
