On Wed, Jul 10, 2019 at 9:56 AM Rui Wang <[email protected]> wrote:

> The second link points to the first join utility in Beam. The idea is
> similar: people can use the utility to do joins without writing their own.
> BeamSQL also uses it.
>
> The first link points to the Schema API. I actually thought the Schema API
> also used the join utility, but it turns out it doesn't (I am not sure what
> the reason is, though).
>

The Schema one is more general as well, in that it supports joining N
inputs.
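
To make "joining N inputs" concrete, here is a minimal plain-Python sketch
of the idea. It is not Beam code and not how the Schema transforms are
implemented; the names cogroup and inner_join_n are made up for
illustration. Every input is grouped by key in one pass, and the inner join
emits one row per combination of matching values.

```python
from collections import defaultdict
from itertools import product


def cogroup(*inputs):
    """Group N lists of (key, value) pairs by key.

    Returns {key: [values_from_input_0, ..., values_from_input_N-1]}.
    Hypothetical helper, not a Beam API.
    """
    grouped = defaultdict(lambda: [[] for _ in inputs])
    for index, pairs in enumerate(inputs):
        for key, value in pairs:
            grouped[key][index].append(value)
    return dict(grouped)


def inner_join_n(*inputs):
    """Inner join across N inputs (hypothetical helper, not a Beam API)."""
    results = []
    for key, groups in cogroup(*inputs).items():
        # product() yields nothing if any group is empty, so keys that are
        # missing from at least one input are dropped, as in an inner join.
        for combo in product(*groups):
            results.append((key, combo))
    return results


users = [("u1", "Ann"), ("u2", "Bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u3", "ink")]
cities = [("u1", "Oslo"), ("u2", "Rome")]

joined = inner_join_n(users, orders, cities)
for row in joined:
    print(row)
```

Only u1 appears in all three inputs, so only u1 rows are emitted. A
two-input utility would have to be applied twice to get the same result.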


>
> Basically I think it's encouraged to reuse the same join utility if
> possible.
>
> -Rui
>
> On Wed, Jul 10, 2019 at 8:01 AM Shannon Duncan <[email protected]>
> wrote:
>
>> So it seems that the Java SDK has two different Join libraries?
>>
>> With Schema:
>> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>> And Another one:
>> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java
>>
>> So how does it handle that?
>>
>> On Mon, Jul 8, 2019 at 12:39 PM Shannon Duncan <
>> [email protected]> wrote:
>>
>>> Yeah these are for local testing right now. I was hoping to gain insight
>>> on better naming.
>>>
>>> I was thinking of creating an "extras" module.
>>>
>>> On Mon, Jul 8, 2019, 12:28 PM Robin Qiu <[email protected]> wrote:
>>>
>>>> Hi Shannon,
>>>>
>>>> Thanks for sharing the repo! I took a quick look and I have a concern
>>>> with the naming of the transforms.
>>>>
>>>> Currently, Beam Java already has "Select" and "Join" transforms.
>>>> However, they work on schemas, a feature that is not yet implemented in
>>>> Beam Python. (See
>>>> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>>> )
>>>>
>>>> To maintain consistency between SDKs, I think it is good to avoid
>>>> having two different transforms with the same name but different functions.
>>>> So maybe you can consider renaming the transforms and/or putting them in an
>>>> extension Python module, instead of the main ones?
>>>>
>>>> Best,
>>>> Robin
>>>>
>>>> On Mon, Jul 8, 2019 at 9:19 AM Shannon Duncan <
>>>> [email protected]> wrote:
>>>>
>>>>> As a follow-up, here is the repo that contains the utilities for now:
>>>>> https://github.com/shadowcodex/apache-beam-utilities. Will put
>>>>> together a proper PR as code gets closer to production quality.
>>>>>
>>>>> - Shannon
>>>>>
>>>>> On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Frederik,
>>>>>>
>>>>>> That's exactly where I was looking. I did get permission to open
>>>>>> source the utilities module. So I'm going to throw them up on my personal
>>>>>> github soon and share with the email group for a look over.
>>>>>>
>>>>>> I'm going to work on the utilities there because it's a quick dev
>>>>>> environment, and once they are ready I'll begin working them into the
>>>>>> actual SDK for a proper PR.
>>>>>>
>>>>>> I also joined the Slack #beam and #beam-python channels, as I was unsure
>>>>>> where most collaborators discuss things.
>>>>>>
>>>>>> - Shannon
>>>>>>
>>>>>> On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Shannon,
>>>>>>>
>>>>>>> This is probably a good starting point:
>>>>>>> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
>>>>>>> .
>>>>>>>
>>>>>>> Frederik
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I'm sure I could use some of the existing aggregations as a guide
>>>>>>>> on how to write the missing ones, such as Sum/Max/Min.
>>>>>>>>
>>>>>>>> GroupBy is really already handled with GroupByKey and CoGroupByKey
>>>>>>>> unless you are thinking of a different type of GroupBy?
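
For reference, a minimal plain-Python sketch of what Sum/Max/Min per key
amount to once values are grouped by key. This only illustrates the
semantics and is not Beam code; group_by_key and combine_per_key are
hypothetical helpers written for this example. In a real pipeline the
grouping would be done by the runner, for instance via
beam.CombinePerKey(sum).

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_per_key(pairs, combine_fn):
    """Apply combine_fn (e.g. sum, max, min) to each key's list of values."""
    return {
        key: combine_fn(values)
        for key, values in group_by_key(pairs).items()
    }


sales = [("a", 3), ("b", 5), ("a", 7), ("b", 1)]

totals = combine_per_key(sales, sum)
largest = combine_per_key(sales, max)
smallest = combine_per_key(sales, min)
print(totals, largest, smallest)
```

Since the built-ins sum, max and min each take an iterable, one helper
covers all three aggregations.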
>>>>>>>>
>>>>>>>> - Shannon
>>>>>>>>
>>>>>>>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Rui
>>>>>>>>>
>>>>>>>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Valentyn,
>>>>>>>>>>
>>>>>>>>>> I'll outline the utilities and accept any suggestions to add /
>>>>>>>>>> modify. These are really just shortcut PTransforms that I am working
>>>>>>>>>> on to simplify creating pipelines.
>>>>>>>>>>
>>>>>>>>>> Currently the utilities contain the following PTransforms:
>>>>>>>>>>
>>>>>>>>>> - Inner Join
>>>>>>>>>> - Left Outer Join
>>>>>>>>>> - Right Outer Join
>>>>>>>>>> - Full Outer Join
>>>>>>>>>> - PrepareKey (For selecting items in a dictionary to act as a key
>>>>>>>>>> for the joins)
>>>>>>>>>> - Select (very simple filter that returns only items you want
>>>>>>>>>> from the dictionary) (allows for defining a default nullValue)
>>>>>>>>>>
>>>>>>>>>> Currently these operations only work with dictionaries, but I'd
>>>>>>>>>> be interested to see how it would work for <K,V> tuples.
>>>>>>>>>>
>>>>>>>>>> I'm new to Python, so they may not be optimized or idiomatic,
>>>>>>>>>> but from my understanding this seems to be the best way to do these
>>>>>>>>>> types of operations. Essentially I created a pipeline to be able to
>>>>>>>>>> convert a simple SQL query into a flow of these utilities. Using
>>>>>>>>>> PrepareKey to define your joining key, joining, and then selecting
>>>>>>>>>> from the join allows you to do a lot of powerful manipulation in a
>>>>>>>>>> simple / familiar way.
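
The PrepareKey, join, Select flow described above can be sketched in plain
Python roughly as follows. This is an illustration under assumptions, not
the code from the utilities repo and not a Beam API: the function names,
the tuple-shaped key and the rule that left-hand fields win on a name clash
are all choices made for this example.

```python
from collections import defaultdict


def prepare_key(rows, key_fields):
    """Pair each dict with a tuple of the chosen fields as its join key."""
    return [(tuple(row[f] for f in key_fields), row) for row in rows]


def left_outer_join(left, right):
    """Join two lists of (key, dict); unmatched left rows are kept as-is."""
    right_by_key = defaultdict(list)
    for key, row in right:
        right_by_key[key].append(row)
    joined = []
    for key, left_row in left:
        # .get() does not create an entry; fall back to one empty match.
        matches = right_by_key.get(key) or [{}]
        for right_row in matches:
            # Left-hand fields take precedence when names collide.
            joined.append({**right_row, **left_row})
    return joined


def select(rows, fields, null_value=None):
    """Keep only the requested fields, filling gaps with null_value."""
    return [{f: row.get(f, null_value) for f in fields} for row in rows]


users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "item": "book"}]

# Roughly: SELECT name, item FROM users LEFT JOIN orders USING (id)
rows = select(
    left_outer_join(prepare_key(users, ["id"]), prepare_key(orders, ["id"])),
    ["name", "item"],
    null_value="n/a",
)
print(rows)
```

Bob has no order, so his item comes back as the default null value.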
>>>>>>>>>>
>>>>>>>>>> If this is something that we'd like to add to the Beam SDK I
>>>>>>>>>> don't mind looking at the contributor license agreement, and
>>>>>>>>>> conversing more on how to get them in.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shannon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Shannon,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for considering a contribution to the Beam Python SDK. With a
>>>>>>>>>>> direct contribution to the Beam SDK, your change will reach a larger
>>>>>>>>>>> audience of users, and you will not have to maintain a separate
>>>>>>>>>>> project and keep it up to date with new releases of Beam.
>>>>>>>>>>>
>>>>>>>>>>> I encourage you to take a look at
>>>>>>>>>>> https://beam.apache.org/contribute/ for general advice on how
>>>>>>>>>>> to get started. To echo some points mentioned in the guide:
>>>>>>>>>>>
>>>>>>>>>>> - If your change is large or it is your first change, it is a
>>>>>>>>>>> good idea to discuss it on the dev@ mailing list
>>>>>>>>>>> - For large changes create a design doc (template, examples) and
>>>>>>>>>>> email it to the dev@ mailing list.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Valentyn
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I have been writing a bunch of utilities for the Python SDK
>>>>>>>>>>>> such as joins, selections, composite transforms, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> I am working with my company to see if I can open source the
>>>>>>>>>>>> utilities. Would it be best to post them on a separate PyPI
>>>>>>>>>>>> project, or to PR them into the Beam SDK? I assume if they let me
>>>>>>>>>>>> open source it they will want some attribution or something like
>>>>>>>>>>>> that.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Shannon
>>>>>>>>>>>>
>>>>>>>>>>>
