Re: [GSOC] Build out Beam Use Cases

Danny McCormick via dev Wed, 28 Feb 2024 07:29:25 -0800

Hey Nivaldo, thanks for reaching out and for your interest in contributing
to Beam in this way! I'll do my best to answer your questions below:


> 1. Would I actually be eligible to apply to GSOC for this project, or do
> I not count as an open source beginner anymore in this case? The total
> number of PRs and issues I've ever opened on Github would be below 10 as
> far as I'm aware. I've never worked formally as a Software Engineer, so I'd
> have a lot to learn from a mentor and would be looking forward to that.

I'm not sure; this is probably a better question to ask the GSOC
organizers directly [1]. I don't know exactly how they define a software
engineering professional [2], but it sounds like you have done at least
some professional work with Beam, so it would be good to get clarity on
that before you sink time into putting together an application.

> 2. I'd like to understand the scope and exact purpose of the use cases a
> bit better. Are they meant to serve more like standalone tutorials with
> purely mock data, or maybe more like reusable/adaptable examples where
> users can fit in their own data? Additionally, is my assessment correct
> that the implementation would consist basically of actual code, testing and
> documentation?

Ideally they would be standalone tutorials using real datasets that show
end to end how you could use Beam. If we need to mock some data, that is OK,
but the more we can rely on real open source datasets the better. Here's an
example [3] put together by a past contributor for a different transform. I
think the runnable notebook works really well for something like this,
though if you end up doing the project you'd certainly have major input on
the best way to present this if a notebook doesn't seem like the right fit.

> 3. Would it be possible to define what exactly would count as a "slowly
> changing source" for the purposes of the Enrichment use cases to be
> implemented?

Basically, the idea is that, given a streaming pipeline (something reading
from a source like Kafka or PubSub), users often want to do a lookup in a
more stable data source (like BigTable or a feature store) to enrich their
data. This data source might change over time (for example, features might
be added to your feature store), but the rate of change is much slower than
that of your streaming source, where you're getting a constant stream of
data. So a slowly changing source for Enrichment can basically be any sort
of queryable database where you might want to do key-value lookups.

For reference, I'd recommend looking at this notebook [4] - it doesn't have
as robust a use case as we're talking about for this project, but it can
give you an idea of how Enrichment can be used.
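
To make the lookup pattern concrete, here is a minimal sketch in plain
Python. It does not use Beam's Enrichment API: the in-memory table, the
field names, and the `enrich` helper are all hypothetical stand-ins for a
real streaming source and a real key-value store.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical stand-in for a slowly changing store (for example a
# key-value database or a feature store): key -> row of extra fields.
CUSTOMER_TABLE: Dict[str, Dict[str, str]] = {
    "c1": {"tier": "gold"},
    "c2": {"tier": "silver"},
}


def enrich(events: Iterable[dict], table: Dict[str, Dict[str, str]],
           key_field: str = "customer_id") -> Iterator[dict]:
    """Yield each event merged with the row looked up by its key."""
    for event in events:
        # A key that is missing from the table adds no extra fields.
        row = table.get(event[key_field], {})
        yield {**event, **row}


# Stand-in for elements arriving from a streaming source.
events = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": "c3", "amount": 5},
]
enriched = list(enrich(events, CUSTOMER_TABLE))
```

The table can be updated between lookups without touching the stream
logic, which is the sense in which the source is "slowly changing".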

> 4. Regarding the implementation of 1 or more additional Enrichment
> handlers for currently unsupported sources, we'd be looking into adding,
> for instance, something like a BigQueryEnrichmentHandler, is that correct?

Yes, that's right. I think there is other community work planned for
BigQuery specifically, which should be complete before the project start
date, but other possible sources include filesystem lookups (for example,
enriching data with the contents of a file from GCS or EFS), databases
(similar to BigQuery, where you want to do some sort of SQL-based lookup),
or feature stores (particularly pertinent for the ML use case; here's an
in-flight example [5]).
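
As a rough illustration of the filesystem case, the sketch below shows a
file-backed lookup in plain Python. It is an assumption-laden sketch, not
Beam's handler interface: the `CsvFileLookup` class, its constructor
arguments, and the choice of a local CSV file (rather than GCS or EFS) are
all hypothetical. It reloads the file only when its modification time
changes, which suits a source that changes slowly.

```python
import csv
import os
from typing import Dict, Optional


class CsvFileLookup:
    """Hypothetical file-backed lookup; not part of Beam's API."""

    def __init__(self, path: str, key_column: str):
        self._path = path
        self._key_column = key_column
        self._rows: Dict[str, Dict[str, str]] = {}
        self._loaded_mtime_ns: Optional[int] = None

    def _refresh_if_changed(self) -> None:
        # Re-read the file only if it was modified since the last load.
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._loaded_mtime_ns:
            with open(self._path, newline="") as f:
                self._rows = {
                    row[self._key_column]: dict(row)
                    for row in csv.DictReader(f)
                }
            self._loaded_mtime_ns = mtime_ns

    def __call__(self, request: dict) -> dict:
        """Return the request merged with the matching CSV row, if any."""
        self._refresh_if_changed()
        row = self._rows.get(request[self._key_column], {})
        return {**request, **row}
```

For example, with a file whose header is `id,country`, calling
`CsvFileLookup(path, "id")({"id": "u1", "clicks": 3})` returns the request
with the `country` value for `u1` merged in. A real handler would also need
to deal with remote filesystems, large files, and caching policy.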

Please let me know if you have any further questions!

Thanks,
Danny

[1] https://developers.google.com/open-source/gsoc/contact
[2]
https://developers.google.com/open-source/gsoc/faq#i_am_a_professional_software_engineer_but_i_have_not_participated_in_open_source_communities_before_am_i_eligible
[3]
https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/speech_emotion_tensorflow.ipynb
[4]
https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/bigtable_enrichment_transform.ipynb
[5] https://github.com/apache/beam/pull/30388

On Wed, Feb 28, 2024 at 9:12 AM Nivaldo Tokuda <nivaldo.humber...@gmail.com>
wrote:

> Hello!
>
> My name is Nivaldo, and I'd like to express my interest in joining this
> year's GSOC to add real world use cases for Beam's MLTransform/Enrichment
> transforms <https://issues.apache.org/jira/browse/GSOC-259>.
>
> ***About me***
> I am a Senior Data Engineer from Brazil, with 6-7 YOE in helping companies
> make the most out of their data. I've contributed to Beam in the past (See
> [1] <https://github.com/apache/beam/pull/23879> and [2]
> <https://github.com/apache/beam/issues/21089>), but I think I still fit
> the criteria of being a beginner in open source development (Using [3]
> <https://developers.google.com/open-source/gsoc/faq#how_do_i_know_if_i_am_considered_a_beginner_in_open_source_development>
> as a reference).
>
> Most notably I spent 2-3 months contributing to the creation of a Rust SDK
> for Beam, but due to unfortunate events, I abruptly stopped contributing. I
> was happy to see that some amazing members of the community have been able
> to fork the code I wrote and continue from there. Part of the reason I had
> for that contribution was to prepare a career transition into Software
> Engineering, but I also had to put that goal on hold at the time. Recently,
> my circumstances have changed and I have been preparing to continue with a
> more domain-specific version of this goal, more directed towards machine
> learning. Working on this project would be an excellent way to increment my
> portfolio, learn relevant skills and contribute to the Beam community.
>
> I learned a lot about Beam's internals and fundamental concepts while
> working on the Rust SDK (See [4]
> <https://github.com/apache/beam/compare/master...nivaldoh:beam:rust_sdk>
> for my commits), and I think this knowledge would give me a nice headstart
> to work with the ML transforms. Briefly speaking, I also have some
> experience working with Beam professionally (See [5]
> <https://github.com/google/megalista/pull/12>), and I have two official
> Google Cloud certifications (Professional Data Engineer and Professional ML
> Engineer). I have a bachelor's degree in CS, and there's a chance I might
> start a Master's degree program in CS/AI this summer/fall (pending
> university decisions).
>
> ***Questions***
> 1. Would I actually be eligible to apply to GSOC for this project, or do I
> not count as an open source beginner anymore in this case? The total number
> of PRs and issues I've ever opened on Github would be below 10 as far as
> I'm aware. I've never worked formally as a Software Engineer, so I'd have a
> lot to learn from a mentor and would be looking forward to that.
>
> 2. I'd like to understand the scope and exact purpose of the use cases a
> bit better. Are they meant to serve more like standalone tutorials with
> purely mock data, or maybe more like reusable/adaptable examples where
> users can fit in their own data? Additionally, is my assessment correct
> that the implementation would consist basically of actual code, testing and
> documentation?
>
> 3. Would it be possible to define what exactly would count as a "slowly
> changing source" for the purposes of the Enrichment use cases to be
> implemented?
>
> 4. Regarding the implementation of 1 or more additional Enrichment
> handlers for currently unsupported sources, we'd be looking into adding,
> for instance, something like a BigQueryEnrichmentHandler, is that correct?
>
>
> Thank you for reading this.
>
>
> ***References***
>
> [1]: https://github.com/apache/beam/pull/23879
>
> [2]: https://github.com/apache/beam/issues/21089
>
> [3]:
> https://developers.google.com/open-source/gsoc/faq#how_do_i_know_if_i_am_considered_a_beginner_in_open_source_development
>
> [4]:
> https://github.com/apache/beam/compare/master...nivaldoh:beam:rust_sdk
>
> [5]: https://github.com/google/megalista/pull/12
>
