Re: BigTable reader for Python?

Chamikara Jayalath via dev Wed, 27 Jul 2022 13:39:50 -0700

On Wed, Jul 27, 2022 at 11:10 AM Lina Mårtensson <[email protected]> wrote:

> Thanks Cham!
>
> Could you provide some more detail on your preference for developing a
> Python wrapper rather than implementing a source purely in Python?
>

I've mentioned the main advantages of developing a cross-language transform
over natively implementing this in Python below.

* Reduced cost of development

It's much easier to develop a cross-language wrapper of the Java source
than to re-implement the source in Python. Sources are some of the most
complex code we have in Beam, and they control the parallelization of the
pipeline (for example, splitting and dynamic work rebalancing for supported
runners). So getting this code wrong can result in hard-to-track data
loss/duplication issues.
Additionally, based on my experience, it's very hard to get a source
implementation correct and performant on the first try. It could take
additional benchmarks/user feedback over time to get the source production
ready.
The Java BT source is already well battle-tested (we actually have two Java
implementations [1][2] currently). So I would rather use a Java BT
connector as a cross-language transform than re-implement sources for
other SDKs.
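
To make the splitting point concrete, here is a minimal sketch in plain
Python of the bookkeeping a source needs when a runner asks it to give away
part of its remaining work. This is not Beam's actual implementation: the
class name ToyOffsetTracker and its methods are hypothetical and exist only
for illustration. An off-by-one in the split position is the kind of bug
that silently drops or duplicates records.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyOffsetTracker:
    """Hypothetical tracker for the half-open offset range [start, stop)."""

    start: int
    stop: int
    last_claimed: Optional[int] = None

    def try_claim(self, position: int) -> bool:
        # A reader should only emit the record at `position` if this succeeds.
        if position >= self.stop:
            return False
        self.last_claimed = position
        return True

    def try_split(self, fraction_of_remainder: float) -> Optional[Tuple[int, int]]:
        """Hand off the tail of the unclaimed work; return it, or None if empty."""
        fraction = min(max(fraction_of_remainder, 0.0), 1.0)
        first_unclaimed = self.start if self.last_claimed is None else self.last_claimed + 1
        remaining = self.stop - first_unclaimed
        split_at = first_unclaimed + int(remaining * fraction)
        if split_at >= self.stop:
            return None  # nothing left to give away
        residual = (split_at, self.stop)
        # Shrink our own range so the two ranges neither overlap nor leave a gap.
        self.stop = split_at
        return residual


# Usage: claim offsets 0..9, then give away half of what is left.
tracker = ToyOffsetTracker(start=0, stop=100)
for pos in range(10):
    tracker.try_claim(pos)
residual = tracker.try_split(0.5)
```

After the split, the tracker may only claim offsets below its new stop, and
the returned range can be processed by another worker.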

* Minimal maintenance cost

Developing a source/sink is just a part of the story. We (as a community)
have to maintain it over time and make sure that ongoing issues/feature
requests are adequately handled. In the past, we have had cases where
sources/sinks are available for multiple SDKs but one is significantly
better than the others when it comes to the feature set (for example,
BigQuery). Cross-language will make this easier and will allow us to
maintain key logic in a single place.

>
> If I look at the instructions for using the x-language Spanner
> connector, then using this - from the user's perspective - would
> involve installing a Java runtime.
> That's not terrible, but I fear that getting this to work with bazel
> might end up being more trouble than expected. (That has often
> happened here, and we have enough trouble with getting Python 3.9 and
> 3.10 to co-exist.)
>

From an end-user perspective, all they should have to do is make sure that
Java is available on the machine the job is submitted from. Beam can
automatically start the cross-language expansion services that are needed
during job submission, so users should not have to do anything beyond that.
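
As a rough illustration, a submission script could run a pre-flight check
like the one below before building the pipeline. This is only a sketch: the
helper names find_java and require_java are hypothetical, and it assumes
that "Java is available" means a `java` executable can be found on PATH,
which may not match exactly how Beam locates Java.

```python
import shutil
from typing import Optional


def find_java(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the `java` executable, or None if not found.

    `search_path` defaults to the PATH environment variable.
    """
    return shutil.which("java", path=search_path)


def require_java(search_path: Optional[str] = None) -> str:
    """Raise a readable error early instead of failing during job submission."""
    java = find_java(search_path)
    if java is None:
        raise RuntimeError(
            "No `java` executable found on PATH; a Java runtime is needed on "
            "the machine that submits a cross-language pipeline."
        )
    return java
```

Calling require_java() at the top of the submission script surfaces a
missing Java runtime before any pipeline construction starts.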

At job execution, Beam (portable) uses Docker-based SDK harness containers
and we already release appropriate containers for each SDK. The runners
should seamlessly download containers needed to execute the job.

That said, the main downside of cross-language today is runner support.
Cross-language transform support is only available for portable Beam
runners (for example, Dataflow Runner v2) but this is the direction Beam
runners are going anyway.

>
> There are a few of us at our small start-up that have written
> MapReduces and similar in the past and are completely convinced by the
> Beam/Dataflow model. But many others have no previous experience and
> are skeptical, and see this new tool we're introducing as something
> that's more trouble than it's worth, and something they'd rather avoid
> - even when we see how lots of their use cases could be made much
> easier using Beam. I'm worried that every extra hoop to jump through
> will make it less likely to be widely used for us. Because of that, my
> bias would be towards having a Python connector rather than
> x-language, and I would find it really helpful to learn about why you
> both favor the x-language option.
>

I understand your concerns. It's certainly possible to develop the same
connector in multiple SDKs (and we provide SDF source framework support in
all SDK languages). But hopefully my comments above will give you an idea
of the downsides of this approach :).

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java
[2] https://cloud.google.com/bigtable/docs/hbase-dataflow-java

>
> Thanks!
> -Lina
>
> On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath <[email protected]>
> wrote:
> >
> >
> >
> > On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev
> > <[email protected]> wrote:
> >>
> >> Hi dev,
> >>
> >> We're starting to incorporate BigTable in our stack and I've delighted
> >> my co-workers with how easy it was to create some BigTables with
> >> Beam... but there doesn't appear to be a reader for BigTable in
> >> Python.
> >>
> >> First off, is there a good reason why not/any reason why it would be
> >> difficult?
> >
> >
> > There was a previous effort to implement a Python BT source, but it
> > was not completed:
> > https://github.com/apache/beam/pull/11295#issuecomment-646378304
> >
> >>
> >>
> >> I could write one, but before I start, I'd love some input to make it
> >> easier.
> >>
> >> It appears that there would be two options: either write one in
> >> Python, or try to set one up with x-language from Java which I see is
> >> done e.g. with the Spanner IO Connector.
> >> Any recommendation on which one to pick or potential pitfalls in either
> >> choice?
> >>
> >> If I write one in Python, what should I think about?
> >> It is not obvious to me how to achieve parallelization, so any tips
> >> here would be welcome.
> >
> >
> > I would strongly prefer developing a Python wrapper for the existing
> > Java BT source using Beam's Multi-language Pipelines framework over
> > developing a new Python source.
> >
> > https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
> >
> > Thanks,
> > Cham
> >
> >
> >>
> >>
> >> Thanks!
> >> -Lina
>
