On Wed, Jul 27, 2022 at 1:39 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
> On Wed, Jul 27, 2022 at 11:10 AM Lina Mårtensson <lina@camus.energy>
> wrote:
>
>> Thanks Cham!
>>
>> Could you provide some more detail on your preference for developing a
>> Python wrapper rather than implementing a source purely in Python?
>>
>
> I've listed below the main advantages of developing a cross-language
> transform over natively implementing this in Python.
>
> * Reduced cost of development
>
> It's much easier to develop a cross-language wrapper of the Java source
> than to re-implement the source in Python. Sources are some of the most
> complex code we have in Beam, and sources control the parallelization of
> the pipeline (for example, splitting and dynamic work rebalancing for
> supported runners). So getting this code wrong can result in hard-to-track
> data loss/duplication issues.
> Additionally, based on my experience, it's very hard to get a source
> implementation correct and performant on the first try. It could take
> additional benchmarks/user feedback over time to get the source production
> ready.
> The Java BT source is already well battle-tested (actually we have two Java
> implementations [1][2] currently). So I would rather use a Java BT
> connector as a cross-language transform than re-implement sources for
> other SDKs.
>
> * Minimal maintenance cost
>
> Developing a source/sink is just a part of the story. We (as a community)
> have to maintain it over time and make sure that ongoing issues/feature
> requests are adequately handled. In the past, we have had cases where
> sources/sinks are available for multiple SDKs but one is significantly
> better than the others when it comes to the feature set (for example,
> BigQuery). Cross-language will make this easier and will allow us
> to maintain key logic in a single place.
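
To make the "splitting and dynamic work rebalancing" point above a bit more concrete, here is a minimal, stdlib-only sketch of the kind of bookkeeping a source has to get right. OffsetRangeTracker and its methods are hypothetical names made up for illustration; this is not Beam's API, only a toy model of claiming positions and handing the unclaimed tail to another worker.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OffsetRangeTracker:
    """Toy tracker for the half-open range [start, stop). Not a Beam class."""

    start: int
    stop: int
    last_claimed: Optional[int] = None

    def try_claim(self, position: int) -> bool:
        # A reader claims each position before emitting the record at it.
        # Claims at or beyond `stop` fail, which tells the reader to finish.
        if position >= self.stop:
            return False
        self.last_claimed = position
        return True

    def try_split(self, fraction: float) -> Optional[Tuple[int, int]]:
        # Keep `fraction` of the unclaimed work and give the rest away.
        next_pos = self.start if self.last_claimed is None else self.last_claimed + 1
        remaining = self.stop - next_pos
        split_at = next_pos + int(remaining * fraction)
        if split_at >= self.stop:
            return None  # nothing left to hand off
        residual = (split_at, self.stop)
        # Shrinking `stop` is what prevents the two workers from overlapping.
        self.stop = split_at
        return residual
```

For example, a tracker over [0, 10) that has claimed 0, 1 and 2 and is then split with fraction 0.5 keeps [0, 6) and returns (6, 10) as the residual. An off-by-one in either bound is exactly the kind of bug that shows up as lost or duplicated records.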
>
> Also, a shameless plug for my Beam Summit video on the subject :) -
> https://www.youtube.com/watch?v=bt5DMP9Cwz0
>
>>
>> If I look at the instructions for using the x-language Spanner
>> connector, then using this - from the user's perspective - would
>> involve installing a Java runtime.
>> That's not terrible, but I fear that getting this to work with bazel
>> might end up being more trouble than expected. (That has often
>> happened here, and we have enough trouble with getting Python 3.9 and
>> 3.10 to co-exist.)
>>
>
> From an end user perspective, all they should have to do is make sure that
> Java is available in the machine where the job is submitted from. Beam has
> features to allow starting up cross-language expansion services (that is
> needed during job submission) automatically so users should not have to do
> anything other than that.
>
> At job execution, Beam (portable) uses Docker-based SDK harness containers
> and we already release appropriate containers for each SDK. The runners
> should seamlessly download containers needed to execute the job.
>
> That said, the main downside of cross-language today is runner support.
> Cross-language transform support is only available for portable Beam
> runners (for example, Dataflow Runner v2) but this is the direction Beam
> runners are going anyway.
>
>>
>> There are a few of us at our small start-up that have written
>> MapReduces and similar in the past and are completely convinced by the
>> Beam/Dataflow model. But many others have no previous experience and
>> are skeptical, and see this new tool we're introducing as something
>> that's more trouble than it's worth, and something they'd rather avoid
>> - even when we see how lots of their use cases could be made much
>> easier using Beam. I'm worried that every extra hoop to jump through
>> will make it less likely to be widely used for us.
>> Because of that, my
>> bias would be towards having a Python connector rather than
>> x-language, and I would find it really helpful to learn about why you
>> both favor the x-language option.
>>
>
> I understand your concerns. It's certainly possible to develop the same
> connector in multiple SDKs (and we provide SDF source framework support in
> all SDK languages). But hopefully my comments above will give you an idea
> of the downsides of this approach :).
>
> Thanks,
> Cham
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java
> [2] https://cloud.google.com/bigtable/docs/hbase-dataflow-java
>
>>
>> Thanks!
>> -Lina
>>
>> On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>> >
>> > On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev <dev@beam.apache.org> wrote:
>> >>
>> >> Hi dev,
>> >>
>> >> We're starting to incorporate BigTable in our stack and I've delighted
>> >> my co-workers with how easy it was to create some BigTables with
>> >> Beam... but there doesn't appear to be a reader for BigTable in
>> >> Python.
>> >>
>> >> First off, is there a good reason why not/any reason why it would be difficult?
>> >
>> > There was a previous effort to implement a Python BT source but that was not completed:
>> > https://github.com/apache/beam/pull/11295#issuecomment-646378304
>> >
>> >>
>> >> I could write one, but before I start, I'd love some input to make it easier.
>> >>
>> >> It appears that there would be two options: either write one in
>> >> Python, or try to set one up with x-language from Java which I see is
>> >> done e.g. with the Spanner IO Connector.
>> >> Any recommendation on which one to pick or potential pitfalls in either choice?
>> >>
>> >> If I write one in Python, what should I think about?
>> >> It is not obvious to me how to achieve parallelization, so any tips
>> >> here would be welcome.
>> >
>> > I would strongly prefer developing a Python wrapper for the existing
>> > Java BT source using Beam's Multi-language Pipelines framework over
>> > developing a new Python source.
>> > https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>> >
>> > Thanks,
>> > Cham
>> >
>> >>
>> >> Thanks!
>> >> -Lina
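
On the parallelization question in the original message: one common starting point for a key-ordered store is to cut the key space into contiguous ranges and read each range independently. The sketch below is stdlib-only and purely illustrative. split_key_space and in_range are hypothetical helpers, and it assumes the boundary keys come from somewhere else (for instance a row-key sampling call on the Bigtable client, which is an assumption here, not something this thread confirms). It covers initial splitting only, not the dynamic rebalancing that the Java source handles.

```python
from typing import List, Tuple

KeyRange = Tuple[bytes, bytes]  # half-open [start, end); b"" means unbounded


def split_key_space(sample_keys: List[bytes]) -> List[KeyRange]:
    """Turn boundary keys into contiguous, non-overlapping key ranges."""
    # Drop empty keys and duplicates so every range is non-empty.
    boundaries = sorted({k for k in sample_keys if k})
    edges = [b""] + boundaries + [b""]
    # Pair each edge with its successor: (edge0, edge1), (edge1, edge2), ...
    return list(zip(edges[:-1], edges[1:]))


def in_range(key: bytes, start: bytes, end: bytes) -> bool:
    """True if `key` falls inside the half-open range [start, end)."""
    return start <= key and (end == b"" or key < end)
```

With boundary keys b"g" and b"p" this yields three ranges: everything before "g", "g" up to "p", and "p" onwards. Each range could then become one unit of work for a separate reader, and every row key belongs to exactly one of them.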