lostluck commented on PR #24364: URL: https://github.com/apache/beam/pull/24364#issuecomment-1331378193
If it's the ["go native bigquery" source](https://pkg.go.dev/github.com/apache/beam/sdks/[email protected]/go/pkg/beam/io/bigqueryio) and not the [xlang bigquery source](https://pkg.go.dev/github.com/apache/beam/sdks/[email protected]/go/pkg/beam/io/xlang/bigqueryio) that's the problem. The Go Native bigqueryio doesn't currently scale, and has not been performance vetted. It's extremely old and needs a re-write for scalability, since it pre-dates the scaling mechanism. See below the break for explanations. If you can, try again with the [Xlang BigQuery IO](https://pkg.go.dev/github.com/apache/beam/sdks/[email protected]/go/pkg/beam/io/xlang/bigqueryio) which uses the Java one under the hood. It'll download jars and spin up an expansion service to add it to your pipeline. Assuming it works, it will do all the scaling and such as needed. If there are problems, please file an issue so we can fix them. -------- So, because this is a sink, the Scaling would be constrained by whatever the source of the data is. So if you have a well authored SplittableDoFn as the source or a GBK, Dataflow can dynamically split. Eg. If you have Impulse -> Basic DoFn -> Write All of this will stay on a single machine, and scaling won't help (and it doesn't scale.) The basic workaround is the GBK. Impulse -> Basic DoFn -> GBK -> Write Which allows the post GBK bit to scale based on Key parallelism. But if it's Impulse -> SplittableDoFn -> Write Then the runner (like Dataflow), can split and basically allow sharding of that work. This is how the current textio read works (the Write needs more work though). https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/textio/textio.go#L128 Eg. The textio implements it's [read](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/textio/textio.go#L83) composite by first finding all the files for a glob, then getting the sizes for all of them (so it's filename + byte size pairs), and then uses a splittable DoFn so that each element there can be sub-split, based on the restrictions (eg. incase a single file is very very large). You can learn more about [SDFs in the programming guide.](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
