lostluck commented on PR #24364:
URL: https://github.com/apache/beam/pull/24364#issuecomment-1331378193

   If you're using the ["go native bigquery" 
source](https://pkg.go.dev/github.com/apache/beam/sdks/[email protected]/go/pkg/beam/io/bigqueryio)
 and not the [xlang bigquery 
source](https://pkg.go.dev/github.com/apache/beam/sdks/[email protected]/go/pkg/beam/io/xlang/bigqueryio),
 that's the problem. The Go native bigqueryio doesn't currently scale and 
hasn't been performance vetted. It's extremely old and needs a rewrite for 
scalability, since it pre-dates the scaling mechanism. See below the break for 
an explanation.
   
   If you can, try again with the [Xlang BigQuery 
IO](https://pkg.go.dev/github.com/apache/beam/sdks/[email protected]/go/pkg/beam/io/xlang/bigqueryio),
 which uses the Java connector under the hood. It'll download jars and spin up 
an expansion service to add it to your pipeline. Assuming it works, it will 
handle all the scaling and such as needed. If there are problems, please file 
an issue so we can fix them.
   
   --------
   
   So, because this is a sink, scaling is constrained by whatever the source 
of the data is. If you have a well-authored SplittableDoFn as the source, or a 
GBK, Dataflow can dynamically split.
   
   Eg. If you have
   Impulse -> Basic DoFn -> Write
   
   All of this will stay on a single machine, and adding workers won't help 
(it doesn't scale).
   
   The basic workaround is the GBK.
   
   Impulse -> Basic DoFn -> GBK -> Write 
   
   This allows the post-GBK stage to scale based on key parallelism.
   
   But if it's
   
   Impulse -> SplittableDoFn -> Write
   
   Then the runner (like Dataflow) can split and effectively shard that work. 
This is how the current textio read works (the Write needs more work, 
though).
   
   
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/textio/textio.go#L128
   
   E.g. textio implements its 
[read](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/textio/textio.go#L83)
 composite by first finding all the files for a glob, then getting the sizes 
for all of them (so it's filename + byte size pairs), and then using a 
splittable DoFn so that each element can be sub-split based on its 
restriction (e.g. in case a single file is very large).
   
   You can learn more about [SDFs in the programming 
guide.](https://beam.apache.org/documentation/programming-guide/#splittable-dofns)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
