lostluck commented on PR #26042: URL: https://github.com/apache/beam/pull/26042#issuecomment-1556418850
> I've been pondering `ProcessElement` in read_batch.go and the performance impact of re-reading rows across splits. Are there any other different split strategies that might work better or do you think the current approach is ok? So, this is a complicated question. As a rule, you're right. Re-reading and discarding repeatedly is likely to be a source of inefficiency. But that would require the runner in question to decide that the elements need sub splitting in the first place. If there's sufficient parallelism, it might not happen. A runner that doesn't do dynamic splitting might decide to eagerly have each eleemnt split, guaranteeing the bad behavior. A runner that can do dynamic splitting, might never split at that granularity, and never need to. BUT, if one of the rows is taking much longer to process, then it might want to have a given element split even at the cost of the re-read of rows. So, arguably the way to avoid it is we move getting the batch partitions into CreateInitialRestriction, which complicates the restriction into containing both a offset range, and a slice of the transactions. Then the "element" that's getting split is just the query directly. The splitRestriction can then make smaller version that contain only a single partition each, instead of the offset. And finally, if a split request happens at *that* level, we can then move to using offset ranges and similar against the partitioned queries. This avoids the rereading problem by default by only having initial splits be equal to the natural spanner partitioning, and only allows rereading when a sub element split needs to happen. But TBH that's defintely something worth doing in another PR, where we can benchmark this solution, and then be able to compare it against any new approach first. ------------- At this point, I'm going to be merging this in once the PostCommit passes (which is the Dataflow test set). If the current run fails, I'll filter out the test accordingly first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
