lostluck commented on PR #26042:
URL: https://github.com/apache/beam/pull/26042#issuecomment-1556418850

   > I've been pondering `ProcessElement` in read_batch.go and the performance 
impact of re-reading rows across splits. Are there any other different split 
strategies that might work better or do you think the current approach is ok?
   
   So, this is a complicated question. As a rule, you're right: re-reading and discarding rows repeatedly is likely to be a source of inefficiency. But that only matters if the runner in question decides the elements need sub-splitting in the first place. If there's sufficient parallelism, it might not happen at all.
   
   A runner that doesn't do dynamic splitting might decide to eagerly split each element, guaranteeing the bad behavior. A runner that can do dynamic splitting might never split at that granularity, and never need to. BUT, if one of the rows is taking much longer to process, then it might want to have a given element split even at the cost of re-reading rows.
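
   To make that cost concrete, here's a rough sketch of the shape of the problem (illustrative only, not the actual read_batch.go code; `readFn`, its `client` field, and the plain `sql` element are stand-ins, and a real transform would convert rows to an encodable type before emitting): because the row iterator can only be consumed from the front, any restriction that doesn't start at row 0 has to read and discard everything before it.

```go
package spannersketch

import (
	"context"

	"cloud.google.com/go/spanner"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/core/sdf"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/rtrackers/offsetrange"
	"google.golang.org/api/iterator"
)

// readFn stands in for the SDF in read_batch.go; only the overall shape matters here.
type readFn struct {
	client *spanner.Client
}

// ProcessElement processes an offset range of result rows for one query element.
// The row iterator has no random access, so rows before rest.Start are read and
// thrown away -- the re-read cost paid whenever an element gets sub-split.
func (fn *readFn) ProcessElement(ctx context.Context, rt *sdf.LockRTracker, sql string, emit func(*spanner.Row)) error {
	rest := rt.GetRestriction().(offsetrange.Restriction)

	it := fn.client.Single().Query(ctx, spanner.Statement{SQL: sql})
	defer it.Stop()

	for i := int64(0); ; i++ {
		row, err := it.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		if i < rest.Start {
			continue // skipped: this row belongs to (or was handled by) another restriction
		}
		if !rt.TryClaim(i) {
			return nil // dynamic split: the residual restriction restarts the query and skips again
		}
		emit(row)
	}
}
```

   The deeper into the results a sub-restriction starts, the more rows get read just to be thrown away.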
   
   So, arguably the way to avoid it is to move getting the batch partitions into CreateInitialRestriction, which complicates the restriction into containing both an offset range and a slice of the transactions. Then the "element" that's getting split is just the query directly. SplitRestriction can then make smaller versions that contain only a single partition each, instead of splitting on the offset. And finally, if a split request happens at *that* level, we can then move to using offset ranges and the like against the partitioned queries.
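
   Very roughly, something like this (hypothetical shorthand: `partitionedRestriction`, `readBatchFn`, and the exact SDF method signatures are made up here, and I'm glossing over how the batch transaction ID gets carried along):

```go
package spannersketch

import (
	"context"
	"math"

	"cloud.google.com/go/spanner"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/rtrackers/offsetrange"
)

// partitionedRestriction is a hypothetical composite restriction: the partitions
// produced for the query, plus an offset range that only matters once a single
// partition has to be sub-split.
type partitionedRestriction struct {
	Partitions []*spanner.Partition    // from BatchReadOnlyTransaction.PartitionQuery
	Offsets    offsetrange.Restriction // row offsets within one partition
}

// readBatchFn stands in for the batch-read SDF.
type readBatchFn struct {
	client *spanner.Client
}

// CreateInitialRestriction does the partitioning up front, so the element being
// split is just the query itself.
func (fn *readBatchFn) CreateInitialRestriction(ctx context.Context, sql string) (partitionedRestriction, error) {
	txn, err := fn.client.BatchReadOnlyTransaction(ctx, spanner.StrongRead())
	if err != nil {
		return partitionedRestriction{}, err
	}
	parts, err := txn.PartitionQuery(ctx, spanner.Statement{SQL: sql}, spanner.PartitionOptions{})
	if err != nil {
		return partitionedRestriction{}, err
	}
	return partitionedRestriction{
		Partitions: parts,
		Offsets:    offsetrange.Restriction{Start: 0, End: math.MaxInt64},
	}, nil
}

// SplitRestriction hands out exactly one Spanner partition per restriction,
// matching the natural partitioning, so nothing is read twice by default. Only
// if the runner later asks to split one of *these* would we fall back to offset
// ranges within the partition, and pay the re-read cost there.
func (fn *readBatchFn) SplitRestriction(sql string, rest partitionedRestriction) []partitionedRestriction {
	splits := make([]partitionedRestriction, 0, len(rest.Partitions))
	for _, p := range rest.Partitions {
		splits = append(splits, partitionedRestriction{
			Partitions: []*spanner.Partition{p},
			Offsets:    rest.Offsets,
		})
	}
	return splits
}
```

   (One wrinkle: the restriction has to be encodable, so the partitions and the batch transaction ID would need to round trip through a coder.)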
   
   This avoids the re-reading problem by default, by having the initial splits be equal to the natural Spanner partitioning, and only allows re-reading when a sub-element split needs to happen.
   
   But TBH that's definitely something worth doing in another PR, where we can benchmark this solution first and then be able to compare it against any new approach.
   
   -------------
   
   At this point, I'm going to be merging this in once the PostCommit passes 
(which is the Dataflow test set). If the current run fails, I'll filter out the 
test accordingly first.

