lostluck commented on issue #32498: URL: https://github.com/apache/beam/issues/32498#issuecomment-2364528431
OK, this definitely works well for me, but I am also on Google's network, in Seattle. It certainly must be made to work smoothly for folks who *aren't* in my specific unlikely situation.

Adding a bit more debugging tells me the following:

* (~200ms) Time to list the files from the service. Since this transform doesn't split, it isn't affected by the current policy. The actual file opening and reading happen in a different bundle.
* (~200us) Time from Start bundle to reaching ProcessElement. Negligible.
* (~100ms) Time to actually open the file for reading.

The current default split policy for Prism is to only ask for progress and similar every ~100ms, and if there has been *any* progress, either by the channel counter or by downstream element emissions, then it *will not split*. This allows it to split when processing is slow (indicated by ~100-200ms where the counts have not moved).

Setting the progress ticker to ~10ms gives me behavior similar to the reports (which gives me the chance to find something that should work).

The split planning is so simple that it doesn't take into account other work that has previously been done, so it always waits only a fixed interval for work on a given stage. A more robust view would take into account work "globally" on the job, and only split if a stage is "straggling" or similar, but Prism shouldn't go that far at this time. And we don't want to slow down *all* stages just because one needs to be more conservative in how it splits.

I'm now trying out a "back off" for a given stage. If a split needs to happen, progress requests (and split decisions) are made less frequently for all new stages. If stages finish before any progress request is made, the rate is increased again. So this should even out to some "ideal" rate per stage. For this issue, a few "quick" splits should happen, and then the aggression is toned down enough for the work to complete properly.
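As a rough illustration of the split policy and the back off described above, here is a minimal Go sketch. It is not Prism's actual code: the `progress`, `shouldSplit`, and `backoff` names, the doubling and halving steps, and the interval bounds are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// progress is a hypothetical snapshot of a stage's counters at one tick.
type progress struct {
	channelCount int64 // elements consumed from the input channel
	outputCount  int64 // elements emitted downstream
}

// shouldSplit follows the policy described above: any movement in either
// counter since the previous tick means "do not split".
func shouldSplit(prev, cur progress) bool {
	return cur.channelCount == prev.channelCount &&
		cur.outputCount == prev.outputCount
}

// backoff holds the interval between progress requests.
// The multiplicative steps below are an assumption, not Prism's behavior.
type backoff struct {
	interval, min, max time.Duration
}

func newBackoff(min, max time.Duration) *backoff {
	return &backoff{interval: min, min: min, max: max}
}

// onSplit slows down future progress requests after a split was needed.
func (b *backoff) onSplit() {
	b.interval *= 2
	if b.interval > b.max {
		b.interval = b.max
	}
}

// onFastStage speeds requests up again when a stage finished before
// any progress request was made.
func (b *backoff) onFastStage() {
	b.interval /= 2
	if b.interval < b.min {
		b.interval = b.min
	}
}

func main() {
	b := newBackoff(10*time.Millisecond, 160*time.Millisecond)
	prev := progress{}
	// Simulated ticks: the counters stall while a file is being opened,
	// then start moving once elements flow.
	ticks := []progress{{0, 0}, {0, 0}, {5, 3}, {9, 7}}
	for _, cur := range ticks {
		if shouldSplit(prev, cur) {
			b.onSplit()
			fmt.Println("no progress: split, next check in", b.interval)
		} else {
			fmt.Println("progress seen: no split, next check in", b.interval)
		}
		prev = cur
	}
	b.onFastStage()
	fmt.Println("stage finished quickly, next check in", b.interval)
}
```

With a short starting interval, a slow file open looks like a stall and triggers a couple of early splits; each one lengthens the interval, so later stages get more time before the next split decision.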
This isn't likely to be the final approach to dynamic splitting decisions, since it would be best for that to also be tied to the rate of input to output and similar. Combining that with better initial splits of the data would probably solve most problems.