Hi Ismael,

Those are good points. Do you know if the Interactive Runner has been tried in those instances? If so, what were the shortcomings?
I can also see the use of sampling for performance benchmarking. We have seen others send in known elements that are tracked throughout the pipeline to generate timings for each transform/stage.

-Sam

On Fri, Dec 18, 2020 at 8:24 AM Ismaël Mejía <[email protected]> wrote:
> Hello,
>
> The use of the direct runner for interactive local use cases has
> increased over the years in Beam due to projects like Scio, Kettle/Hop
> and our own SQL CLI. All these tools have one thing in common: they
> show a sample of some source input to the user and interactively apply
> transforms to it to help users build pipelines more rapidly.
>
> If you build a pipeline today to produce this sample using Beam’s
> Sample transform over a set of files, the read of the files happens
> first and then the sample, so the more files there are, or the bigger
> they are, the longer it takes to produce the sample even if the number
> of elements expected to be read is constant.
>
> During Beam Summit last year there were some discussions about how we
> could improve this scenario (and others), but I have the impression no
> further discussion happened on the mailing list, so I wanted to know
> if there are ideas about how we can get the direct runner to improve
> this case.
>
> It seems to me that we can still ‘force’ the count with some static
> field, because it is not a distributed case, but I don’t know how we
> can stop reading once we have the number of sampled elements in a
> generic way, especially since it seems a bit harder to do with the
> pure DoFn (SDF) APIs than with the old Source ones, but that’s just a
> guess.
>
> Does anyone have an idea of how we could generalize this and, of
> course, if you see the value of such a use case, other ideas for
> improvements?
>
> Regards,
> Ismaël
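
For reference, a minimal sketch of the pipeline shape Ismaël describes, assuming the Java SDK with TextIO and the built-in Sample transform; the file pattern and sample size are placeholders. Because the sample sits downstream of the read, the direct runner still consumes the entire input before the 100 sampled elements appear:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.values.PCollection;

public class SampleFromFiles {
  public static void main(String[] args) {
    // The direct runner is used by default when no runner is configured.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read *all* matching files, then keep at most 100 arbitrary elements.
    // The read is unaware of the downstream Sample, so the whole input is
    // consumed even though only 100 elements are wanted.
    PCollection<String> sample =
        p.apply("ReadFiles", TextIO.read().from("/path/to/input/*.txt"))
         .apply("TakeSample", Sample.any(100));

    p.run().waitUntilFinish();
  }
}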

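On the ‘force the count with some static field’ idea, a rough sketch of what that could look like in the Java SDK, assuming single-JVM direct runner execution (the class name is illustrative). As the comments note, this only truncates the output; it does not stop the upstream read, which is the part that still needs a generic answer:

import java.util.concurrent.atomic.AtomicLong;
import org.apache.beam.sdk.transforms.DoFn;

// Emits at most `limit` elements across all bundles. The static counter
// only works because the direct runner keeps everything in one JVM; it
// does not stop the upstream read, which still consumes the whole input.
public class TakeUpToFn<T> extends DoFn<T, T> {
  private static final AtomicLong emitted = new AtomicLong();
  private final long limit;

  public TakeUpToFn(long limit) {
    this.limit = limit;
  }

  @ProcessElement
  public void processElement(@Element T element, OutputReceiver<T> out) {
    if (emitted.incrementAndGet() <= limit) {
      out.output(element);
    }
  }
}

It would be applied with ParDo.of(new TakeUpToFn<String>(100)) right after the read; getting the read itself to stop early is what seems to require runner or SDF support.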