Hi Ismael,

Those are good points. Do you know if the Interactive Runner has been tried
in those instances? If so, what were the shortcomings?

I can also see a use for sampling in performance benchmarking. We have
seen others inject known elements that are tracked through the pipeline
to generate timings for each transform/stage.
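
Roughly, I mean something like the following sketch (the marker prefix
and the stage label are hypothetical, not any particular implementation):

    // Hypothetical probe: elements carrying a known marker are logged with a
    // timestamp as they pass each stage, so per-stage timings can be derived
    // from the logs. Ordinary data flows through untouched.
    import org.apache.beam.sdk.transforms.DoFn;

    public class TimingProbeFn extends DoFn<String, String> {
      private final String stageName;

      public TimingProbeFn(String stageName) {
        this.stageName = stageName;
      }

      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<String> out) {
        if (element.startsWith("TRACER:")) {
          System.out.printf("%s saw %s at %d ms%n",
              stageName, element, System.currentTimeMillis());
        }
        out.output(element);
      }
    }

Dropping a ParDo with such a probe before and after each transform of
interest gives per-stage timings from the logs.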

-Sam

On Fri, Dec 18, 2020 at 8:24 AM Ismaël Mejía <[email protected]> wrote:

> Hello,
>
> The use of the direct runner for interactive local use cases has
> increased over the years in Beam due to projects like Scio, Kettle/Hop
> and our own SQL CLI. All these tools have one thing in common: they
> show the user a sample of some source input and interactively apply
> transforms to it to help users build pipelines more rapidly.
>
> If you build a pipeline today to produce this sample from a set of
> files using Beam’s Sample transform, the read of the files happens
> first and the sample only afterwards, so the more files there are, or
> the bigger they are, the longer it takes to produce the sample, even
> though the number of elements to be read is constant.
>
> During the Beam Summit last year there were some discussions about how
> we could improve this scenario (and others), but I have the impression
> no further discussion happened on the mailing list, so I wanted to
> know if there are any ideas about how we can get the direct runner to
> improve this case.
>
> It seems to me that we can still ‘force’ the count with some static
> field because it is not a distributed case, but I don’t know how we
> can stop reading once we have the required number of sampled elements
> in a generic way. Especially now, it seems to me a bit harder to do
> with the pure DoFn (SDF) APIs than with the old Source ones, but
> that’s just a guess.
>
> Does anyone have an idea of how we could generalize this? And of
> course, if you see value in such a use case, are there other ideas
> for improvements?
>
> Regards,
> Ismaël
>
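
For reference, a minimal sketch of the pipeline shape Ismaël describes
above (the file pattern and sample size are made up): even though only
100 elements are kept, the read is unaware of the downstream limit and
still scans every matched file.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Sample;
    import org.apache.beam.sdk.values.PCollection;

    public class SamplePreview {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // TextIO expands and reads the whole file pattern before Sample runs.
        PCollection<String> lines = p.apply(TextIO.read().from("/data/events/*.csv"));

        // Sample.any(100) keeps (any) 100 elements, but it cannot push that
        // limit back into the read above.
        PCollection<String> preview = lines.apply(Sample.<String>any(100));

        p.run().waitUntilFinish();
      }
    }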
