Hello,

The use of the direct runner for interactive local use cases has increased
over the years in Beam due to projects like Scio, Kettle/Hop and our
own SQL CLI. All these tools have one thing in common: they show a
sample of some source input to the user and interactively apply
transforms to it to help users build pipelines more rapidly.

If you build a pipeline today to produce this sample from a set of
files using Beam's Sample transform, the files are read in full first
and only then sampled, so the more files there are, or the bigger they
are, the longer it takes to produce the sample, even though the number
of elements we expect to read is constant.
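Just to make the pattern concrete, here is a minimal sketch of the kind
of pipeline I mean (the file pattern and sample size are placeholders,
not a real workload):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Sample;
    import org.apache.beam.sdk.values.PCollection;

    public class SamplePipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // All matching files are read in full...
        PCollection<String> lines =
            p.apply(TextIO.read().from("/path/to/input/*.txt"));

        // ...and only afterwards are 100 elements kept, so the read cost
        // grows with the input size even though the sample size is fixed.
        PCollection<Iterable<String>> sample =
            lines.apply(Sample.fixedSizeGlobally(100));

        p.run().waitUntilFinish();
      }
    }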

During Beam Summit last year there were some discussions about how we
could improve this scenario (and others), but I have the impression
that no further discussion happened on the mailing list, so I wanted
to ask whether there are ideas on how the direct runner could improve
this case.

It seems to me that we could still 'force' the count with some static
field, since this is not a distributed case, but I don't know how we
can stop reading once we have the number of sampled elements in a
generic way. This now seems a bit harder to do with the pure DoFn
(SDF) APIs than with the old Source ones, but that's just a guess.
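To illustrate the 'static field' part of the idea, something like the
rough sketch below could cap the count in a single-JVM direct runner
(the class and names are hypothetical). Note it only stops emitting
elements downstream; the upstream read still runs to completion, which
is exactly the part I don't see how to generalize:

    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.beam.sdk.transforms.DoFn;

    // Because the direct runner executes in one JVM, a static counter can
    // cap how many elements are forwarded to the sample.
    class TakeFirstN extends DoFn<String, String> {
      private static final AtomicLong emitted = new AtomicLong();
      private final long limit;

      TakeFirstN(long limit) {
        this.limit = limit;
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        if (emitted.incrementAndGet() <= limit) {
          c.output(c.element());
        }
        // Elements past the limit are dropped, but the source keeps reading.
      }
    }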

Does anyone have an idea of how we could generalize this, and, if you
see value in such a use case, other ideas for improvements?

Regards,
Ismaël