Hey!
Just so I understand this correctly then, what does the following quote
from [1], section 3.2.3 mean:
A PCollection is immutable. Once created, you cannot add, remove, or
change individual elements. A Beam Transform might process each element
of a PCollection and generate new pipeline data (as a new PCollection),
*but it does not consume or modify the original input collection.*
(Don't know what the normal way of highlighting is on mailing lists, so
I just put it between *)
I read this as meaning that it is the users responsibilty to make sure
that their transformations do not modify the input, but should I rather
read it as meaning the beam runner itself should make sure the user
cannot make such a mistake? I find this reading at odds with the
documentation about the direct runner and it's express purpose being to
make sure users doesn't rely on semantics the beam model doesn't ensure.
And modifying of input arguments being one of the constraints listed.
[2].
It doesn't change the outcome here, adding an opt out switch, but if
I've missunderstood the quote above, I think this might benefit by being
reworded, to make sure it is communicated that shooting yourself in the
foot is impossible and the direct runner testing of modifying input
should be removed, as there is no point in users making sure to not
modifying the input if all runners guarantee it.
Also, I ran the whole Flink test suite with a simple return instead of
the deep copy and all tests passed, so there is no such test in there.
Depending on the reading above, we should add such tests to all runners.
Best regards,
Teodor Spæren
On Thu, Oct 29, 2020 at 10:16:30AM +0100, Maximilian Michels wrote:
Ok then we are on the same page, but I disagree with your
conclusion. The reason Flink has to do the deep copy is that it
doesn't state that the inputs are immutable and should not be
changed, and so have to do the deep copy. In Beam, the user is not
supposed to modify the input collection and if they do, it's
undefined behavior. This is the reason the DirectRunner checks for
this, to make sure the users are not relying on it.
It's not written anywhere that the input cannot be mutated. A
DirectRunner test is not a proof. Any runner could add a test which
proves the opposite. In fact we may have one that checks copying for
Flink.
I prefer safety and correctness over performance because I've seen too
many cases where users shoot themselves in the foot. We should make
sure that, by default, the user cannot modify the input element. An
option to disable that is fine.
-Max