I think it would be best for the author of this commit to respond, as there may have been other reasons for this change.
On Mon, Jul 9, 2018 at 8:20 AM Vojtech Janota <vojta.jan...@gmail.com> wrote: > Hi Reuven, > > I'm not really complaining about DirectRunner. In fact it seems to me as > if what previously was considered as part of the "expensive extra checks" > done by the DirectRunner is now done within the beam-runners-core-java > library. Considering that all objects involved are immutable (in our case > at least) and simple assignment is sufficient, the > serialization-deserialization really seems as unwanted and hugely expensive > correctness check. If there was a problem with identity copy, wasn't > DirectRunner supposed to reveal it? > > Regards, > Vojta > > On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax <re...@google.com> wrote: > >> Hi Vojita, >> >> One problem is that the DirectRunner is designed for testing, not for >> performance. The DirectRunner currently does many purposely-inefficient >> things, the point of which is to better expose potential bugs in tests. For >> example, the DirectRunner will randomly shuffle the order of PCollections >> to ensure that your code does not rely on ordering. All of this adds cost, >> because the current runner is designed for testing. There have been >> requests in the past for an "optimized" local runner, however we don't >> currently have such a thing. >> >> In this case, using coders to clone values is more correct. In a >> distributed environment using encode/decode is the only way to copy values, >> and the DirectRunner is trying to ensure that your code is correct in a >> distributed environment. >> >> Reuven >> >> On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota <vojta.jan...@gmail.com> >> wrote: >> >>> Hi, >>> >>> We are using Apache Beam in our project for some time now. Since our >>> datasets are of modest size, we have so far used DirectRunner as the >>> computation easily fits onto a single machine. Recently we upgraded Beam >>> from 2.2 to 2.4 and found out that performance of our pipelines drastically >>> deteriorated. Pipelines that took ~3 minutes with 2.2 do not finish within >>> hours now. We tried to isolate the change that causes the slowdown and came >>> to the commits into the "InMemoryStateInternals" class: >>> >>> * https://github.com/apache/beam/commit/32a427c >>> * https://github.com/apache/beam/commit/8151d82 >>> >>> In a nutshell where previously the copy() method simply assigned: >>> >>> that.value = this.value >>> >>> There is now coder encode/decode combo hidden behind: >>> >>> that.value = uncheckedClone(coder, this.value) >>> >>> Can somebody explain the purpose of this change? Is it meant as an >>> additional "enforcement" point, similar to DirectRunner's >>> enforceImmutability and enforceEncodability? Or is it something that is >>> genuinely needed to provide correct behaviour of the pipeline? >>> >>> Any hints or thoughts are appreciated. >>> >>> Regards, >>> Vojta >>> >>> >>> >>> >>> >>> >