Hi Teodor, I actually forward your message to dev@ before, but I foolishly removed user@ from the distro so I think you weren't able to see it. Sorry about that!
+Lukasz Cwik <[email protected]> replied there [1]. I'll copy it here and we can keep discussing on this thread: The idea in Beam has always been to make objects passed between transforms to not be mutable and require users to either pass through their object or output a new object. Minimizing the places where Flink performs copying would be best but without further investigation wouldn't be able to give concrete suggestions. Brian [1] https://lists.apache.org/thread.html/r8c22c8b089f9caaac8efef90e62117a1db49af6471ff6bd7cbc5b882%40%3Cdev.beam.apache.org%3E On Mon, Aug 31, 2020 at 11:14 AM Teodor Spæren <[email protected]> wrote: > Hey! > > First time posting to a mailing list, hope I did it correctly :) > > I'm writing a master thesis at the University of Oslo and right now I'm > looking at the performance overhead of using Beam with the Flink runnner > versus plain Flink. > > I've written a simple program, a custom source outputing 0, 1, 2, 3, up > to N, going into a single identity operator and then int a filter which > only matches N and prints that out. This is just to compare performance. > > I've been doing some profiling of simple programs and one observation is > the performance difference in the serialization. The hotspot is [1], > which is used multiple places, but one place is [2], which is called > from [3]. As far as I can tell, [1] seems to be implementing copying by > first serializing and then deserializing and there are no way for the > actual types to change this. In flink, you have control over the copy() > method, like in [4] and so for certain types you can just do a simple > return as you do here. > > My queston is if I've understood the flow correctly so far and if so > what the reason for doing it this way. Is it to avoid demanding that the > type implement some type of cloning? And would it be possible to push > this downward in the stack and allow the encoders to do define the copy > schemantics? I'm willing to do the work here, just want to know if it > would work on an arcitectural level. > > If there is any known overheads of using beam that you would like to > point out, I would love to hear about it. > > Best regards, > Teodor Spæren > > [1]: > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L140 > [2]: > https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/runners/flink/1.8/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeSerializer.java#L85 > [3]: > https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeInformation.java#L85 > [4]: > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/LongSerializer.java#L53 >
