Re: Thoughts About Object Reuse and Collection Execution

Stephan Ewen Mon, 02 Mar 2015 08:23:02 -0800

@Ted  Here is a bit of background about how things are currently done in
the Flink runtime:


There are two execution modes for the runtime: "reuse" and "non-reuse".
  - The "non-reuse" mode will create new objects for every record received
from the network, or taken out of a sort-buffer or hash-table.
     It has the advantage that is some user code group operation
(groupReduce) materializes the group, it works simply (dedicated objects
for each element)

  - The "reuse" mode will have one or two objects that will be reused each
time a record is received from the network, or taken out of a sort-buffer
or hash-table.
    It behaves similar as the object reuse in Hadoop and saves in garbage
collection, but requires the user code to sometimes be aware of object
reuse

Flink has multiple runtime backends that can execute a program. Some do not
support both modes:

  - Flink's own runtime. Works in both "reuse" and "non-reuse" mode,
depending on what users select.

  - Java Collections (non-parallel, for testing and lightweight embedding).
Tolerates objects reuse in user code but does not attempt to reuse by
itself.

  - Tez (coming up) will in the long run support also both modes (reuse and
non-reuse

Where the internal algorithms (sorting/ hashing) do not affect any user
code behavior, they always work in "reusing" mode.

Greetings,
Stephan




On Sat, Feb 28, 2015 at 10:33 AM, Aljoscha Krettek <[email protected]>
wrote:

> @Stephan, they are not copied when object reuse is enabled. This might
> be a problem, though, so maybe we should just change it in that place.
>
> On Sat, Feb 28, 2015 at 7:57 AM, Ted Dunning <[email protected]>
> wrote:
> > This is going to have profound performance implications if this is the
> only
> > path for iteration.
> >
> >
> >
> > On Fri, Feb 27, 2015 at 10:58 PM, Stephan Ewen <[email protected]> wrote:
> >
> >> I vote to have the key extractor return a new value each time. That
> means
> >> that objects are not reused everywhere where it is possible, but still
> in
> >> most places, which still helps.
> >>
> >> What still puzzles me: I thought that the collection execution stores
> >> copies of the returned records by default (reuse safe mode).
> >> Am 27.02.2015 15:36 schrieb "Aljoscha Krettek" <[email protected]>:
> >>
> >> > Hello Nation of Flink,
> >> > while figuring out this bug:
> >> > https://issues.apache.org/jira/browse/FLINK-1569
> >> > I came upon some difficulties. The problem is that the
> >> > KeyExtractorMappers always
> >> > return the same tuple. This is problematic, since Collection Execution
> >> > does simply store the returned values in a list. These elements are
> >> > not copied before they are stored when object reuse is enabled.
> >> > Therefore, the whole list will contain only that one reused element.
> >> >
> >> > I see two options for solving this:
> >> > 1. Change KeyExtractorMappers to always return a new tuple, thereby
> >> > making object-reuse mode in cluster execution useless for key
> >> > extractors.
> >> >
> >> > 2. Change collection execution mapper to always make copies of the
> >> > returned elements. This would make object-reuse in collection
> >> > execution pretty much obsolete, IMHO.
> >> >
> >> > How should we proceed with this?
> >> >
> >> > Cheers,
> >> > Aljoscha
> >> >
> >>
>

Re: Thoughts About Object Reuse and Collection Execution

Reply via email to