[
https://issues.apache.org/jira/browse/FLINK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139151#comment-15139151
]
Greg Hogan commented on FLINK-3333:
-----------------------------------
Apache Flink programs can be written and configured to reduce the number of
object allocations for better performance. User defined functions (like map()
or groupReduce()) process many millions or billions of input and output values.
Enabling object reuse and processing mutable objects improves performance by
lowering demand on the CPU cache and Java garbage collector.
Object reuse is disabled by default, with user defined functions generally
getting new objects on each call (or through an iterator). In this case it is
safe to store references to the objects inside the function (for example, in a
List).
<'storing values in a list' example>
Apache Flink will chain functions to improve performance when sorting is
preserved and the parallelism unchanged. The chainable operators are Map,
FlatMap, Reduce on full DataSet, GroupCombine on a Grouped DataSet, or a
GroupReduce where the user supplied a RichGroupReduceFunction with a combine
method). Objects are passed without copying _even when object reuse is
disabled_.
In the chaining case, the functions in the chain are receiving the same object
instances. So the the second map() function is receiving the objects the first
map() is returning. This behavior can lead to errors when the first map()
function keeps a list of all objects and the second mapper is modifying
objects. In that case, the user has to manually create copies of the objects
before putting them into the list.
<chainable example>
<discussion of copyable values>
<copyablevalue example>
There is a switch at the ExecutionConfig which allows users to enable the
object reuse mode (enableObjectReuse()). For mutable types, Flink will reuse
object instances. In practice that means that a user function will always
receive the same object instance (with its fields set to new values). The
object reuse mode will lead to better performance because fewer objects are
created, but the user has to manually take care of what they are doing with the
object references.
<object reuse example>
> Documentation about object reuse should be improved
> ---------------------------------------------------
>
> Key: FLINK-3333
> URL: https://issues.apache.org/jira/browse/FLINK-3333
> Project: Flink
> Issue Type: Bug
> Components: Documentation
> Affects Versions: 1.0.0
> Reporter: Gabor Gevay
> Assignee: Gabor Gevay
> Priority: Blocker
> Fix For: 1.0.0
>
>
> The documentation about object reuse \[1\] has several problems, see \[2\].
> \[1\]
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#object-reuse-behavior
> \[2\]
> https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg/edit
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)