[ https://issues.apache.org/jira/browse/BEAM-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760074#comment-15760074 ]

Kenneth Knowles commented on BEAM-1164:
---------------------------------------

I think with the new {{DoFn}} there is a fairly elegant solution here. Today we write:

{code}
new DoFn<Foo, Baz>() {
  @ProcessElement
  public void processElem(ProcessContext ctx) {
    ... ctx.element() ...
  }
}
{code}

We'd like to allow the user to request only the element, both for clarity and for potential optimization, as in

{code}
new DoFn<Foo, Baz>() {
  @ProcessElement
  public void processElem(Element elem) {
    ... elem.get() ...
  }
}
{code}

where {{Element}} is a distinguished inner class, to avoid repeating verbose input types.

From here, it is a short step to saying that you want a mutable element:

{code}
new DoFn<Foo, Baz>() {
  @ProcessElement
  public void processElem(MutableElement elem) {
    ... elem.get().setBizzle(...) ...
  }
}
{code}

At the level of the "JSON" runner API, we will need to tag the user-defined function with the fact that it intends to mutate its input. The Java SDK will analyze the method signature, as usual, to discern this automatically. A runner will then be free to decide between disabling optimizations or cloning elements when necessary.

> Allow a DoFn to opt in to mutating its input
> --------------------------------------------
>
>                 Key: BEAM-1164
>                 URL: https://issues.apache.org/jira/browse/BEAM-1164
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>            Reporter: Frances Perry
>            Priority: Minor
>
> Runners generally can't tell if a DoFn is mutating inputs, but assuming so by
> default leads to significant performance implications from unnecessary
> copying (around sibling fusion, etc). So instead the model prevents mutating
> inputs, and the Direct Runner validates this behavior.
> (See: http://beam.incubator.apache.org/contribute/design-principles/#make-efficient-things-easy-rather-than-make-easy-things-efficient)
>
> However, if users are processing a small number of large records by making
> incremental changes (for example, genomics use cases), the cost of the
> immutability requirement can be very large. As a workaround, users sometimes
> do suboptimal things (fusing ParDos by hand) or undefined things when they
> expect the immutability requirement is unnecessarily strict (adding no-op
> coders in places they hope the runner won't be materializing things, mutating
> things anyway when they don't expect sibling fusion to happen, etc).
> We should consider adding a signal (MutatingDoFn?) that users explicitly opt
> in to, to say their code may mutate inputs. The runner can then use this
> assumption to either prevent optimizations that would break in the face of
> this or insert additional copies as needed to allow optimizations to preserve
> semantics.
> See this related user@ discussion:
> https://lists.apache.org/thread.html/f39689f54147117f3fc54c498eff1a20fa73f1be5b5cad5b6f816fd3@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
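
For illustration only, the signature analysis mentioned in the comment (the SDK inspecting the {{@ProcessElement}} method to discern whether the function intends to mutate its input) could be sketched roughly as below. This is not Beam's implementation: the annotation, the {{Element}} and {{MutableElement}} interfaces, and the {{mutatesInput}} helper are all hypothetical stand-ins defined locally so the sketch compiles without the SDK, and generic standalone interfaces are used here in place of the distinguished inner classes proposed above.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;

class MutationSignatureSketch {

  /** Hypothetical stand-in for the SDK's @ProcessElement annotation. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  @interface ProcessElement {}

  /** Hypothetical read-only handle on the input element. */
  interface Element<T> {
    T get();
  }

  /** Hypothetical marker type: requesting it declares the intent to mutate. */
  interface MutableElement<T> extends Element<T> {}

  /**
   * Returns true if the @ProcessElement method declared on fnClass takes a
   * MutableElement parameter, i.e. the function opts in to mutating its input.
   */
  static boolean mutatesInput(Class<?> fnClass) {
    for (Method m : fnClass.getDeclaredMethods()) {
      if (m.isAnnotationPresent(ProcessElement.class)) {
        return Arrays.asList(m.getParameterTypes()).contains(MutableElement.class);
      }
    }
    throw new IllegalArgumentException("no @ProcessElement method on " + fnClass);
  }

  /** Example function that only reads its input. */
  static class ReadOnlyFn {
    @ProcessElement
    public void processElem(Element<StringBuilder> elem) {
      elem.get().length();
    }
  }

  /** Example function that opts in to mutation through its signature. */
  static class MutatingFn {
    @ProcessElement
    public void processElem(MutableElement<StringBuilder> elem) {
      elem.get().append("!");
    }
  }

  public static void main(String[] args) {
    System.out.println(mutatesInput(ReadOnlyFn.class)); // prints "false"
    System.out.println(mutatesInput(MutatingFn.class)); // prints "true"
  }
}
```

A runner-facing flag derived this way would let the runner choose between disabling optimizations such as sibling fusion or cloning elements, as the comment describes; how that flag would be encoded in the runner API is left open here.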