[ 
https://issues.apache.org/jira/browse/BEAM-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856777#comment-16856777
 ] 

Yifan Mai commented on BEAM-529:
--------------------------------

Sorry, I haven't captured the proposal on JIRA yet.

The general idea is to have DoFnRunner hash each input element (or some sample of 
input elements) before and after the DoFn is run. If the hashes differ, then 
the input element was mutated and the pipeline should return an error.
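
A minimal sketch of that before/after check, for illustration only. The names run_with_mutation_check, MutationDetectedError and the fingerprint parameter are hypothetical and are not part of DoFnRunner or any Beam API; repr is used as a stand-in fingerprint in the example so that a list can be checked.

```python
class MutationDetectedError(Exception):
    """Hypothetical error type for this sketch; not a Beam class."""


def run_with_mutation_check(process, element, fingerprint=hash):
    """Call process(element) and fail if the element's fingerprint changed."""
    before = fingerprint(element)
    result = process(element)
    if fingerprint(element) != before:
        raise MutationDetectedError("input element was mutated: %r" % (element,))
    return result


def append_in_place(values):
    # Deliberately mutates its input, which a DoFn should not do.
    values.append(0)
    return values


# repr stands in for a content-based fingerprint here, because lists
# cannot be passed to the built-in hash().
try:
    run_with_mutation_check(append_in_place, [1, 2], fingerprint=repr)
    mutation_caught = False
except MutationDetectedError:
    mutation_caught = True
```

With the default fingerprint=hash, the same call would fail with a TypeError before the function is even run, which is the first of the problems described next.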

The problem is that Python's built-in hash() does not have the semantics we want. See 
https://docs.python.org/3/reference/datamodel.html#object.__hash__

# Not all objects are hashable. For instance mutable containers like lists are 
unhashable.
# User defined classes are hashable by default, but the default hash is derived 
from the identity (id) of the object, rather than its contents.
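
Both points can be reproduced in a few lines. Point and is_hashable are hypothetical names used only for this illustration.

```python
class Point:
    """User defined class that does not override __eq__ or __hash__."""

    def __init__(self, x):
        self.x = x


def is_hashable(obj):
    """Return True if the built-in hash() accepts obj."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# Mutable containers are rejected outright.
list_is_hashable = is_hashable([1, 2, 3])

# The default hash of a user defined object ignores its contents,
# so a mutation goes unnoticed.
p = Point(1)
hash_before = hash(p)
p.x = 2
hash_after = hash(p)
mutation_visible = hash_before != hash_after
```

Here list_is_hashable and mutation_visible both end up False: the list cannot be hashed at all, and the change to p.x does not change hash(p).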

I've tried some workarounds such as:

# Convert unhashable containers to immutable hashable containers before hashing 
them
# Traverse into the attributes (e.g. __dict__) of user defined classes and hash the elements
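
A rough sketch of how the two workarounds could be combined, not the actual code that was tried. The helpers freeze and content_hash are hypothetical, and the sketch assumes acyclic data and objects that keep their state in __dict__ (no __slots__, no reference cycles).

```python
def freeze(obj):
    """Recursively convert obj into an immutable, hashable structure."""
    if isinstance(obj, dict):
        return ("dict", frozenset((freeze(k), freeze(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return (type(obj).__name__, tuple(freeze(item) for item in obj))
    if isinstance(obj, (set, frozenset)):
        return ("set", frozenset(freeze(item) for item in obj))
    if hasattr(obj, "__dict__"):
        # User defined instance: descend into its attributes.
        return (type(obj).__qualname__, freeze(vars(obj)))
    # Anything else is assumed to be hashable by content already (int, str, ...).
    return obj


def content_hash(obj):
    """Hash obj by content rather than by identity."""
    return hash(freeze(obj))


class Point:
    def __init__(self, x):
        self.x = x


p = Point([1, 2])
frozen_before = freeze(p)
p.x.append(3)  # in-place mutation of a nested list
frozen_after = freeze(p)
```

Comparing frozen_before with frozen_after now exposes the nested mutation, which neither hash(p) nor hash(p.x) could do.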

Even so, there are user defined classes that still break under this scheme. For 
instance, pandas DataFrame has properties that, when read, modify a cache 
that is stored as an attribute. This scheme treats the cache modification as 
a mutation and reports a false positive.
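
The failure mode can be shown without pandas. LazyTable below is a hypothetical stand-in for any class with a lazily populated cache and does not reflect how DataFrame is implemented; attribute_fingerprint is likewise a simplified, assumed fingerprint over the instance attributes.

```python
class LazyTable:
    """Hypothetical class whose property fills a cache on first read."""

    def __init__(self, rows):
        self.rows = rows
        self._cache = {}

    @property
    def row_count(self):
        # Reading the property writes to the cache attribute.
        if "row_count" not in self._cache:
            self._cache["row_count"] = len(self.rows)
        return self._cache["row_count"]


def attribute_fingerprint(obj):
    """Fingerprint an object by the repr of its sorted instance attributes."""
    return repr(sorted(vars(obj).items()))


def read_only_process(table):
    # Logically read-only: the user never assigns to the table.
    return table.row_count


table = LazyTable([1, 2, 3])
before = attribute_fingerprint(table)
result = read_only_process(table)
after = attribute_fingerprint(table)
false_positive = before != after
```

The fingerprints differ although the logical content (rows) is untouched, so an attribute-based check would flag read_only_process as a mutation.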

As such, I haven't come up with an approach that is robust enough to cover all 
conceivable user code.

> Check immutability violations in DirectPipelineRunner
> -----------------------------------------------------
>
>                 Key: BEAM-529
>                 URL: https://issues.apache.org/jira/browse/BEAM-529
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-py-core
>            Reporter: Ahmet Altay
>            Priority: Minor
>              Labels: newbie, starter
>
> Users are going to mutate inputs and outputs of DoFn inappropriately. We 
> should help their tests fail to catch such mistakes. (Similar to the 
> DirectPipelineRunner in Java SDK)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
