Alexey Kudinkin created HUDI-3397:
-------------------------------------

             Summary: Make sure Spark RDDs triggering actual FS activity are 
only dereferenced once
                 Key: HUDI-3397
                 URL: https://issues.apache.org/jira/browse/HUDI-3397
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin


Currently, the RDD `collect()` operation is treated quite loosely: multiple flows 
dereference the same RDD (for example, via `collect`, `count`, etc.), causing the 
same operations to be carried out multiple times and occasionally duplicating 
output already persisted on the FS.
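The failure mode can be sketched in plain Python (illustrative only, no Spark; all names here are hypothetical): each dereference of a lazy computation re-runs its side effects, just as calling `collect()` and then `count()` on an uncached RDD re-executes the DAG each time.

```python
# Illustrative sketch (not Hudi code): a lazy pipeline with a side effect,
# mimicking an RDD whose computation writes files to the FS.
written = []

def write_files():
    # Side effect: "persist" records to the FS (simulated by a list append).
    for i in range(3):
        written.append(f"file-{i}")
        yield f"file-{i}"

# Each dereference re-runs the lazy computation, like `collect()` and
# `count()` on an uncached RDD each re-executing the underlying work.
files = list(write_files())            # first dereference: 3 writes
count = sum(1 for _ in write_files())  # second dereference: 3 more writes

print(len(written))  # 6 -- the output was duplicated
```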

Check out HUDI-3370 for a recent example.

NOTE: Even though Spark caching is supposed to ensure that we aren't writing to 
the FS multiple times, we can't rely solely on caching to guarantee exactly-once 
execution (cached partitions may be evicted under memory pressure and silently 
recomputed).

Instead, we should make sure that RDDs are dereferenced only {*}once{*}, within 
the "commit" operation, and that all other operations rely only on 
_derivative_ data.
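The proposed shape can be sketched the same way (again illustrative Python, not Hudi code): materialize the lazy computation exactly once in the "commit" step, then have every other consumer work off the already-materialized result.

```python
# Illustrative sketch (not Hudi code): dereference the lazy computation
# exactly once, then derive everything else from the materialized result.
written = []

def write_files():
    for i in range(3):
        written.append(f"file-{i}")  # simulated FS write (side effect)
        yield f"file-{i}"

# Single dereference, inside the "commit" operation...
committed = list(write_files())

# ...all other operations consume only the derivative, in-memory data.
count = len(committed)
names = [f.upper() for f in committed]

print(len(written))  # 3 -- the side effects ran exactly once
```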



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
