Alexey Kudinkin created HUDI-3397:
-------------------------------------
Summary: Make sure Spark RDDs triggering actual FS activity are
only dereferenced once
Key: HUDI-3397
URL: https://issues.apache.org/jira/browse/HUDI-3397
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Currently, the RDD `collect()` operation is treated quite loosely: multiple flows dereference RDDs (for example via `collect`, `count`, etc.), triggering the same operations to be carried out multiple times and occasionally duplicating output already persisted on the FS.
See HUDI-3370 for a recent example.
NOTE: Even though Spark caching is supposed to ensure that we aren't writing to the FS multiple times, we can't rely solely on caching to guarantee exactly-once execution.
Instead, we should make sure that RDDs are dereferenced only {*}once{*}, within the "commit" operation, and that all other operations rely only on _derivative_ data.
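The failure mode above can be illustrated with a minimal Python sketch (not Hudi or Spark code; the `LazyPipeline` class is a hypothetical stand-in for an uncached RDD, whose lineage re-executes on every dereference):

```python
class LazyPipeline:
    """Toy stand-in for a Spark RDD: each dereference re-runs the whole lineage."""
    def __init__(self, compute):
        self._compute = compute  # side-effecting computation, e.g. an FS write

    def collect(self):
        # Like RDD.collect() on an uncached RDD: re-executes the lineage each call.
        return self._compute()

writes = []

def write_and_return():
    writes.append("file")  # simulates the FS write performed by the job
    return ["record-1", "record-2"]

rdd = LazyPipeline(write_and_return)

# Anti-pattern: dereferencing twice re-runs the side effect -> duplicate output.
rdd.collect()
rdd.collect()
assert len(writes) == 2  # the FS write happened twice

# Proposed pattern: dereference once, inside "commit", then reuse derivative data.
writes.clear()
result = rdd.collect()      # single dereference
record_count = len(result)  # derivative data; no re-execution
assert len(writes) == 1 and record_count == 2
```

The same reasoning applies to `count`, `isEmpty`, and similar actions: each one is a full dereference, so metrics and validations should be computed from the already-collected result rather than from the RDD itself.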
--
This message was sent by Atlassian Jira
(v8.20.1#820001)