[
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332798#comment-17332798
]
Vinoth Chandar commented on HUDI-1138:
--------------------------------------
[~balajeeUber] We also block any running tasks from creating new files, once a
commit is about to be finalized. This will be a more general solution that is
very nice, since we will be using timeline server anyway.
cc [~nagarwal] [~nagarwal]
[~guoyihua] also expressed interested in doing this btw.
So lets please decide soon, who is going to take this :)
> Re-implement marker files via timeline server
> ---------------------------------------------
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Priority: Blocker
> Fix For: 0.9.0
>
>
> Even as you can argue that RFC-15/consolidated metadata, removes the need for
> deleting partial files written due to spark task failures/stage retries. It
> will still leave extra files inside the table (and users will pay for it
> every month) and we need the marker mechanism to be able to delete these
> partial files.
> Here we explore if we can improve the current marker file mechanism, that
> creates one marker file per data file written, by
> Delegating the createMarker() call to the driver/timeline server, and have it
> create marker metadata into a single file handle, that is flushed for
> durability guarantees
>
> P.S: I was tempted to think Spark listener mechanism can help us deal with
> failed tasks, but it has no guarantees. the writer job could die without
> deleting a partial file. i.e it can improve things, but cant provide
> guarantees
--
This message was sent by Atlassian Jira
(v8.3.4#803005)