[jira] [Updated] (HADOOP-18842) Support Overwrite Directory On Commit For S3A Committers

2023-08-25 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-18842:

Parent: HADOOP-18477
Issue Type: Sub-task  (was: New Feature)

> Support Overwrite Directory On Commit For S3A Committers
> 
>
> Key: HADOOP-18842
> URL: https://issues.apache.org/jira/browse/HADOOP-18842
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Affects Versions: 3.4.0
>Reporter: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
>
> The goal is to add a new kind of commit mechanism in which the destination 
> directory is cleared off before committing the file.
> *Use Case*
> In case of dynamicPartition insert overwrite queries, The destination 
> directory which needs to be overwritten are not known before the execution 
> and hence it becomes a challenge to clear off the destination directory.
>  
> One approach to handle this is, The underlying engines/client will clear off 
> all the destination directories before calling the commitJob operation but 
> the issue with this approach is that, In case of failures while committing 
> the files, We might end up with the whole of previous data being deleted 
> making the recovery process difficult or time consuming.
>  
> *Solution*
> Based on mode of commit operation either *INSERT* or *OVERWRITE* , During 
> commitJob operations, The committer will map each destination directory with 
> the commits which needs to be added in the directory and if the mode is 
> *OVERWRITE* , The committer will delete the directory recursively and then 
> commit each of the files in the directory. So in case of failures (worst 
> case) The number of destination directory which will be deleted will be equal 
> to the number of threads if we do it in multi-threaded way as compared to the 
> whole data if it was done in the engine side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-18842) Support Overwrite Directory On Commit For S3A Committers

2023-08-08 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-18842:

Affects Version/s: 3.4.0

> Support Overwrite Directory On Commit For S3A Committers
> 
>
> Key: HADOOP-18842
> URL: https://issues.apache.org/jira/browse/HADOOP-18842
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs/s3
>Affects Versions: 3.4.0
>Reporter: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
>
> The goal is to add a new kind of commit mechanism in which the destination 
> directory is cleared off before committing the file.
> *Use Case*
> In case of dynamicPartition insert overwrite queries, The destination 
> directory which needs to be overwritten are not known before the execution 
> and hence it becomes a challenge to clear off the destination directory.
>  
> One approach to handle this is, The underlying engines/client will clear off 
> all the destination directories before calling the commitJob operation but 
> the issue with this approach is that, In case of failures while committing 
> the files, We might end up with the whole of previous data being deleted 
> making the recovery process difficult or time consuming.
>  
> *Solution*
> Based on mode of commit operation either *INSERT* or *OVERWRITE* , During 
> commitJob operations, The committer will map each destination directory with 
> the commits which needs to be added in the directory and if the mode is 
> *OVERWRITE* , The committer will delete the directory recursively and then 
> commit each of the files in the directory. So in case of failures (worst 
> case) The number of destination directory which will be deleted will be equal 
> to the number of threads if we do it in multi-threaded way as compared to the 
> whole data if it was done in the engine side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-18842) Support Overwrite Directory On Commit For S3A Committers

2023-08-08 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-18842:

Component/s: fs/s3

> Support Overwrite Directory On Commit For S3A Committers
> 
>
> Key: HADOOP-18842
> URL: https://issues.apache.org/jira/browse/HADOOP-18842
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs/s3
>Reporter: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
>
> The goal is to add a new kind of commit mechanism in which the destination 
> directory is cleared off before committing the file.
> *Use Case*
> In case of dynamicPartition insert overwrite queries, The destination 
> directory which needs to be overwritten are not known before the execution 
> and hence it becomes a challenge to clear off the destination directory.
>  
> One approach to handle this is, The underlying engines/client will clear off 
> all the destination directories before calling the commitJob operation but 
> the issue with this approach is that, In case of failures while committing 
> the files, We might end up with the whole of previous data being deleted 
> making the recovery process difficult or time consuming.
>  
> *Solution*
> Based on mode of commit operation either *INSERT* or *OVERWRITE* , During 
> commitJob operations, The committer will map each destination directory with 
> the commits which needs to be added in the directory and if the mode is 
> *OVERWRITE* , The committer will delete the directory recursively and then 
> commit each of the files in the directory. So in case of failures (worst 
> case) The number of destination directory which will be deleted will be equal 
> to the number of threads if we do it in multi-threaded way as compared to the 
> whole data if it was done in the engine side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-18842) Support Overwrite Directory On Commit For S3A Committers

2023-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HADOOP-18842:

Labels: pull-request-available  (was: )

> Support Overwrite Directory On Commit For S3A Committers
> 
>
> Key: HADOOP-18842
> URL: https://issues.apache.org/jira/browse/HADOOP-18842
> Project: Hadoop Common
>  Issue Type: New Feature
>Reporter: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
>
> The goal is to add a new kind of commit mechanism in which the destination 
> directory is cleared off before committing the file.
> *Use Case*
> In case of dynamicPartition insert overwrite queries, The destination 
> directory which needs to be overwritten are not known before the execution 
> and hence it becomes a challenge to clear off the destination directory.
>  
> One approach to handle this is, The underlying engines/client will clear off 
> all the destination directories before calling the commitJob operation but 
> the issue with this approach is that, In case of failures while committing 
> the files, We might end up with the whole of previous data being deleted 
> making the recovery process difficult or time consuming.
>  
> *Solution*
> Based on mode of commit operation either *INSERT* or *OVERWRITE* , During 
> commitJob operations, The committer will map each destination directory with 
> the commits which needs to be added in the directory and if the mode is 
> *OVERWRITE* , The committer will delete the directory recursively and then 
> commit each of the files in the directory. So in case of failures (worst 
> case) The number of destination directory which will be deleted will be equal 
> to the number of threads if we do it in multi-threaded way as compared to the 
> whole data if it was done in the engine side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org