prashant462 opened a new issue, #9339: URL: https://github.com/apache/seatunnel/issues/9339
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

When using the S3 file sink in SeaTunnel, data is first written to a temporary directory (e.g., `/tmp/seatunnel` in the S3 bucket) and, during the commit phase, every file is moved to its final target location. This move/rename operation is performed on the driver node and can take hours to complete when there are a large number of files, significantly increasing the overall job completion time.

Currently, there is no configuration option to bypass this two-phase commit and write files directly to the target location; the move/rename step is hardcoded as part of the commit logic. Changing the S3 committers in the S3File sink properties (`hadoop_s3_properties`) has no impact on this behaviour.

Questions:
- Is it possible to write directly to the target location, bypassing the temp directory and the move phase?
- Are there any recommended workarounds or best practices to mitigate the performance impact of this move operation for large-scale jobs?
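For reference, the committer override that was attempted looks roughly like the sketch below. The property names are the standard Hadoop S3A committer settings; the exact keys and values used in the original job are not shown in this report, so treat this as an illustrative assumption:

```conf
sink {
  S3File {
    # ... bucket, path, credentials, file_format_type as in the config below ...
    hadoop_s3_properties {
      # Standard Hadoop S3A committer switches. Per this report, setting
      # these has no effect on SeaTunnel's own temp-dir + rename commit phase,
      # because that phase is implemented in the connector, not in S3A.
      "fs.s3a.committer.name" = "magic"
      "fs.s3a.committer.magic.enabled" = "true"
    }
  }
}
```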
### SeaTunnel Version

2.3.10

### SeaTunnel Config

```conf
env {
  # common parameters
  job.mode = "BATCH"
  parallelism = 4
  # Spark-specific parameters
  spark.app.name = "example"
  spark.sql.catalogImplementation = "hive"
  spark.executor.memory = "40g"
  spark.executor.memoryOverhead = "20g"
  spark.executor.cores = 5
  spark.driver.memory = "10g"
  spark.executor.instances = "10"
  spark.yarn.priority = "100"
  spark.yarn.principal = "EXAMPLE@PRINCIPLE"
  spark.yarn.keytab = "sample.keytab"
  spark.hadoop.fs.defaultFS = "hdfs://path"
  spark.dynamicAllocation.enabled = "false"
}

source {
  Jdbc {
    url = "jdbc:url"
    driver = "oracle.jdbc.OracleDriver"
    user = "user"
    password = "Password"
    query = "SELECT * FROM table"
    result_table_name = "t1"
  }
}

sink {
  S3File {
    bucket = "s3a://test-bucket"
    path = "/s3_path"
    fs.s3a.endpoint = "http://host:port"
    fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
    file_format_type = "parquet"
    access_key = "********"
    secret_key = "******************"
    data_save_mode = "DROP_DATA"
    source_table_name = "t1"
  }
}
```

### Running Command

```shell
./bin/start-seatunnel-spark-3-connector-v2.sh \
  --master yarn --deploy-mode cluster --config ./config/oracle_test.conf
```

### Error Exception

```log
hadoop.HadoopFileSystemProxy: [Driver]: rename file :[/tmp/seatunnel/seatunnel/8e86853037674835bfabb64e3bae1b91/6b760a582e/T_8e86853037674835bfabb64e3bae1b91_6b760a582e_2_1/NON_PARTITION/T_8e86853037674835bfabb64e3bae1b91_6b760a582e_2_1_0.parquet] to [/datalake_staging/seatunnel_test/oracle_data_prosaachine_non_atomic/T_8e86853037674835bfabb64e3bae1b91_6b760a582e_2_1_0.parquet] finish
```

### Zeta or Flink or Spark Version

Spark version: 3.3.2

### Java or Scala Version

Java: 1.8

### Screenshots

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
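To see why the commit phase can stretch to hours, here is a rough back-of-envelope sketch. It assumes the driver renames files sequentially (as the per-file log lines above suggest) and uses a hypothetical per-rename latency; neither number comes from the original report:

```python
def estimated_commit_seconds(num_files: int, seconds_per_rename: float) -> float:
    """Estimate total commit time for sequential driver-side renames.

    On S3, a 'rename' is really a server-side copy plus delete, so each
    call carries at least one network round trip. The latency value is
    an assumption for illustration, not a measured figure.
    """
    return num_files * seconds_per_rename

# e.g. 100,000 output files at an assumed 0.2 s per rename:
hours = estimated_commit_seconds(100_000, 0.2) / 3600
print(f"{hours:.1f} hours")  # about 5.6 hours
```

The point of the sketch is that the cost grows linearly with file count and is paid entirely on the driver, which matches the observed behaviour for large jobs.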
