prashant462 opened a new issue, #9339: URL: https://github.com/apache/seatunnel/issues/9339
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

When using the S3 file sink in SeaTunnel, data is first written to a temporary directory (e.g., `/tmp/seatunnel` in the S3 bucket) and, during the commit phase, every file is moved to its final target location. This move/rename operation is performed on the driver node and can take hours to complete when there are a large number of files, significantly increasing the overall job completion time.

Currently, there is no configuration option to bypass this two-phase commit and write files directly to the target location; the move/rename step is hardcoded as part of the commit logic. Changing the S3 committers in the S3File sink properties (`hadoop_s3_properties`) has no impact on this behaviour.

Questions:
- Is it possible to write directly to the target location, bypassing the temp directory and the move phase?
- Are there any recommended workarounds or best practices to mitigate the performance impact of this move operation for large-scale jobs?
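For reference, the committer override that was attempted looks roughly like the sketch below. The property names are the standard Hadoop S3A committer settings; the exact keys and values used in the original job are not shown in this report, so treat this as an illustrative assumption:

```conf
sink {
  S3File {
    # ... bucket, path, credentials, file_format_type as in the config below ...
    hadoop_s3_properties {
      # Standard Hadoop S3A committer switches. Per this report, setting
      # these has no effect on SeaTunnel's own temp-dir + rename commit phase,
      # because that phase is implemented in the connector, not in S3A.
      "fs.s3a.committer.name" = "magic"
      "fs.s3a.committer.magic.enabled" = "true"
    }
  }
}
```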
### SeaTunnel Version

2.3.10

### SeaTunnel Config

```conf
env {
  # common parameters
  job.mode = "BATCH"
  parallelism = 4
  # Spark-specific parameters
  spark.app.name = "example"
  spark.sql.catalogImplementation = "hive"
  spark.executor.memory = "40g"
  spark.executor.memoryOverhead = "20g"
  spark.executor.cores = 5
  spark.driver.memory = "10g"
  spark.executor.instances = "10"
  spark.yarn.priority = "100"
  spark.yarn.principal = "EXAMPLE@PRINCIPLE"
  spark.yarn.keytab = "sample.keytab"
  spark.hadoop.fs.defaultFS = "hdfs://path"
  spark.dynamicAllocation.enabled = "false"
}

source {
  Jdbc {
    url = "jdbc:url"
    driver = "oracle.jdbc.OracleDriver"
    user = "user"
    password = "Password"
    query = "SELECT * FROM table"
    result_table_name = "t1"
  }
}

sink {
  S3File {
    bucket = "s3a://test-bucket"
    path = "/s3_path"
    fs.s3a.endpoint = "http://host:port"
    fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
    file_format_type = "parquet"
    access_key = "********"
    secret_key = "******************"
    data_save_mode = "DROP_DATA"
    source_table_name = "t1"
  }
}
```

### Running Command

```shell
./bin/start-seatunnel-spark-3-connector-v2.sh \
  --master yarn --deploy-mode cluster --config ./config/oracle_test.conf
```

### Error Exception

```log
hadoop.HadoopFileSystemProxy: [Driver]: rename file :[/tmp/seatunnel/seatunnel/8e86853037674835bfabb64e3bae1b91/6b760a582e/T_8e86853037674835bfabb64e3bae1b91_6b760a582e_2_1/NON_PARTITION/T_8e86853037674835bfabb64e3bae1b91_6b760a582e_2_1_0.parquet] to [/datalake_staging/seatunnel_test/oracle_data_prosaachine_non_atomic/T_8e86853037674835bfabb64e3bae1b91_6b760a582e_2_1_0.parquet] finish
```

### Zeta or Flink or Spark Version

Spark version: 3.3.2

### Java or Scala Version

Java: 1.8

### Screenshots

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
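To see why the commit phase can stretch to hours, here is a rough back-of-envelope sketch. It assumes the driver renames files sequentially (as the per-file log lines above suggest) and uses a hypothetical per-rename latency; neither number comes from the original report:

```python
def estimated_commit_seconds(num_files: int, seconds_per_rename: float) -> float:
    """Estimate total commit time for sequential driver-side renames.

    On S3, a 'rename' is really a server-side copy plus delete, so each
    call carries at least one network round trip. The latency value is
    an assumption for illustration, not a measured figure.
    """
    return num_files * seconds_per_rename

# e.g. 100,000 output files at an assumed 0.2 s per rename:
hours = estimated_commit_seconds(100_000, 0.2) / 3600
print(f"{hours:.1f} hours")  # about 5.6 hours
```

The point of the sketch is that the cost grows linearly with file count and is paid entirely on the driver, which matches the observed behaviour for large jobs.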
