jieguangzhou opened a new issue, #10738: URL: https://github.com/apache/dolphinscheduler/issues/10738
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar feature requirement.

### Description

DolphinScheduler allows parameter transfer between tasks: https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/guide/parameter/context.html

But it does not allow file transfer between tasks. For example, I have two Python scripts that do some analysis work. The second script processes the data produced by the first script, so I have to pass a path variable as a parameter. Parameter passing will not work as expected if the two tasks do not run on the same worker, because the path is then not valid on the second worker.

I think if DolphinScheduler supported this feature, it would be a handy boost for scenarios such as data analysis and machine learning.

### Use case

If the user has enabled the resource center, I think we can use it as a file transfer store. For example, in the task plugin, we can agree on a new path specification:

1. Use `$from_remote(remote_path, local_path)` to download a file from `remote_path` to `local_path` before the task starts.
2. Use `$to_remote(remote_path, local_path)` to upload a file from `local_path` to `remote_path` after the task finishes.

This proposal was inspired by [AWS SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html):

```python
base_uri = f"s3://{default_bucket}/abalone"
input_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=local_path,
    desired_s3_uri=base_uri,
)
input_data = ParameterString(
    name="InputData",
    default_value=input_data_uri,
)

# This is the path to use directly
ProcessingInput(source=input_data, destination="/opt/ml/processing/input")
```

The above is the SageMaker example. If DolphinScheduler supported this feature, it would be even easier to use. For example:

```shell
# Process data and save the output to the local path output/demo.csv,
# then upload it to bucket1/demo.csv in the resource center after the task is done.
python process_data.py --output=$to_remote('bucket1/demo.csv', 'output/demo.csv')
```

```shell
# Download data from "bucket1/demo.csv" in the resource center and save it to the local path "data/demo.csv",
# and then the following command actually executes:
# python analysis.py --input=data/demo.csv
python analysis.py --input=$from_remote('bucket1/demo.csv', 'data/demo.csv')
```

A minimal sketch of how a worker could resolve these markers is appended at the end of this issue.

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
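---

For illustration only, here is a minimal sketch (not an implementation proposal) of how a worker could rewrite these markers around a task run: download `from_remote` inputs before the command executes, substitute each marker with its local path, then upload `to_remote` outputs after the command succeeds. The `download`/`upload` helpers below are hypothetical stand-ins for whatever client the resource center exposes.

```python
import re
import subprocess

# Hypothetical stand-ins for the resource center client (HDFS/S3/...).
def download(remote_path: str, local_path: str) -> None:
    print(f"[sketch] download {remote_path} -> {local_path}")

def upload(local_path: str, remote_path: str) -> None:
    print(f"[sketch] upload {local_path} -> {remote_path}")

# Matches $from_remote('remote', 'local') and $to_remote('remote', 'local').
MARKER = re.compile(r"\$(from_remote|to_remote)\('([^']+)',\s*'([^']+)'\)")

def run_task(command: str) -> None:
    uploads = []  # (local_path, remote_path) pairs, deferred until the task is done

    def resolve(match: re.Match) -> str:
        kind, remote_path, local_path = match.groups()
        if kind == "from_remote":
            download(remote_path, local_path)          # fetch input before the task starts
        else:
            uploads.append((local_path, remote_path))  # push output after the task finishes
        return local_path  # the task command itself only ever sees the local path

    resolved = MARKER.sub(resolve, command)
    subprocess.run(resolved, shell=True, check=True)

    for local_path, remote_path in uploads:
        upload(local_path, remote_path)

# Resolves to: python analysis.py --input=data/demo.csv
run_task("python analysis.py --input=$from_remote('bucket1/demo.csv', 'data/demo.csv')")
```

One design point worth noting: deferring uploads until after the command exits successfully means a failed task never publishes partial output to the resource center.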
