jieguangzhou opened a new issue, #10738:
URL: https://github.com/apache/dolphinscheduler/issues/10738

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar feature requirement.
   
   
   ### Description
   
   DolphinScheduler allows parameter transfer between tasks: 
https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/guide/parameter/context.html
   
   But it does not allow file transfer between tasks. For example, I have two Python scripts that do some analysis work. The second script processes data produced by the first, so I have to pass a path variable as a parameter.
   
   Parameter passing will not work as expected if the two tasks do not run on the same worker, because the local path from the first task does not exist on the second task's worker.
   
   I think if DolphinScheduler supported this feature, it would be a handy boost for scenarios such as data analysis and machine learning.
   
   ### Use case
   
   
   I think we can use the resource center as a file transfer store if the user has enabled it. For example, in the task plugin we could agree on a new path specification: 
   1. use `$from_remote(remote_path, local_path)` to download a file from `remote_path` to `local_path` before the task starts.
   2. use `$to_remote(remote_path, local_path)` to upload a file from `local_path` to `remote_path` after the task finishes.
    
   This proposal was inspired by [AWS SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html):
   
   ```python
   base_uri = f"s3://{default_bucket}/abalone"
   input_data_uri = sagemaker.s3.S3Uploader.upload(
       local_path=local_path, 
       desired_s3_uri=base_uri,
   )
   input_data = ParameterString(
       name="InputData",
       default_value=input_data_uri,
   )
   
   # This is the path to use directly
   ProcessingInput(source=input_data, destination="/opt/ml/processing/input")
   ```
   
   The above is an example from SageMaker. If DolphinScheduler supported this feature, it would be even easier to use. For example:
   ```shell
   # Process data, save the output to the local path output/demo.csv,
   # and upload it to bucket1/demo.csv in the resource center after the task is done.
   python process_data.py --output=$to_remote('bucket1/demo.csv', 'output/demo.csv')
   ```
   
   ```shell
   # Download data from bucket1/demo.csv in the resource center and
   # save it to the local path data/demo.csv,
   # and then the following command actually executes:
   # python analysis.py --input=data/demo.csv
   python analysis.py --input=$from_remote('bucket1/demo.csv', 'data/demo.csv')
   ```
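   Given the parsed transfer lists, the worker-side flow could then be: download the declared inputs, run the rewritten command, and upload the declared outputs on success. This is only a sketch; the `client` object stands in for whatever resource-center API the plugin would actually use, and its `download`/`upload` methods are hypothetical:

   ```python
   import subprocess

   def run_with_transfers(actual_command, downloads, uploads, client):
       """Run a task command with resource-center file transfers around it.

       `client` is a hypothetical resource-center client exposing
       download(remote_path, local_path) and upload(local_path, remote_path).
       """
       # 1. Fetch inputs from the resource center before the task starts.
       for remote_path, local_path in downloads:
           client.download(remote_path, local_path)

       # 2. Run the rewritten command, which now refers only to local paths.
       result = subprocess.run(actual_command, shell=True)

       # 3. On success, push outputs back so a downstream task on any
       #    worker can fetch them from the resource center.
       if result.returncode == 0:
           for remote_path, local_path in uploads:
               client.upload(local_path, remote_path)
       return result.returncode
   ```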
   
   
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

