kyehe opened a new issue, #3618:
URL: https://github.com/apache/incubator-seatunnel/issues/3618

   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   Branch: dev
   version: 2.1.3
   spark batch job on yarn
   
   When I run a SeaTunnel job in Spark mode, I found that the Spark job has two stages: one is the source reader stage, the other is the sink writer stage. According to the logs, the second stage will not start before the source reader stage finishes.
   
   So, for a batch job, especially one that reads a large amount of data from a source such as a big MySQL table, it takes a long time to read all the data over a JDBC connection, while the sink writer could start writing as soon as the source reader has read a single row. A pipelined (parallel) mode would speed up the whole job considerably.
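   The pipelined behaviour requested above can be illustrated with a toy producer/consumer sketch (plain Python, not SeaTunnel code): the writer consumes rows as soon as the reader produces them, instead of waiting for the whole read stage to finish.
   
   ```python
   import queue
   import threading
   
   def pipelined_copy(records, buffer_size=100):
       """Toy sketch of source/sink pipelining: the writer thread starts
       consuming rows as soon as the reader thread produces them, rather
       than only after the full read stage completes."""
       q = queue.Queue(maxsize=buffer_size)
       written = []
   
       def reader():
           for r in records:          # simulates the source reader stage
               q.put(r)
           q.put(None)                # sentinel: end of input
   
       def writer():
           while True:                # simulates the sink writer stage
               r = q.get()
               if r is None:
                   break
               written.append(r)
   
       threads = [threading.Thread(target=reader),
                  threading.Thread(target=writer)]
       for t in threads:
           t.start()
       for t in threads:
           t.join()
       return written
   
   print(pipelined_copy(range(5)))  # -> [0, 1, 2, 3, 4]
   ```
   
   With a bounded buffer between the two stages, total wall-clock time approaches max(read time, write time) instead of their sum.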
   
   The picture below shows a test job that reads a big (5+ billion rows) MySQL table, which has no primary key, and writes into a Hive table.
   
   The source parallelism is the default value 1, and the sink parallelism is configured to 30.
   
   
![image](https://user-images.githubusercontent.com/31163620/204819248-a48296dc-d586-40fb-8c65-d8bab307cd52.png)
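   
   For reference, a config along these lines would look roughly like the sketch below. This is an illustration, not the actual job config: the connection details and table names are hypothetical, and the `parallelism` option name is an assumption based on the v2 connector common options.
   
   ```hocon
   env {
     job.mode = "BATCH"
   }
   
   source {
     Jdbc {
       url = "jdbc:mysql://mysql-host:3306/db"   # hypothetical connection
       driver = "com.mysql.cj.jdbc.Driver"
       query = "select * from big_table"
       # The table has no primary key, so no partition column is available
       # to raise the source parallelism above the default of 1.
     }
   }
   
   sink {
     Hive {
       table_name = "db.big_table_copy"          # hypothetical target table
       parallelism = 30                          # sink parallelism from the report
     }
   }
   ```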
   
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

