[GitHub] [incubator-seatunnel] CheneyYin opened a new issue, #4502: [Improve][Core/Spark-Starter] Push transform operation from Spark Driver to Executors

via GitHub Wed, 05 Apr 2023 03:01:12 -0700


CheneyYin opened a new issue, #4502:
URL: https://github.com/apache/incubator-seatunnel/issues/4502


   ### Search before asking
   
   - [X] I have searched the [feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) issues and found no similar feature requirement.
   
   
   ### Description
   
   # Present situation
   In `org.apache.seatunnel.core.starter.spark.execution.TransformExecuteProcessor#sparkTransform`, all data held on the executors is transferred to the Spark driver, because `Dataset<Row>.toLocalIterator` is invoked. Every row is then added to a list held entirely in driver memory.
   This implementation of `sparkTransform` has the following negative effects:
   - It moves the whole dataset over the network to the driver unnecessarily.
   - It is prone to OOM failures on the Spark driver.
   - It causes parallel computing to degenerate into serial computing.
   
   # Improvement plan
   - Replace `Dataset<Row>.toLocalIterator` with `Dataset<Row>.mapPartitions`.
   - Implement an `Iterator` that computes lazily and never loads all data into memory.
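
The lazy `Iterator` in the plan above could look roughly like the following. This is a minimal sketch, not the actual SeaTunnel code: the class name `LazyTransformIterator` and the convention that a `null` result means "drop this row" are assumptions made for illustration. It uses only the JDK so that it runs without Spark on the classpath.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

/**
 * Hypothetical helper (name is an assumption): wraps a source iterator and
 * applies a transform to one element at a time, only when the consumer asks
 * for it. At most one transformed element is buffered at any moment.
 */
final class LazyTransformIterator<T, R> implements Iterator<R> {
    private final Iterator<T> source;
    private final Function<? super T, ? extends R> transform;
    private R buffered;     // the single look-ahead element
    private boolean ready;  // true when 'buffered' holds a valid element

    LazyTransformIterator(Iterator<T> source, Function<? super T, ? extends R> transform) {
        this.source = source;
        this.transform = transform;
    }

    @Override
    public boolean hasNext() {
        // Pull from the source until one element survives the transform.
        while (!ready && source.hasNext()) {
            R candidate = transform.apply(source.next());
            if (candidate != null) { // assumption: null means the row is dropped
                buffered = candidate;
                ready = true;
            }
        }
        return ready;
    }

    @Override
    public R next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        R result = buffered;
        buffered = null;
        ready = false;
        return result;
    }
}
```

In Spark, an iterator of this kind would presumably be returned from the `call(Iterator<Row>)` method of a `MapPartitionsFunction` passed to `Dataset<Row>.mapPartitions`, so each executor transforms its own partition as a stream. The transform would also need to be serializable for Spark to ship it to executors, which this sketch does not address.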
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


