CheneyYin opened a new issue, #4502: URL: https://github.com/apache/incubator-seatunnel/issues/4502
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

# Present situation

In `org.apache.seatunnel.core.starter.spark.execution.TransformExecuteProcessor#sparkTransform`, all data held by the executors is transferred to the Spark driver, because `Dataset<Row>.toLocalIterator` is invoked. Every row is then added to an in-memory list.

This implementation of `sparkTransform` has the following negative effects:

- It transfers data over the network unnecessarily.
- It is prone to OOM failures on the Spark driver.
- It causes parallel computation to degenerate into serial computation.

# Improvement plan

- Replace `Dataset<Row>.toLocalIterator` with `Dataset<Row>.mapPartitions`.
- Implement an `Iterator` that computes lazily and never loads all data into memory.

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
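The lazy `Iterator` from the improvement plan could look roughly like the sketch below. It is an illustration only, not the code of the eventual PR: the class name `LazyTransformIterator` and the convention that a `null` result drops a row are assumptions. The sketch uses only the JDK so it can run without Spark. In a Spark job, an iterator like this would presumably be returned from the `call(Iterator<T>)` method of a `MapPartitionsFunction` passed to `Dataset.mapPartitions`, so each executor processes its own partition and nothing is collected on the driver.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of SeaTunnel): applies a transform to each
 * element of a source iterator on demand, buffering at most one result.
 */
final class LazyTransformIterator<I, O> implements Iterator<O> {
    private final Iterator<I> source;
    private final Function<I, O> transform; // assumption: null means "drop this row"
    private O next;                         // the single buffered result, or null

    LazyTransformIterator(Iterator<I> source, Function<I, O> transform) {
        this.source = source;
        this.transform = transform;
    }

    @Override
    public boolean hasNext() {
        // Pull from the source only until one non-null result is available.
        while (next == null && source.hasNext()) {
            next = transform.apply(source.next());
        }
        return next != null;
    }

    @Override
    public O next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        O result = next;
        next = null; // release the buffered element so it can be garbage collected
        return result;
    }

    public static void main(String[] args) {
        Iterator<Integer> rows = Arrays.asList(1, 2, 3, 4).iterator();
        // Keep even values (multiplied by 10) and drop odd ones.
        Iterator<Integer> out =
                new LazyTransformIterator<Integer, Integer>(rows, x -> x % 2 == 0 ? x * 10 : null);
        while (out.hasNext()) {
            System.out.println(out.next()); // prints 20, then 40
        }
    }
}
```

Because the wrapper holds at most one transformed element at a time, memory use stays constant regardless of partition size, which is the property the current list-based implementation lacks.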
