pan3793 opened a new issue, #3863: URL: https://github.com/apache/incubator-kyuubi/issues/3863
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/incubator-kyuubi/issues) and found no similar issues.

### Describe the proposal

Currently, Kyuubi supports the Thrift-based HS2 protocol, whose result transmission is not efficient enough. For the Spark engine, the main pain points are:

- The driver is under high memory pressure because it must collect the RDD as InternalRow, convert it to Row, and then convert it to TRow (row-based or column-based, depending on the client protocol) before sending it back to the Kyuubi Server. This typically consumes several times as much memory as the same data stored in a Parquet file.
- The data conversion happens on the driver side and consumes a lot of CPU time as well.
- The protocol does not support compression, which is quite helpful in network-bandwidth-limited scenarios.

Apache Arrow is a columnar format that is more efficient for data transmission. PySpark has adopted it as the data serialization format between the JVM and the Python process, and the ongoing Spark Connect work will adopt it too.

Kyuubi can support fetching data in Arrow format to improve result transmission efficiency. The core ideas are:

- Convert data on the executor side before collecting it to the driver.
- The driver collects Arrow results and sends them back to the server directly.
- Encode Arrow data as Thrift binary data, and set a flag to indicate that the client should decode the result in Arrow format.
- Update the client to support decoding the Arrow format.

### Task list

- https://github.com/apache/incubator-kyuubi/issues/3794

### Are you willing to submit PR?

- [ ] Yes. I can submit a PR independently to improve.
- [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
- [ ] No. I cannot submit a PR at this time.
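### Illustrative sketch of the data flow

The core ideas listed in the proposal can be sketched end to end as a rough illustration. This is not Kyuubi code and does not use Arrow: to stay dependency-free it substitutes a toy columnar encoding (JSON columns compressed with zlib) for Arrow batches, and a plain dict for the Thrift result. All function names, the `FORMAT_COLUMNAR` flag value, and the length-prefixed framing are assumptions made up for this example.

```python
import json
import struct
import zlib

# Hypothetical flag value; the proposal does not name the real protocol field.
FORMAT_COLUMNAR = "columnar-binary"


def encode_partition(rows, schema):
    """Executor side: turn one partition's rows into a compressed columnar batch."""
    columns = {name: [row[i] for row in rows] for i, name in enumerate(schema)}
    # Toy stand-in for an Arrow record batch: JSON columns, zlib-compressed.
    return zlib.compress(json.dumps(columns).encode("utf-8"))


def collect_on_driver(batches):
    """Driver side: frame each opaque batch with a 4-byte length, no row conversion."""
    return b"".join(struct.pack(">I", len(batch)) + batch for batch in batches)


def build_response(blob):
    """Wrap the bytes in a result envelope carrying a format flag for the client."""
    return {"format": FORMAT_COLUMNAR, "binary": blob}


def decode_on_client(response, schema):
    """Client side: check the flag, then decode every batch back into row tuples."""
    if response["format"] != FORMAT_COLUMNAR:
        raise ValueError("unsupported result format: " + response["format"])
    blob, offset, rows = response["binary"], 0, []
    while offset < len(blob):
        (size,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        columns = json.loads(zlib.decompress(blob[offset:offset + size]))
        offset += size
        rows.extend(zip(*(columns[name] for name in schema)))
    return rows


if __name__ == "__main__":
    schema = ["id", "name"]
    partitions = [[(1, "a"), (2, "b")], [(3, "c")]]
    batches = [encode_partition(p, schema) for p in partitions]
    response = build_response(collect_on_driver(batches))
    print(decode_on_client(response, schema))
```

The point of the sketch is the division of labor: encoding and compression happen per partition, the driver only concatenates bytes it never inspects, and the client chooses a decoder from the flag. A real implementation would presumably use Arrow's own serialization in place of the toy encoding.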
--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
