xuechendi commented on pull request #34396: URL: https://github.com/apache/spark/pull/34396#issuecomment-953449947
@sunchao, yes, that is what we planned and did. We implemented the most commonly used operators and expressions (e.g. project/filter, join, sort, aggregate, window) on top of an Arrow-based ColumnVector. For Spark, we plan to submit our Arrow data source as a new `spark.read.format` to load Parquet/ORC directly. We also plan to submit our Arrow-based InMemoryStore (RDD cache), RowToColumnarExec, and PandasUDF serializer (which copies the ArrowColumnVector's underlying buffers directly to the Python context) once the Spark community is aligned on using ArrowWritableColumnVector as an alternative to the current WritableColumnVector.

The short answer to why we add ArrowWritableColumnVector here, rather than simply loading the data into Arrow and reading it through ArrowColumnVector, is that we want to use the WritableColumnVector APIs in RowToColumnarExec and the Parquet reader to write data in Arrow format.
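To make the read-only vs. writable distinction concrete, here is a minimal self-contained sketch. The interface and class names below (`ReadableIntVector`, `WritableIntVector`, `ArrowBackedIntVector`, `rowToColumnar`) are illustrative stand-ins invented for this example, not Spark's actual classes; the real split is between Spark's read-only `ColumnVector`/`ArrowColumnVector` and the append-style `WritableColumnVector` that RowToColumnarExec and the vectorized Parquet reader write into:

```java
import java.util.Arrays;

// Read-side API, analogous to Spark's read-only ColumnVector / ArrowColumnVector.
interface ReadableIntVector {
    int getInt(int rowId);
    int numRows();
}

// Write-side API, analogous to WritableColumnVector's append/put methods.
// Producers like a row-to-columnar step or a Parquet decoder target this.
interface WritableIntVector extends ReadableIntVector {
    void appendInt(int value);
}

// Hypothetical writable vector. The real ArrowWritableColumnVector writes into
// Arrow ValueVector buffers so the result can be shared zero-copy; a plain
// int[] stands in here only to keep the sketch self-contained.
class ArrowBackedIntVector implements WritableIntVector {
    private int[] buffer = new int[4];
    private int count = 0;

    @Override
    public void appendInt(int value) {
        if (count == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        buffer[count++] = value;
    }

    @Override
    public int getInt(int rowId) { return buffer[rowId]; }

    @Override
    public int numRows() { return count; }
}

public class Sketch {
    // A row-to-columnar conversion only needs the write API; if the vector is
    // Arrow-backed, the output is already in Arrow format with no extra copy.
    static void rowToColumnar(int[] rows, WritableIntVector out) {
        for (int v : rows) {
            out.appendInt(v);
        }
    }

    public static void main(String[] args) {
        ArrowBackedIntVector vec = new ArrowBackedIntVector();
        rowToColumnar(new int[] {1, 2, 3, 4, 5}, vec);
        System.out.println(vec.numRows() + " " + vec.getInt(4));
    }
}
```

With only the read-side `ArrowColumnVector`, a producer would have to build the data elsewhere and then wrap it; exposing the write API on an Arrow-backed vector lets existing writers produce Arrow buffers directly.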
