HyukjinKwon edited a comment on pull request #26783: URL: https://github.com/apache/spark/pull/26783#issuecomment-950552721
@tgravescs and @revans2 FYI. I am thinking about introducing an API such as `DataFrame.mapInArrow`, analogous to [`DataFrame.mapInPandas`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.mapInPandas.html) and `RDD.mapPartitions`. The API shape would look like:

Scala:

```scala
def mapInArrow(
    f: Iterator[ArrowRecordBatch] => Iterator[ArrowRecordBatch],
    schema: StructType): DataFrame = {
  // ...
}
```

```scala
df.mapInArrow(_.map { case arrowBatch: ArrowRecordBatch =>
  // do something with the `ArrowRecordBatch` and create a new `ArrowRecordBatch`.
  // ...
  arrowBatch
}, df.schema).show()
```

Python:

```python
def mapInArrow(
        self,
        func: Callable[[Iterator[pyarrow.RecordBatch]], Iterator[pyarrow.RecordBatch]],
        schema: StructType) -> DataFrame:
    # ...
```

```python
def do_something(iterator):
    for arrow_batch in iterator:
        # do something with the `pyarrow.RecordBatch` and create a new `pyarrow.RecordBatch`.
        # ...
        yield arrow_batch

df.mapInArrow(do_something, df.schema).show()
```

I would like to check whether this API could be useful - I am thinking about `RowToColumnarExec` and `ColumnarToRowExec`. If it can be leveraged there, how do you like this?

cc @BryanCutler @viirya @ueshin too FYI - how do you like this?
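To make the proposal a bit more concrete, here is a minimal sketch of what a user function might look like under the proposed Python API. This assumes the proposed `mapInArrow` signature above and a hypothetical DataFrame `df` with columns `id` and `value`; the column names and the `double_values` helper are only for illustration, and the final call is shown as a comment since `mapInArrow` does not exist yet.

```python
import pyarrow as pa
import pyarrow.compute as pc

def double_values(iterator):
    # Each element of the iterator is a pyarrow.RecordBatch; the function
    # yields new batches matching the schema passed to mapInArrow.
    for batch in iterator:
        # Multiply the 'value' column by 2 with an Arrow compute kernel,
        # staying columnar end-to-end (no pandas conversion).
        value_idx = batch.schema.get_field_index("value")
        id_idx = batch.schema.get_field_index("id")
        doubled = pc.multiply(batch.column(value_idx), 2)
        yield pa.RecordBatch.from_arrays(
            [batch.column(id_idx), doubled], names=["id", "value"])

# With the proposed API, the call would mirror mapInPandas:
# df.mapInArrow(double_values, df.schema).show()
```

The point of the sketch is that users who already work with Arrow (e.g. columnar plugins built around `RowToColumnarExec`/`ColumnarToRowExec`) could plug in per-batch logic without paying the pandas round-trip that `mapInPandas` implies.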
