HyukjinKwon edited a comment on pull request #26783: URL: https://github.com/apache/spark/pull/26783#issuecomment-950552721
@tgravescs and @revans2 FYI. I am thinking about introducing an API such as `DataFrame.mapInArrow`, analogous to [`DataFrame.mapInPandas`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.mapInPandas.html) and `RDD.mapPartitions`. The API shape would look like:

Scala:

```scala
def mapInArrow(
    f: Iterator[ArrowRecordBatch] => Iterator[ArrowRecordBatch],
    schema: StructType): DataFrame = {
  // ...
}
```

```scala
df.mapInArrow(_.map { case arrowBatch: ArrowRecordBatch =>
  // do something with the `ArrowRecordBatch` and create a new `ArrowRecordBatch`.
  // ...
  arrowBatch
}, df.schema).show()
```

Python:

```python
def mapInArrow(
        self,
        func: Callable[[Iterator[pyarrow.RecordBatch]], Iterator[pyarrow.RecordBatch]],
        schema: StructType) -> DataFrame:
    # ...
```

```python
def do_something(iterator):
    for arrow_batch in iterator:
        # do something with the `pyarrow.RecordBatch` and create a new `pyarrow.RecordBatch`.
        # ...
        yield arrow_batch

df.mapInArrow(do_something, df.schema).show()
```

I would like to check whether this API could be useful - I am thinking about `RowToColumnarExec` and `ColumnarToRowExec`. If it can be leveraged there, how do you like this?

cc @BryanCutler @viirya @ueshin too FYI - how do you like this?
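To make the proposal a bit more concrete, here is a minimal sketch of what a user function might look like under the proposed Python API. This assumes the proposed `mapInArrow` signature above and a hypothetical DataFrame `df` with columns `id` and `value`; the column names and the `double_values` helper are only for illustration, and the final call is shown as a comment since `mapInArrow` does not exist yet.

```python
import pyarrow as pa
import pyarrow.compute as pc

def double_values(iterator):
    # Each element of the iterator is a pyarrow.RecordBatch; the function
    # yields new batches matching the schema passed to mapInArrow.
    for batch in iterator:
        # Multiply the 'value' column by 2 with an Arrow compute kernel,
        # staying columnar end-to-end (no pandas conversion).
        value_idx = batch.schema.get_field_index("value")
        id_idx = batch.schema.get_field_index("id")
        doubled = pc.multiply(batch.column(value_idx), 2)
        yield pa.RecordBatch.from_arrays(
            [batch.column(id_idx), doubled], names=["id", "value"])

# With the proposed API, the call would mirror mapInPandas:
# df.mapInArrow(double_values, df.schema).show()
```

The point of the sketch is that users who already work with Arrow (e.g. columnar plugins built around `RowToColumnarExec`/`ColumnarToRowExec`) could plug in per-batch logic without paying the pandas round-trip that `mapInPandas` implies.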
