jpivarski commented on pull request #34505:
URL: https://github.com/apache/spark/pull/34505#issuecomment-964451441


   **Thank you!!!** This is exactly what we've been looking for; with this in place, we'll be able to process data like this:
   
   ```python
   >>> import awkward as ak
   >>> df = spark.createDataFrame([([0, 1, 2],), ([],), ([3, 4],), ([5],), ([6, 7, 8, 9],)], ("x",))
   >>> def func(iterator):
   ...     for batch in iterator:
   ...         akarray = ak.from_arrow(batch)
   ...         # name the pair members so the output schema can refer to them
   ...         combinations = ak.combinations(akarray["x"], 2, axis=1, fields=["left", "right"])
   ...         yield from ak.to_arrow_table(ak.Array({"comb": combinations})).to_batches()
   ... 
   >>> df.mapInArrow(func, "comb array<struct<left: bigint, right: bigint>>")
   ```
   
   (Assuming that data with nested structure, such as the `DataFrame[x: array<bigint>]` above, are created in some scalable way, most likely by [Laurelin](https://github.com/spark-root/laurelin). Our ROOT data are full of these nested lists. [ak.combinations](https://awkward-array.readthedocs.io/en/latest/_auto/ak.combinations.html), etc., are Awkward Array functions for dealing with these structures without Python iteration. Doing the same thing through Pandas turns all of our vectorized arrays into Python lists, but the Spark → Arrow → Awkward → Arrow → Spark transformation keeps them as contiguous arrays with minimal memory and CPU footprint...)
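
   To make that comparison concrete, here is a minimal standalone sketch, outside Spark, assuming only `pyarrow` and Awkward Array 1.x; the hand-built RecordBatch stands in for the batches that `mapInArrow` would supply:
   
   ```python
   import pyarrow as pa
   import awkward as ak
   
   # Stand-in for one RecordBatch that mapInArrow would pass to func:
   batch = pa.record_batch([pa.array([[0, 1, 2], [], [3, 4]])], names=["x"])
   
   # Arrow -> Awkward keeps the nested lists as contiguous offset/content
   # buffers, so ak.combinations is vectorized, with no per-row Python loop:
   akarray = ak.from_arrow(batch)
   pairs = ak.combinations(akarray["x"], 2, axis=1)
   print(ak.to_list(pairs))  # [[(0, 1), (0, 2), (1, 2)], [], [(3, 4)]]
   
   # The Pandas route materializes one Python object per row instead,
   # which is exactly the conversion cost described above:
   series = batch.to_pandas()["x"]
   print([type(row) for row in series])
   ```
   
   Only the `to_pandas()` step leaves the contiguous-buffer world; everything before it is a view over the original Arrow buffers.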


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
