[GitHub] [spark] HyukjinKwon commented on pull request #33954: [SPARK-36709][PYTHON] Support new syntax for specifying index type and name in pandas API on Spark

GitBox Sun, 12 Sep 2021 17:22:45 -0700


HyukjinKwon commented on pull request #33954:
URL: https://github.com/apache/spark/pull/33954#issuecomment-917742920



   When the index type is not specified, the default index is attached to the
   output DataFrame. Using the default index usually incurs a performance
   penalty. See the pseudo code below.
   
   Assume we're using the distributed-sequence default index
   (https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type),
   and we call a pandas API such as:
   
   ```python
   df.apply(func)
   ```
   
   Internally, Spark operations like the following would be performed (pseudo code):
   
   **Without index type hint:**
   
   ```python
   spark_df = df.to_spark()
   # compute data part
   spark_df.select(udf(func, "data column types without index"))
   # compute index part and attach index
   spark.createDataFrame(spark_df.rdd.zipWithIndex().map(lambda p: (p[1], p[0])))
   ```
   
   **With index type hint:**
   
   ```python
   spark_df = df.to_spark()
   spark_df.select(udf(func, "column types with both data and index"))
   ```
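
To make the cost difference concrete, here is a minimal plain-Python sketch of the two paths. It does not use Spark: partitions are modeled as a list of lists, and the helper names (`zip_with_index`, `apply_without_index_hint`, `apply_with_index_hint`) are hypothetical stand-ins for illustration, not part of any Spark API. The assumption it illustrates is that attaching a sequence index needs an extra pass to learn partition sizes before global positions can be assigned, whereas a function that already returns the index avoids that pass.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Plain-Python stand-in for zipWithIndex over a list of partitions."""
    # Pass 1: count rows per partition to learn each partition's start offset.
    sizes = [len(part) for part in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: pair every row with its global position.
    return [
        [(row, offset + i) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


def apply_without_index_hint(partitions, func):
    # Compute the data part only, then attach a default sequence index.
    data = [[func(row) for row in part] for part in partitions]
    indexed = zip_with_index(data)
    return [[(idx, value) for value, idx in part] for part in indexed]


def apply_with_index_hint(partitions, func):
    # Rows already carry (index, value), so one pass produces both.
    return [[(idx, func(value)) for idx, value in part] for part in partitions]
```

In the first path every row is visited by `func` and then the whole result is traversed again to attach the index; in the second, the index travels with the data through a single pass.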
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


