[GitHub] [spark] zhengruifeng commented on pull request #38130: [SPARK-40556][PS][SQL][WIP] Eagerly clean temp RDD cached in `AttachDistributedSequenceExec`

GitBox Thu, 06 Oct 2022 05:31:57 -0700


zhengruifeng commented on PR #38130:
URL: https://github.com/apache/spark/pull/38130#issuecomment-1269956288


   test script:
   ```
   sc.setLogLevel("WARN")
   
   import pyspark.pandas as ps
   
   from pyspark.sql import functions as F, Column, DataFrame as SparkDataFrame
   
   #Read parquet data and convert it to pandas_api
   # pandas_hope_df = spark.read.parquet("/mnt/edl/raw/pdda_data_wrangling/dg39759/bom_explosion/hope_df").pandas_api()
   pandas_hope_df = spark.range(0, 100, 1, 10).withColumn("TOP_ASSEMBLY", F.col("id") % 33).pandas_api()
   
   #ps.set_option('compute.default_index_type', 'distributed')
   
   #add index row for grouping of Top Assembly Column
   ps.set_option('compute.ops_on_diff_frames', True)
   df_index = pandas_hope_df['TOP_ASSEMBLY'].reset_index()
   df_index['index'] = (df_index.groupby(['TOP_ASSEMBLY']).cumcount()==0).astype(int)
   df_index['index'] = df_index['index'].cumsum()
   
   #merge original pandas_hope_df & generated df_index, using TOP_ASSEMBLY as the key
   df_out = ps.merge(pandas_hope_df, df_index, how="inner", left_on=["TOP_ASSEMBLY"], right_on=["TOP_ASSEMBLY"])
   ps.reset_option('compute.ops_on_diff_frames')
   
   #convert resulting pandas dataframe back to spark dataframe.
   spark_df_out = df_out.to_spark()
   
   #spark_df_out.cache().count()
   spark_df_out.count()
   
   sc._jsc.getPersistentRDDs()
   ```
   
   before:
   ```
   Out[1]: {25: JavaObject id=o424, 7: JavaObject id=o425, 16: JavaObject id=o426}
   ```
   
   
   
   after:
   ```
   22/10/06 20:31:37 WARN AttachDistributedSequenceExec: clean up cached RDD(10) in AttachDistributedSequenceExec(197)
   22/10/06 20:31:37 WARN AttachDistributedSequenceExec: clean up cached RDD(21) in AttachDistributedSequenceExec(404)
   22/10/06 20:31:37 WARN AttachDistributedSequenceExec: clean up cached RDD(32) in AttachDistributedSequenceExec(535)
   Out[1]: {}
   ```
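
The manual before/after comparison could also be turned into an assertion. This is only a rough sketch: `persistent_rdd_ids` and `leaked_rdd_ids` are hypothetical helper names (not part of Spark or of this PR), and it assumes that `sc._jsc.getPersistentRDDs()`, the same private call used in the script, returns a dict-like mapping keyed by RDD id, as the output above suggests.

```python
def persistent_rdd_ids(sc):
    """Return the sorted ids of the RDDs that `sc` currently reports as persisted.

    Assumption: sc._jsc.getPersistentRDDs() yields a dict-like mapping whose
    keys are RDD ids; iterating the mapping iterates those keys.
    """
    return sorted(int(rdd_id) for rdd_id in sc._jsc.getPersistentRDDs())


def leaked_rdd_ids(sc, before):
    """Return the ids persisted now that were not already in `before`."""
    already_there = set(before)
    return [rdd_id for rdd_id in persistent_rdd_ids(sc) if rdd_id not in already_there]
```

In a pyspark shell one might record `before = persistent_rdd_ids(sc)` ahead of the script, run it, and then check `leaked_rdd_ids(sc, before) == []`. Snapshotting first keeps RDDs that were cached on purpose before the test from being counted as leaks.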



