zhengruifeng commented on PR #38130:
URL: https://github.com/apache/spark/pull/38130#issuecomment-1269956288
test script:
```
sc.setLogLevel("WARN")
import pyspark.pandas as ps
from pyspark.sql import functions as F, Column, DataFrame as SparkDataFrame
# Read parquet data and convert it to pandas-on-Spark (commented out; a synthetic range is used below)
# pandas_hope_df = spark.read.parquet("/mnt/edl/raw/pdda_data_wrangling/dg39759/bom_explosion/hope_df").pandas_api()
pandas_hope_df = spark.range(0, 100, 1, 10).withColumn("TOP_ASSEMBLY", F.col("id") % 33).pandas_api()
#ps.set_option('compute.default_index_type', 'distributed')
# Add an index column for grouping by TOP_ASSEMBLY
ps.set_option('compute.ops_on_diff_frames', True)
df_index = pandas_hope_df['TOP_ASSEMBLY'].reset_index()
df_index['index'] = (df_index.groupby(['TOP_ASSEMBLY']).cumcount() == 0).astype(int)
df_index['index'] = df_index['index'].cumsum()
# Merge the original pandas_hope_df and the generated df_index, using TOP_ASSEMBLY as the key
df_out = ps.merge(pandas_hope_df, df_index, how="inner",
                  left_on=["TOP_ASSEMBLY"], right_on=["TOP_ASSEMBLY"])
ps.reset_option('compute.ops_on_diff_frames')
# Convert the resulting pandas-on-Spark DataFrame back to a Spark DataFrame
spark_df_out = df_out.to_spark()
#spark_df_out.cache().count()
spark_df_out.count()
sc._jsc.getPersistentRDDs()
```
before:
```
Out[1]: {25: JavaObject id=o424, 7: JavaObject id=o425, 16: JavaObject id=o426}
```
after:
```
22/10/06 20:31:37 WARN AttachDistributedSequenceExec: clean up cached RDD(10) in AttachDistributedSequenceExec(197)
22/10/06 20:31:37 WARN AttachDistributedSequenceExec: clean up cached RDD(21) in AttachDistributedSequenceExec(404)
22/10/06 20:31:37 WARN AttachDistributedSequenceExec: clean up cached RDD(32) in AttachDistributedSequenceExec(535)
Out[1]: {}
```
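For reference, a minimal sketch of how the final check could be turned into an assertion at the end of the script above (assuming the same `sc` handle; `getPersistentRDDs()` returns a Java map, so `size()` is used on the py4j proxy):
```python
# Hypothetical assertion appended after spark_df_out.count() in the script above:
# once the pandas-on-Spark plan has executed, no RDD cached internally by
# AttachDistributedSequenceExec should remain registered with the SparkContext.
cached = sc._jsc.getPersistentRDDs()  # java.util.Map[Int, JavaRDD]
assert cached.size() == 0, f"leaked cached RDDs: {cached}"
```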