[ https://issues.apache.org/jira/browse/SPARK-40564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
YuNing Liu updated SPARK-40564:
-------------------------------
    Attachment: Value of df.png

> The distributed runtime has one more identical process with a small amount of data on the master
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40564
>                 URL: https://issues.apache.org/jira/browse/SPARK-40564
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark
>    Affects Versions: 3.3.0
>        Environment: Hadoop 3.3.1
> Python 3.8
> Spark 3.3.0
> pyspark 3.3.0
> Ubuntu 20.04
>            Reporter: YuNing Liu
>            Priority: Blocker
>        Attachments: Part of the code.png, The output of the abnormal process.png, Value of df.png
>
>
> When I run my program using the DataFrame structure in pyspark.pandas, an abnormal extra process appears on the master. My DataFrame has three columns named "id", "path", and "category". It contains more than 300,000 rows in total, and the "id" values are only 1, 2, 3, and 4. When I call groupBy("id").apply(func), my four nodes run normally, but there is an abnormal process on the master that contains 1001 rows. This process also executes the code in func, and its data is divided into four parts, each containing more than 200 rows. When I collect the results from each node, I can only collect the results for those 1001 rows; the results for the 300,000 rows are lost. When I reduced the data to about 20,000 rows, the problem still occurred and the abnormal process still contained 1001 rows. I suspect there is a problem with the implementation of this API. I tried setting the number of data partitions to 4, but the problem did not go away.
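
For reference, below is a minimal sketch that approximates the setup described above. The reporter's actual code is only available in the attached "Part of the code.png"; the column contents, the row count, and the body of func here are placeholders, not the reporter's implementation.

# A minimal sketch approximating the reported setup; column values, the
# row count, and the body of func are placeholders.
import pandas as pd
import pyspark.pandas as ps

# Toy stand-in for the reported DataFrame: columns "id", "path", "category",
# with "id" taking only the values 1, 2, 3 and 4.
n = 20000
psdf = ps.DataFrame({
    "id": [i % 4 + 1 for i in range(n)],
    "path": ["/data/file_%d" % i for i in range(n)],
    "category": [i % 10 for i in range(n)],
})

# func is called once per "id" group and receives that group's rows as a
# pandas DataFrame. This placeholder body just reports the group size, so
# the collected result shows how many rows each call actually received.
def func(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"id": [pdf["id"].iloc[0]], "rows": [len(pdf)]})

result = psdf.groupby("id").apply(func)
print(result.to_pandas())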