[ https://issues.apache.org/jira/browse/SPARK-40564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
YuNing Liu updated SPARK-40564:
-------------------------------
    Attachment: Value of df.png

> The distributed runtime has one more identical process with a small amount of data on the master
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40564
>                 URL: https://issues.apache.org/jira/browse/SPARK-40564
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark
>    Affects Versions: 3.3.0
>        Environment: Hadoop 3.3.1
> Python 3.8
> Spark 3.3.0
> pyspark 3.3.0
> Ubuntu 20.04
>            Reporter: YuNing Liu
>            Priority: Blocker
>        Attachments: Part of the code.png, The output of the abnormal process.png, Value of df.png
>
>
> When I run my program using the DataFrame structure in pyspark.pandas, an abnormal extra process appears on the master. My DataFrame has three columns named "id", "path", and "category". It contains more than 300,000 rows in total, and the "id" values are only 1, 2, 3, and 4. When I call groupBy("id").apply(func), my four nodes run normally, but there is an abnormal process on the master that contains 1001 rows. This process also executes the code in func, and its data is divided into four parts, each containing more than 200 rows. When I collect the results from each node, I can only collect the results for those 1001 rows; the results for the 300,000 rows are lost. When I reduced the data to about 20,000 rows, the problem still occurred and the abnormal process still contained 1001 rows. I suspect there is a problem with the implementation of this API. I tried setting the number of data partitions to 4, but the problem did not go away.
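
For reference, below is a minimal sketch that approximates the setup described above. The reporter's actual code is only available in the attached "Part of the code.png"; the column contents, the row count, and the body of func here are placeholders, not the reporter's implementation.

# A minimal sketch approximating the reported setup; column values, the
# row count, and the body of func are placeholders.
import pandas as pd
import pyspark.pandas as ps

# Toy stand-in for the reported DataFrame: columns "id", "path", "category",
# with "id" taking only the values 1, 2, 3 and 4.
n = 20000
psdf = ps.DataFrame({
    "id": [i % 4 + 1 for i in range(n)],
    "path": ["/data/file_%d" % i for i in range(n)],
    "category": [i % 10 for i in range(n)],
})

# func is called once per "id" group and receives that group's rows as a
# pandas DataFrame. This placeholder body just reports the group size, so
# the collected result shows how many rows each call actually received.
def func(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"id": [pdf["id"].iloc[0]], "rows": [len(pdf)]})

result = psdf.groupby("id").apply(func)
print(result.to_pandas())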