[ 
https://issues.apache.org/jira/browse/SPARK-40564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuNing Liu updated SPARK-40564:
-------------------------------
    Description: When I run my program with the DataFrame structure in 
pyspark.pandas, an abnormal extra process appears on the master. My DataFrame 
contains three columns named "id", "path", and "category". It holds more than 
300,000 rows in total, and the "id" values are only 1, 2, 3, and 4. When I call 
"groupBy("id").apply(func)", my four nodes run normally, but there is an 
abnormal process on the master that contains 1001 rows. This process also 
executes the code in "func" and is split into four parts, each containing more 
than 200 rows. When I collect the results from the nodes, I can only collect 
the results for those 1001 rows; the results for the 300,000 rows are lost. 
When I reduced the data to about 20,000 rows, the problem still occurred and 
the abnormal process still contained 1001 rows. I suspect there is a problem 
with the implementation of this API. I tried setting the number of data 
partitions to 4, but the problem did not go away. The values of the DataFrame, 
part of the code, and the output of the abnormal process are attached.
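
A minimal sketch of the reported call pattern follows, assuming placeholder 
column values and a placeholder body for func (the reporter's actual code is 
only available in the attached "Part of the code.png"). Note that 
pandas-on-Spark spells the grouping call groupby rather than groupBy, and that 
apply() without a return-type hint infers the output schema by running func 
against a limited sample of rows (governed by the "compute.shortcut_limit" 
option, default 1000), which may be related to the 1001-row process described 
above.

import pyspark.pandas as ps

# Placeholder frame mirroring the described schema: "id" restricted to
# {1, 2, 3, 4}, plus "path" and "category" columns (the real data holds
# more than 300,000 rows).
psdf = ps.DataFrame({
    "id": [1, 2, 3, 4] * 5,
    "path": ["/data/file_%d" % i for i in range(20)],
    "category": ["a", "b"] * 10,
})

def func(pdf):
    # Placeholder for the real per-group logic; pdf is a pandas DataFrame
    # holding one "id" group at a time.
    return pdf

# Reported call pattern. Without a return-type hint on func, pandas-on-Spark
# first runs func on a small sample of rows to infer the output schema.
result = psdf.groupby("id").apply(func)
print(len(result))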

> The distributed runtime spawns one extra, identical process with a small 
> amount of data on the master
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40564
>                 URL: https://issues.apache.org/jira/browse/SPARK-40564
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark
>    Affects Versions: 3.3.0
>         Environment: Hadoop 3.3.1
> Python 3.8
> Spark 3.3.0
> pyspark 3.3.0
> ubuntu 20.04
>            Reporter: YuNing Liu
>            Priority: Blocker
>         Attachments: Part of the code.png, The output of the abnormal 
> process.png, Value of df.png
>
>
> When I run my program with the DataFrame structure in pyspark.pandas, an 
> abnormal extra process appears on the master. My DataFrame contains three 
> columns named "id", "path", and "category". It holds more than 300,000 rows 
> in total, and the "id" values are only 1, 2, 3, and 4. When I call 
> "groupBy("id").apply(func)", my four nodes run normally, but there is an 
> abnormal process on the master that contains 1001 rows. This process also 
> executes the code in "func" and is split into four parts, each containing 
> more than 200 rows. When I collect the results from the nodes, I can only 
> collect the results for those 1001 rows; the results for the 300,000 rows 
> are lost. When I reduced the data to about 20,000 rows, the problem still 
> occurred and the abnormal process still contained 1001 rows. I suspect there 
> is a problem with the implementation of this API. I tried setting the number 
> of data partitions to 4, but the problem did not go away. The values of the 
> DataFrame, part of the code, and the output of the abnormal process are 
> attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
