[ https://issues.apache.org/jira/browse/SPARK-47650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
yinan zhan updated SPARK-47650:
-------------------------------
Description:
Partition 0 is processed twice.
The reproduction code below triggers the issue on every run.
In the output, the line starting with "0cool" is printed twice.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("llm hard negs records") \
    .master("local[8]") \
    .getOrCreate()

def process_partition(index, partition):
    results = []
    s = 0
    for _ in partition:
        row = {"Result": "cool"}
        results.append(row)
        s += 1
    print(str(index) + "cool" + str(s))
    return results

data = list(range(2000))
results_rdd = spark.sparkContext.parallelize(data) \
    .repartition(8) \
    .mapPartitionsWithIndex(process_partition)
results_df = results_rdd.toDF(["Query"])
output_path = "/tmp/bc_inputs6"
results_df.write.json(output_path, mode="overwrite")
{code}
was:
Partition 0 is processed twice.
The reproduction code below triggers the issue on every run.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("llm hard negs records") \
    .master("local[8]") \
    .getOrCreate()

def process_partition(index, partition):
    results = []
    s = 0
    for _ in partition:
        row = {"Result": "cool"}
        results.append(row)
        s += 1
    print(str(index) + "cool" + str(s))
    return results

data = list(range(2000))
results_rdd = spark.sparkContext.parallelize(data) \
    .repartition(8) \
    .mapPartitionsWithIndex(process_partition)
results_df = results_rdd.toDF(["Query"])
output_path = "/tmp/bc_inputs6"
results_df.write.json(output_path, mode="overwrite")
{code}
> Spark DataFrame processed the same partition twice, which can be reproduced
> with very simple code.
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-47650
> URL: https://issues.apache.org/jira/browse/SPARK-47650
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.5.1
> Reporter: yinan zhan
> Priority: Blocker
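One possible explanation, offered as an assumption rather than a confirmed diagnosis: `toDF(["Query"])` receives only column names, so PySpark has to infer the schema, and it appears to do so by reading the first row of the RDD before the write runs. Because the RDD is not cached, that sampling step and the later write would each evaluate `process_partition` on partition 0. The sketch below imitates this behaviour in plain Python; `LazyPartitions`, `first`, and `collect` are hypothetical stand-ins written for illustration, not Spark APIs.

```python
from collections import Counter


class LazyPartitions:
    """Hypothetical stand-in for an uncached RDD: every action recomputes."""

    def __init__(self, partitions, func):
        self.partitions = partitions
        self.func = func
        self.calls = Counter()  # number of times func ran, per partition index

    def _compute(self, index):
        # Nothing is cached, so each call re-runs func on the partition.
        self.calls[index] += 1
        return list(self.func(index, iter(self.partitions[index])))

    def first(self):
        # Scan partitions in order and stop at the first one that yields a row.
        for index in range(len(self.partitions)):
            rows = self._compute(index)
            if rows:
                return rows[0]
        raise ValueError("empty dataset")

    def collect(self):
        # Evaluate every partition, as a full write job would.
        return [row
                for index in range(len(self.partitions))
                for row in self._compute(index)]


def process_partition(index, partition):
    return [{"Result": "cool"} for _ in partition]


data = list(range(2000))
partitions = [data[i::8] for i in range(8)]  # 8 partitions of 250 items
lazy = LazyPartitions(partitions, process_partition)

sample_row = lazy.first()  # stands in for schema inference (assumed)
all_rows = lazy.collect()  # stands in for the write job

print(dict(lazy.calls))
```

In this sketch partition 0 is computed twice and every other partition once, which matches the reported symptom. If the assumption holds, passing an explicit schema when building the DataFrame, or caching `results_rdd` before calling `toDF`, would be the usual ways to avoid the second evaluation; neither has been verified against this report.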