Alberto Andreotti created SPARK-50462:
-----------------------------------------
Summary: Spark creates unnecessary copies of data during explode
operations
Key: SPARK-50462
URL: https://issues.apache.org/jira/browse/SPARK-50462
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0
Environment: Ubuntu 22, Python 3.8, Spark 3.5.0.
Reporter: Alberto Andreotti
+*Observed Behavior:*+
When an explode() operation creates copies of row elements, memory grows in
proportion to the number of new copies.
+*Expected Behavior:*+
Memory should stay constant regardless of the number of new rows, because row
elements are {*}+immutable+{*}, so there is no point in keeping multiple copies.
At most, memory should grow with the number of partitions.
+*Discussion:*+
This matters especially when the duplicated data is a heavy buffer, for example
a binary file: you don't want multiple copies of it created in memory. The
copies are also unnecessary, because explode() operations don't modify the
values of elements.
Check this example:
`
from pyspark.sql.functions import lit, col, explode, array

# Load a single large binary file as a one-row DataFrame
dicom_df = spark.read.format("binaryFile").load(my_200MB_file)

# Define the duplication factor K
K = 14

# Build an array column [0, 1, ..., K-1]
ranges = array([lit(i) for i in range(K)])

# Add the array column and explode it: one row per range value
expanded_df = (
    dicom_df
    .withColumn("range", ranges)
    .select(col("content"), explode(col("range")).alias("index"))
)

# Do something with the DF
expanded_df.write.parquet("./pepe.parquet")
`
The number of copies of the binary buffer shouldn't change with 'K'.
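For contrast, the sharing this report asks for can be sketched in plain Python. This only illustrates the immutability argument (the `buffer` name and sizes are illustrative); it is not a claim about Spark's internal row format:

```python
# Plain-Python sketch of the expected behavior: because the payload is
# immutable, K exploded rows can all reference one shared buffer instead
# of holding K independent copies.
buffer = bytes(1024 * 1024)  # stands in for the heavy binary "content" column
K = 14

# "Explode": K rows that all reference the same buffer object
exploded = [(buffer, i) for i in range(K)]

# Every row points at the identical object; payload memory is one buffer,
# not K buffers.
assert all(row[0] is buffer for row in exploded)
```

Here memory for the payload stays constant in K because the rows hold references, not copies; the report's point is that an explode over an immutable column could in principle behave the same way.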
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]