[ https://issues.apache.org/jira/browse/SPARK-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601687#comment-14601687 ]

Matt Cheah commented on SPARK-8597:
-----------------------------------

I did some more digging. The memory is being taken up by a hash map in 
DynamicPartitionWriterContainer that maps each Parquet output path to the 
writer object responsible for writing to that path.
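
As a rough illustration of why that map gets expensive, here is a minimal 
sketch of the pattern (the class and method names are mine, simplified for 
illustration, not Spark's actual DynamicPartitionWriterContainer code): one 
writer is created and kept open per distinct partition path.

{code}
import scala.collection.mutable

// Simplified illustration of a per-partition-path writer cache
// (illustrative only; not Spark's actual implementation).
class PartitionWriterCache[W](newWriter: String => W) {
  // One writer per distinct partition path, e.g. "A=3/B=17".
  private val writers = mutable.HashMap.empty[String, W]

  // Look up, or lazily create, the writer for a row's partition path.
  // With N distinct (A, B) combinations, N writers stay open at once,
  // so memory grows linearly with the number of partitions.
  def writerFor(partitionPath: String): W =
    writers.getOrElseUpdate(partitionPath, newWriter(partitionPath))

  def openWriterCount: Int = writers.size
}
{code}

If nothing evicts entries until the task finishes, every distinct (A, B) pair 
keeps a writer, plus its Parquet buffers, alive for the whole write.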

YourKit profiling shows each output writer taking about 1 MB of memory; I'm 
not sure if I'm misreading the profile, but that would explain the memory 
explosion: with 128*128 = 16,384 partitions, roughly 1 MB per open writer 
already adds up to about 16 GB.

I'm trying to figure out a way to reduce the memory footprint here and will 
put up a PR accordingly; one direction I'm considering is sketched below. 
[~marmbrus], do you have any thoughts on this?
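
This is just a sketch of the idea, not a concrete proposal yet (the Writer 
trait and method names below are placeholders, not Spark's API): cap how many 
writers are open at once and close the least-recently-used one when the cap 
is exceeded.

{code}
import scala.collection.mutable

// Hypothetical mitigation sketch: bound the number of concurrently open
// writers, closing the least-recently-used one when the cap is hit.
trait Writer { def write(row: String): Unit; def close(): Unit }

class BoundedWriterCache(maxOpen: Int, newWriter: String => Writer) {
  // LinkedHashMap keeps insertion order; re-inserting on access makes the
  // head the least-recently-used entry.
  private val writers = mutable.LinkedHashMap.empty[String, Writer]

  def writerFor(path: String): Writer = {
    val w = writers.remove(path).getOrElse(newWriter(path))
    writers.put(path, w)                      // move to most-recent position
    if (writers.size > maxOpen) {
      val (lruPath, lruWriter) = writers.head // least recently used
      lruWriter.close()                       // flush and release its buffers
      writers.remove(lruPath)
    }
    w
  }

  def closeAll(): Unit = {
    writers.valuesIterator.foreach(_.close())
    writers.clear()
  }
}
{code}

The catch is that a partition whose writer was evicted may see more rows 
later, so it would have to append or start a new file part; sorting rows by 
the partition columns first would avoid that, at the cost of a sort.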

> DataFrame partitionBy memory pressure scales extremely poorly
> -------------------------------------------------------------
>
>                 Key: SPARK-8597
>                 URL: https://issues.apache.org/jira/browse/SPARK-8597
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Matt Cheah
>            Priority: Critical
>         Attachments: table.csv
>
>
> I'm running into a strange memory scaling issue when using the partitionBy 
> feature of DataFrameWriter. 
> I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 
> entries, about 20 KB on disk. There are 32 distinct values for column A and 
> 32 distinct values for column B, and every combination of the two appears 
> once (column C holds a random number for each row - its value doesn't 
> matter), producing a data set of 32*32 rows. I imported this into Spark and 
> ran partitionBy("A", "B") to test its performance. This should create a 
> nested directory structure with 32 folders, each containing another 32 
> folders. The job uses about 10 GB of RAM and runs slowly. If I increase the 
> number of entries in the table from 32*32 to 128*128, I get a Java heap 
> space OutOfMemoryError no matter how large I set the heap.
> Scala code:
> {code}
> var df = sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").load("table.csv") 
> df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet")
> {code}
> How I ran the Spark shell:
> {code}
> bin/spark-shell --driver-memory 16g --master local[8] --packages 
> com.databricks:spark-csv_2.10:1.0.3
> {code}


