Matt Cheah created SPARK-8597:
---------------------------------

             Summary: DataFrame partitionBy scales extremely poorly
                 Key: SPARK-8597
                 URL: https://issues.apache.org/jira/browse/SPARK-8597
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: Matt Cheah
            Priority: Critical


I'm running into a strange memory scaling issue when using the partitionBy 
feature of DataFrameWriter. 

I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 
different entries, with size on disk of about 20kb. There are 32 distinct 
values for column A and 32 distinct values for column B and all these are 
combined together (column C will contain a random number for each row - it 
doesn't matter) producing a 32*32 elements data set. I've imported this into 
Spark and I ran a partitionBy("A", "B") in order to test its performance. This 
should create a nested directory structure with 32 folders, each of them 
containing another 32 folders. It uses about 10Gb of RAM and it's running slow. 
If I increase the number of entries in the table from 32*32 to 128*128, I get 
Java Heap Space Out Of Memory no matter what value I use for Heap Space 
variable. Is this a known bug? 

Scala code:
{code}
var df = sqlContext.read.format("com.databricks.spark.csv").option("header", 
"true").load("table.csv") 
df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet”)
{code}

How I ran the Spark shell:
{code}
bin/spark-shell --driver-memory 16g --master local[8] --packages 
com.databricks:spark-csv_2.10:1.0.3
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to