[ https://issues.apache.org/jira/browse/SPARK-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608791#comment-14608791 ]

Vlad Ionescu commented on SPARK-8597:
-------------------------------------

I ran some stress tests whose main purpose was to force ExternalAppendOnlyMap 
to spill data to disk. I tested with a table of 3 columns and 10,000,000 rows, 
using 8 threads, and it produced around 10 spills (calls to the flush() 
function) per thread. It performs extremely poorly: it takes over 10 minutes 
to write all the data and create all the partitions (for this test case, 
10,000 folders should be created). Do you think ExternalSorter would be a 
better idea than ExternalAppendOnlyMap? Thank you.
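
For reference, here is a sketch of the stress-test setup in spark-shell. The 
schema, key distribution, and output path are my assumptions; only the 
10,000,000 rows, 8 tasks, and ~10,000 output partitions come from the numbers 
above:

{code}
// Sketch only: schema and key distribution are assumptions. 100 distinct
// values of A crossed with 100 distinct values of B gives the ~10,000
// partitionBy output folders mentioned above.
import sqlContext.implicits._  // sqlContext is predefined in spark-shell

val df = sc.parallelize(1 to 10000000, 8).map { i =>
  (i % 100, (i / 100) % 100, scala.util.Random.nextDouble())
}.toDF("A", "B", "C")

// "stress.parquet" is an arbitrary output path for this sketch.
df.write.partitionBy("A", "B").mode("overwrite").parquet("stress.parquet")
{code}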

> DataFrame partitionBy memory pressure scales extremely poorly
> -------------------------------------------------------------
>
>                 Key: SPARK-8597
>                 URL: https://issues.apache.org/jira/browse/SPARK-8597
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Matt Cheah
>            Priority: Blocker
>         Attachments: table.csv
>
>
> I'm running into a strange memory scaling issue when using the partitionBy 
> feature of DataFrameWriter. 
> I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 
> entries, with a size on disk of about 20 KB. There are 32 distinct values 
> for column A and 32 distinct values for column B, and every value of A is 
> combined with every value of B (column C contains a random number for each 
> row - it doesn't matter), producing a 32*32-row data set. I imported this 
> into Spark and ran partitionBy("A", "B") to test its performance. This 
> should create a nested directory structure with 32 folders, each of them 
> containing another 32 folders. It uses about 10 GB of RAM and runs slowly. 
> If I increase the number of entries in the table from 32*32 to 128*128, I 
> get a Java heap space OutOfMemoryError no matter how large a heap I 
> configure.
> Scala code:
> {code}
> val df = sqlContext.read.format("com.databricks.spark.csv")
>   .option("header", "true")
>   .load("table.csv")
> df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet")
> {code}
> How I ran the Spark shell:
> {code}
> bin/spark-shell --driver-memory 16g --master local[8] \
>   --packages com.databricks:spark-csv_2.10:1.0.3
> {code}
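
For anyone who wants to reproduce this without the table.csv attachment, here 
is a minimal sketch that generates a table of the shape described in the 
report (the value range for column C is an arbitrary choice):

{code}
// Hypothetical generator for table.csv: every combination of 32 values of A
// and 32 values of B, plus a random C, i.e. 32 * 32 = 1,024 rows.
import java.io.PrintWriter
import scala.util.Random

val out = new PrintWriter("table.csv")
out.println("A,B,C")
for (a <- 0 until 32; b <- 0 until 32) {
  out.println(s"$a,$b,${Random.nextInt(1000)}")
}
out.close()
{code}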


