lichenglin created SPARK-16361:
----------------------------------

             Summary: It takes a long time for gc when building cube with many fields
                 Key: SPARK-16361
                 URL: https://issues.apache.org/jira/browse/SPARK-16361
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.6.2
            Reporter: lichenglin


I'm using Spark to build a cube on a DataFrame with 1m data.
I found that when I add too many fields (about 8 or more),
the workers spend a lot of time on GC.
I tried increasing the memory of each worker, but it did not help much.
I don't know why, sorry.
Here is my simple code and the monitoring output.
Cuber is a utility class for building the cube.

{code:title=Bar.java|borderStyle=solid}
// Register a UDF that maps a month (1-12) to its quarter (1-4).
sqlContext.udf().register("jidu", (Integer f) -> {
        return (f - 1) / 3 + 1;
}, DataTypes.IntegerType);

// Derive the extra columns used as cube fields and measures.
DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
        "cast (CUST_AGE as double) as c_age",
        "month(day) as month", "year(day) as year",
        "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
        "jidu(month(day)) as jidu");

// Bucket c_age into 10-year ranges, written to the "age" column.
Bucketizer b = new Bucketizer().setInputCol("c_age")
        .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10,
                20, 30, 40, 50, 60, 70, 80, 90, 100,
                Double.POSITIVE_INFINITY })
        .setOutputCol("age");

// Build the cube over 9 fields and save the result.
DataFrame cube = new Cuber(b.transform(d))
        .addFields("day", "AREA_CODE", "CUST_TYPE",
                "age", "zwsc", "month", "jidu", "year", "SUBTYPE")
        .max("age").min("age").sum("zwsc").count().buildcube();

cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
{code}
Summary Metrics for 12 Completed Tasks

Metric                        Min                 25th percentile     Median              75th percentile     Max
Duration                      2.6 min             2.7 min             2.7 min             2.7 min             2.7 min
GC Time                       1.6 min             1.6 min             1.6 min             1.6 min             1.6 min
Shuffle Read Size / Records   728.4 KB / 21886    736.6 KB / 22258    738.7 KB / 22387    746.6 KB / 22542    748.6 KB / 22783
Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142
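
For context, here is a rough back-of-the-envelope sketch of why the number of fields may matter so much. It assumes that Cuber behaves like a standard CUBE (one grouping set per subset of the fields) and that "1m" means one million rows; neither is confirmed here. Under those assumptions, the 9 fields passed to addFields give 2^9 = 512 grouping sets, so each input row may be expanded into up to 512 rows before aggregation. CubeSize and its methods are hypothetical names used only for this illustration; they are not part of Spark or of Cuber.

```java
// Hypothetical helper for illustration only; not part of Spark or Cuber.
final class CubeSize {

    // A CUBE over n fields aggregates over every subset of those fields,
    // which gives 2^n grouping sets.
    static long groupingSets(int fieldCount) {
        if (fieldCount < 0 || fieldCount > 62) {
            throw new IllegalArgumentException("fieldCount must be in [0, 62]");
        }
        return 1L << fieldCount;
    }

    // Upper bound on rows produced before aggregation, assuming every
    // input row is emitted once per grouping set.
    static long expandedRows(long inputRows, int fieldCount) {
        return Math.multiplyExact(inputRows, groupingSets(fieldCount));
    }

    public static void main(String[] args) {
        long inputRows = 1_000_000L; // assumption: "1m data" means one million rows
        for (int n = 6; n <= 10; n++) {
            System.out.printf("%d fields -> %d grouping sets, up to %d rows before aggregation%n",
                    n, groupingSets(n), expandedRows(inputRows, n));
        }
    }
}
```

If this estimate holds, each extra field doubles the number of intermediate rows, which might explain why GC time grows sharply at around 8 fields even when worker memory is increased.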




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
