[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

Sean Owen (JIRA) Mon, 04 Jul 2016 02:45:48 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361099#comment-15361099
 ]


Sean Owen commented on SPARK-16361:
-----------------------------------

You haven't shown your memory settings. It's possible you're not even 
configuring executor memory. In fact, it's likely, given that you say the 
result isn't affected by increasing something to 40GB.
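
As a quick check of whether a memory setting actually reached the JVM, the 
sketch below prints the maximum heap of the process it runs in. In Spark, 
executor heap size is controlled by spark.executor.memory (or 
--executor-memory on spark-submit). The class and method names here are 
hypothetical and are not part of Spark or of the reporter's code.

```java
public class HeapCheck {
    // Maximum heap the current JVM will attempt to use, in MiB.
    // This reflects the -Xmx value the process was actually started with.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("JVM max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the same call, made from inside a task, reports a value close to the 
default rather than the size you intended, the setting is probably not being 
applied to the executors.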

> It takes a long time for gc when building cube with many fields
> ---------------------------------------------------------------
>
>                 Key: SPARK-16361
>                 URL: https://issues.apache.org/jira/browse/SPARK-16361
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2
>            Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with 1m rows.
> I found that when I add too many fields (about 8 or more),
> the workers spend a lot of time in GC.
> I tried increasing the memory of each worker, but it did not help,
> and I don't know why, sorry.
> Here is my simple code and the monitoring output.
> Cuber is a utility class for building the cube.
> {code:title=Bar.java|borderStyle=solid}
>     // Maps a month (1-12) to its quarter (1-4).
>     sqlContext.udf().register("jidu", (Integer f) -> {
>         return (f - 1) / 3 + 1;
>     }, DataTypes.IntegerType);
>     DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>             "cast (CUST_AGE as double) as c_age",
>             "month(day) as month", "year(day) as year",
>             "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>             "jidu(month(day)) as jidu");
>     // Bucket c_age into ten-year bands and store the band index in "age".
>     Bucketizer b = new Bucketizer().setInputCol("c_age")
>             .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10,
>                     20, 30, 40, 50, 60, 70, 80, 90, 100,
>                     Double.POSITIVE_INFINITY })
>             .setOutputCol("age");
>     DataFrame cube = new Cuber(b.transform(d))
>             .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
>                     "month", "jidu", "year", "SUBTYPE")
>             .max("age").min("age").sum("zwsc").count().buildcube();
>     cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> Metric                        Min                 25th percentile     Median              75th percentile     Max
> Duration                      2.6 min             2.7 min             2.7 min             2.7 min             2.7 min
> GC Time                       1.6 min             1.6 min             1.6 min             1.6 min             1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886    736.6 KB / 22258    738.7 KB / 22387    746.6 KB / 22542    748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142
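
For a sense of why adding fields is costly: a full cube over n fields has 2^n 
grouping sets, so each extra field doubles the number of groupings. The sketch 
below only does that arithmetic. It assumes, as an upper bound, that every 
input row is emitted once per grouping set before aggregation. How the 
reporter's Cuber class actually builds the cube is not shown in the issue, and 
the class and method names below are hypothetical.

```java
public class CubeSize {
    // Number of grouping sets in a full cube over `fields` columns:
    // one per subset of the columns, i.e. 2^fields.
    static long groupingSets(int fields) {
        return 1L << fields;
    }

    // Upper bound on intermediate rows, assuming each input row is
    // emitted once per grouping set before aggregation (an assumption).
    static long expandedRows(long inputRows, int fields) {
        return inputRows * groupingSets(fields);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // "1m" from the issue description
        for (int n = 7; n <= 9; n++) {
            System.out.println(n + " fields: " + groupingSets(n)
                    + " grouping sets, up to " + expandedRows(rows, n)
                    + " intermediate rows");
        }
    }
}
```

Under that assumption, the nine fields passed to addFields give 512 grouping 
sets, which would account for a large number of short-lived objects 
regardless of how much heap is available.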



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

