[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

Sean Owen (JIRA) Mon, 04 Jul 2016 02:08:01 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361051#comment-15361051
 ]


Sean Owen commented on SPARK-16361:
-----------------------------------

It sounds like you're out of memory, or haven't configured the memory that you 
think you have. You didn't show these settings. Do you have reason to believe 
40GB is enough? You also say nothing about your data size.
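
One back-of-the-envelope way to judge the data size in question: a full cube
over n fields has 2^n grouping sets, so the nine fields passed to addFields in
the reported code would give 512, and Spark's built-in cube expands each input
row once per grouping set before aggregating. The sketch below is plain Java
with no Spark dependency. The class and method names are hypothetical, the
1,000,000-row input is an assumption about what "1m data" means, and it assumes
the reporter's Cuber class builds a full cube, which the issue does not show.

```java
// Rough sizing sketch only; not taken from the issue and not Spark API.
final class CubeSizeEstimate {

    /** Number of grouping sets in a full cube over the given number of fields: 2^fields. */
    public static long groupingSets(int fields) {
        if (fields < 0 || fields > 62) {
            throw new IllegalArgumentException("fields must be between 0 and 62");
        }
        return 1L << fields;
    }

    /** Upper bound on rows produced before aggregation: one copy of each input row per grouping set. */
    public static long expandedRows(long inputRows, int fields) {
        return Math.multiplyExact(inputRows, groupingSets(fields));
    }

    public static void main(String[] args) {
        long inputRows = 1_000_000L; // assumption: "1m data" means about one million rows
        for (int fields = 6; fields <= 10; fields++) {
            System.out.printf("%2d fields -> %4d grouping sets, up to %,d expanded rows%n",
                    fields, groupingSets(fields), expandedRows(inputRows, fields));
        }
    }
}
```

The count doubles with every added field, so this is only an upper bound on
intermediate rows and says nothing about how many bytes each row takes on the
heap. That still has to be measured against the configured executor memory.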

> It takes a long time for gc when building cube with  many fields
> ----------------------------------------------------------------
>
>                 Key: SPARK-16361
>                 URL: https://issues.apache.org/jira/browse/SPARK-16361
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2
>            Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with 1m rows.
> I found that when I add too many fields (about 8 or more),
> the worker spends a lot of time in GC.
> I tried increasing the memory of each worker, but it did not help much,
> and I don't know why, sorry.
> Here is my simple code and the monitoring output.
> Cuber is a utility class for building the cube.
> {code:title=Bar.java|borderStyle=solid}
> sqlContext.udf().register("jidu", (Integer f) -> {
>         return (f - 1) / 3 + 1;
> } , DataTypes.IntegerType);
> DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>         "cast (CUST_AGE as double) as c_age",
>         "month(day) as month", "year(day) as year",
>         "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>         "jidu(month(day)) as jidu");
> Bucketizer b = new Bucketizer().setInputCol("c_age")
>         .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10,
>                 20, 30, 40, 50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
>         .setOutputCol("age");
> DataFrame cube = new Cuber(b.transform(d))
>         .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
>                 "month", "jidu", "year", "SUBTYPE")
>         .max("age").min("age").sum("zwsc").count().buildcube();
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> Metric                        Min                 25th percentile     Median              75th percentile     Max
> Duration                      2.6 min             2.7 min             2.7 min             2.7 min             2.7 min
> GC Time                       1.6 min             1.6 min             1.6 min             1.6 min             1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886    736.6 KB / 22258    738.7 KB / 22387    746.6 KB / 22542    748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



