[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361099#comment-15361099 ]
Sean Owen commented on SPARK-16361:
-----------------------------------
You haven't shown your memory settings. It's possible you're not even
configuring executor memory. In fact, it's likely, given you say that the
result isn't affected by increasing something to 40GB.
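
One way to check whether a memory setting actually reached a JVM is to print the heap limit that the JVM itself reports. The sketch below is only an illustration and is not taken from this issue: the class name HeapCheck and its methods are hypothetical, and it assumes the same call would be made from code running inside a task, since running it on the driver only reports the driver's heap. In Spark, executor heap is normally set through the spark.executor.memory property or the --executor-memory option of spark-submit.

```java
// Sketch: report the maximum heap the current JVM will attempt to use.
// HeapCheck is a hypothetical helper, not part of Spark.
class HeapCheck {
    /** Converts a byte count to gibibytes. */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    /** Maximum heap, in bytes, as reported by this JVM. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.printf("JVM max heap: %.2f GiB%n", toGiB(maxHeapBytes()));
    }
}
```

If the value printed from inside a task stays the same after the memory setting is raised, the setting being changed is probably not the one that controls executor heap.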
> It takes a long time for gc when building cube with many fields
> ----------------------------------------------------------------
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with 1m rows.
> I found that when I add too many fields (about 8 or more),
> the worker spends a lot of time on GC.
> I tried increasing the memory of each worker, but it did not help much,
> and I don't know why, sorry.
> Here is my simplified code and the monitoring output.
> Cuber is a utility class for building the cube.
> {code:title=Bar.java|borderStyle=solid}
> // Map a month (1-12) to its quarter (1-4).
> sqlContext.udf().register("jidu", (Integer f) -> {
>     return (f - 1) / 3 + 1;
> }, DataTypes.IntegerType);
> DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>     "cast (CUST_AGE as double) as c_age",
>     "month(day) as month", "year(day) as year",
>     "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>     "jidu(month(day)) as jidu");
> Bucketizer b = new Bucketizer().setInputCol("c_age")
>     .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10,
>         20, 30, 40, 50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
>     .setOutputCol("age");
> DataFrame cube = new Cuber(b.transform(d))
>     .addFields("day", "AREA_CODE", "CUST_TYPE",
>         "age", "zwsc", "month", "jidu", "year", "SUBTYPE")
>     .max("age").min("age").sum("zwsc").count().buildcube();
>
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> ||Metric||Min||25th percentile||Median||75th percentile||Max||
> |Duration|2.6 min|2.7 min|2.7 min|2.7 min|2.7 min|
> |GC Time|1.6 min|1.6 min|1.6 min|1.6 min|1.6 min|
> |Shuffle Read Size / Records|728.4 KB / 21886|736.6 KB / 22258|738.7 KB / 22387|746.6 KB / 22542|748.6 KB / 22783|
> |Shuffle Write Size / Records|74.3 MB / 1926282|75.8 MB / 1965860|76.2 MB / 1976004|76.4 MB / 1981516|77.9 MB / 2021142|
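
As a rough reading of the summary metrics above, the median values can be turned into two ratios: the share of task time spent in GC, and the number of records written per record read in this stage. The sketch below only does that arithmetic. The class name CubeStageMetrics and its methods are hypothetical, and treating the record growth as a result of the cube step is an assumption, since the Cuber implementation is not shown.

```java
// Sketch: derive two ratios from the median row of the task metrics.
// CubeStageMetrics is a hypothetical helper used only for this arithmetic.
class CubeStageMetrics {
    /** Fraction of a task's duration that was spent in garbage collection. */
    static double gcFraction(double gcMinutes, double durationMinutes) {
        return gcMinutes / durationMinutes;
    }

    /** Records written per record read within the stage. */
    static double expansionFactor(long recordsRead, long recordsWritten) {
        return (double) recordsWritten / recordsRead;
    }

    public static void main(String[] args) {
        // Median values copied from the summary metrics in the issue.
        double gc = gcFraction(1.6, 2.7);
        double growth = expansionFactor(22387L, 1976004L);
        System.out.printf("GC share of task time: %.0f%%%n", gc * 100);
        System.out.printf("Records written per record read: %.1f%n", growth);
    }
}
```

With the median values, GC accounts for roughly 59% of each task's duration, and each record read corresponds to about 88 records written.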