[
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361051#comment-15361051
]
Sean Owen commented on SPARK-16361:
-----------------------------------
It sounds like you're out of memory, or haven't configured the memory that you
think you have. You didn't show these settings. Do you have reason to believe
40GB is enough? You also say nothing about your data size.
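
As a rough way to check what a JVM actually received, the sketch below prints its maximum heap. It is a minimal illustration, not something taken from the issue: the class name HeapCheck is hypothetical, and it only reports the heap of the JVM it runs in, so running it on the driver says nothing about the executors. In Spark the executor heap is governed by spark.executor.memory; in standalone mode SPARK_WORKER_MEMORY only caps what a worker may hand out to executors, so raising it alone may leave each executor's heap unchanged. Whether that applies here is an assumption, since the deployment mode was not given.

```java
public class HeapCheck {
    /** Maximum heap this JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Compare this figure with the memory you believe you configured.
        System.out.println("JVM max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the printed figure is far below the intended value, the setting did not reach that JVM.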
> It takes a long time for gc when building cube with many fields
> ----------------------------------------------------------------
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with 1m rows of data.
> I found that when I add too many fields (about 8 or more),
> the worker spends a lot of time on GC.
> I tried increasing the memory of each worker, but it did not help much.
> I don't know why, sorry.
> Here is my simplified code and the monitoring output.
> Cuber is a utility class for building the cube.
> {code:title=Bar.java|borderStyle=solid}
> // Register a UDF that maps a month (1-12) to its quarter (1-4).
> sqlContext.udf().register("jidu", (Integer f) -> {
>     return (f - 1) / 3 + 1;
> }, DataTypes.IntegerType);
> DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>         "cast (CUST_AGE as double) as c_age",
>         "month(day) as month", "year(day) as year",
>         "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>         "jidu(month(day)) as jidu");
> // Bucket c_age into 10-year bands from 0 to 100, with open-ended
> // buckets below 0 and above 100.
> Bucketizer b = new Bucketizer().setInputCol("c_age")
>         .setSplits(new double[] {
>             Double.NEGATIVE_INFINITY, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
>             Double.POSITIVE_INFINITY })
>         .setOutputCol("age");
> DataFrame cube = new Cuber(b.transform(d))
>         .addFields("day", "AREA_CODE", "CUST_TYPE",
>                 "age", "zwsc", "month", "jidu", "year", "SUBTYPE")
>         .max("age").min("age").sum("zwsc").count().buildcube();
>
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> Metric                        Min                25th percentile    Median             75th percentile    Max
> Duration                      2.6 min            2.7 min            2.7 min            2.7 min            2.7 min
> GC Time                       1.6 min            1.6 min            1.6 min            1.6 min            1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886   736.6 KB / 22258   738.7 KB / 22387   746.6 KB / 22542   748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282  75.8 MB / 1965860  76.2 MB / 1976004  76.4 MB / 1981516  77.9 MB / 2021142
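
For a sense of scale, the sketch below works out how quickly a cube grows with the number of fields. It rests on two assumptions that the issue does not confirm: that Cuber builds a full cube (every subset of the listed fields is a grouping set, as with DataFrame.cube), and that "1m" means one million input rows. The class and method names are hypothetical. Under those assumptions a cube over n fields has 2^n grouping sets, so the 9 fields passed to addFields give 512 of them, and one million input rows can expand to as many as 512 million rows before aggregation.

```java
public class CubeSize {
    /** Number of grouping sets in a full cube over the given number of fields: 2^n. */
    static long groupingSets(int fields) {
        if (fields < 0 || fields > 62) {
            throw new IllegalArgumentException("fields must be between 0 and 62");
        }
        return 1L << fields;
    }

    /** Upper bound on rows before aggregation: one copy of each input row per grouping set. */
    static long expandedRows(long inputRows, int fields) {
        return Math.multiplyExact(inputRows, groupingSets(fields));
    }

    public static void main(String[] args) {
        long inputRows = 1_000_000L; // assumed meaning of "1m data"
        for (int n = 6; n <= 9; n++) {
            System.out.println(n + " fields: " + groupingSets(n) + " grouping sets, up to "
                    + expandedRows(inputRows, n) + " rows before aggregation");
        }
    }
}
```

Each additional field doubles the upper bound, which would be consistent with GC pressure rising sharply at around 8 fields.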