[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361112#comment-15361112 ]

lichenglin commented on SPARK-16361:
------------------------------------

Here is my whole configuration:
{code}
spark.local.dir                  /home/sparktmp
#spark.executor.cores            4
spark.sql.parquet.cacheMetadata  false
spark.port.maxRetries            5000
spark.kryoserializer.buffer.max  1024M
spark.kryoserializer.buffer      5M
spark.master                     spark://agent170:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://agent170:9000/sparklog
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.memory            4g
spark.driver.memory              2g
#spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.extraClassPath=/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/mysql-connector-java-5.1.34.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/oracle-driver.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-4.6.0-HBase-1.1-client-whithoutlib-thriserver-fastxml.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-spark-4.6.0-HBase-1.1.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/spark-csv_2.10-1.3.0.jar
spark.driver.extraClassPath=/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/mysql-connector-java-5.1.34.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/oracle-driver.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-4.6.0-HBase-1.1-client-whithoutlib-thriserver-fastxml.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-spark-4.6.0-HBase-1.1.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/spark-csv_2.10-1.3.0.jar
{code}
{code}
export SPARK_WORKER_MEMORY=50g
export SPARK_MASTER_OPTS=-Xmx4096m
export HADOOP_CONF_DIR=/home/hadoop/hadoop-2.6.0
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://agent170:9000/sparklog
{code}
Here is my command:
{code}
/home/hadoop/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --executor-memory 40g --executor-cores 12 --class com.bjdv.spark.job.cube.CubeDemo /home/hadoop/lib/licl/sparkjob.jar 2016-07-01
{code}



> It takes a long time for gc when building cube with  many fields
> ----------------------------------------------------------------
>
>                 Key: SPARK-16361
>                 URL: https://issues.apache.org/jira/browse/SPARK-16361
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2
>            Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with about 1M rows.
> I found that when I add too many fields (about 8 or more),
> the workers spend a lot of time in GC.
> I tried increasing the memory of each worker, but it did not help much.
> I don't know why; sorry.
> Here is my sample code and the monitoring output.
> Cuber is a utility class for building the cube.
> {code:title=Bar.java|borderStyle=solid}
> sqlContext.udf().register("jidu", (Integer f) -> {
>     return (f - 1) / 3 + 1;
> }, DataTypes.IntegerType);
> DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>     "cast (CUST_AGE as double) as c_age",
>     "month(day) as month", "year(day) as year",
>     "cast ((datediff(now(),INTIME)/365+1) as int) as zwsc",
>     "jidu(month(day)) as jidu");
> Bucketizer b = new Bucketizer().setInputCol("c_age")
>     .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10, 20, 30, 40,
>         50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
>     .setOutputCol("age");
> DataFrame cube = new Cuber(b.transform(d))
>     .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
>         "month", "jidu", "year", "SUBTYPE")
>     .max("age").min("age").sum("zwsc").count().buildcube();
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks:
> ||Metric||Min||25th percentile||Median||75th percentile||Max||
> |Duration|2.6 min|2.7 min|2.7 min|2.7 min|2.7 min|
> |GC Time|1.6 min|1.6 min|1.6 min|1.6 min|1.6 min|
> |Shuffle Read Size / Records|728.4 KB / 21886|736.6 KB / 22258|738.7 KB / 22387|746.6 KB / 22542|748.6 KB / 22783|
> |Shuffle Write Size / Records|74.3 MB / 1926282|75.8 MB / 1965860|76.2 MB / 1976004|76.4 MB / 1981516|77.9 MB / 2021142|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
