Thanks very much for your reply.
For solution A, I don't care about data loading, only querying, so I changed the
resource allocation and confirmed the number of containers running on
YARN, but I don't see any improvement in the job.
For solution B, it doesn't work either.
So here are more details.
The job group consists of five jobs. The first four jobs take a small part of the
whole time; each has only one stage, with one or two tasks per stage, and all of
them are CarbonScan RDD stages.
The fifth job has four stages. The first two are CarbonScan RDD stages: the
first stage has two tasks and takes 2s, the second stage has 39 tasks and takes
1s. The other two are aggregate stages; each has 216 tasks, one taking 3s and
the other 0.9s.
I checked the driver log and found that the time spent between parsing and the
statement you gave is less than 1 second.
------------------ Original ------------------
Date: Tue, Apr 3, 2018 02:25 AM
Subject: Re: Problem on carbondata quering performance tuning
Thanks for using CarbonData.
Based on the information you provided, please try the solutions/points below.
*A. Tune Resource Allocation*
You have 55 cores/NM and have given spark.executor.cores=54, which means
one NM will have only one executor, so in total you will have only 4 executors
even though you have given spark.executor.instances=10.
For query execution we need to have more executors.
Cluster Capacity :-
Ideally (in most cases) 12-15 GB of memory per executor is enough. Based on
this we can open 6 executors on one NM (102/15), so you can configure the
parameters accordingly and try again.
Please make sure that the YARN RM shows these 24 containers running (excluding the AM).
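To make the sizing above concrete, here is a minimal sketch of the corresponding spark-submit flags, assuming 4 NodeManagers with ~102 GB and 55 cores each and that the job is launched via spark-submit (your launcher and exact values may differ):

```shell
# Sketch only: 102 GB / 15 GB ~= 6 executors per NM -> 24 executors on 4 NMs,
# and 55 cores / 6 executors ~= 9 cores per executor.
spark-submit \
  --conf spark.executor.instances=24 \
  --conf spark.executor.cores=9 \
  --conf spark.executor.memory=15g \
  ...
```

With these values each NM hosts several smaller executors instead of one huge one, which gives the query more parallel containers.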
*B. Table Optimization*
1. Out of the 5 tables, one table, yuan_yuan10_STORE_SALES, is a big table with
~1.4 billion records, and it has the columns
SS_SOLD_DATE_SK, SS_ITEM_SK, SS_CUSTOMER_SK as DICTIONARY_INCLUDE. Are any of
these high-cardinality columns? For high-cardinality columns it is better
to use DICTIONARY_EXCLUDE; you can check the size of the Metadata folder in the carbon store
2. ss_sold_date_sk has a BETWEEN filter, so it is better to use an Int data type for
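As an illustration of point B.1, here is a hedged DDL sketch assuming older CarbonData 1.x syntax, with the column list abbreviated and assuming SS_CUSTOMER_SK turns out to be the high-cardinality column (verify via the Metadata folder size first):

```sql
-- Sketch: keep the low-cardinality keys in DICTIONARY_INCLUDE and drop the
-- suspected high-cardinality column so it is stored without a dictionary.
-- Table name and column choice here are illustrative only.
CREATE TABLE yuan_yuan10_store_sales_v2 (
  ss_sold_date_sk INT,
  ss_item_sk INT,
  ss_customer_sk INT
  -- ... remaining columns ...
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_INCLUDE'='ss_sold_date_sk,ss_item_sk');
```

Dictionary encoding pays off only when a column has few distinct values; for a near-unique surrogate key the dictionary grows huge and slows lookups, which is why the Metadata folder size is a useful signal.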
*C. Information For Next Analysis*
Please provide the details below:
1. Can you check the Spark UI to see how much time the CarbonScan RDD stage and
the Aggregate stage each took? You can check the DAG, or send the
Spark event files or a Spark UI snapshot.
2. How many tasks are there for each stage?
3. In the driver, how much time is spent between parsing and the statement below?
18/04/01 20:49:01 INFO CarbonScanRDD:
Identified no.of.blocks: 1,
4. Configure enable.query.statistics=true in carbon.properties and
send/analyze the time taken by Carbon on the executor side (for example, time
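For point 4, the property goes in carbon.properties; a minimal fragment is shown below (the file's location depends on your deployment, and it must be visible to both driver and executors):

```properties
# carbon.properties -- enable per-query statistics logging
enable.query.statistics=true
```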
For data loading :- if data is loaded with local sort, then your
configuration is correct (1 node, 1 executor).
Please check with Solution A; it may solve the issue. If it still exists, then
provide the information requested in Point C.