Thanks very much for your reply.

For solution A, my concern is not data loading but querying, so I changed the
resource allocation and verified the number of containers running on YARN,
but the job did not get any better.

Solution B didn't work either.

So let me provide more details.
The job group consists of five jobs. The first four jobs take a small part of
the whole time; each has only one stage, with one or two tasks per stage. They
are all CarbonScan RDD stages.

The fifth job has four stages. The first two are CarbonScan RDD stages: the
first has two tasks and takes 2s, the second has 39 tasks and takes 1s. The
other two are aggregate stages; each has 216 tasks, one taking 3s and the
other 0.9s.

I checked the driver log and found that the time spent between parsing and the
statement you mentioned is less than 1 second.


------------------ Original ------------------
From:  "BabuLal"<babulaljangir...@gmail.com>;
Date:  Tue, Apr 3, 2018 02:25 AM
To:  "dev"<dev@carbondata.apache.org>;

Subject:  Re: Problem on carbondata quering performance tuning


Thanks for using Carbondata.

Based on the information you provided, please try the solutions/points below.

*A.  Tune Resource Allocation *

       You have 55 cores per NM and set spark.executor.cores=54, which means
each NM can host only one executor, so in total you get only 4 executors even
though you set spark.executor.instances=10.
For query execution we need more executors.
Cluster capacity:
Total NM = 4

Ideally (in most cases) 12-15 GB of memory per executor is enough. Based on
this we can run 6 executors on one NM (102/15), so you can configure the
parameters below and try again:

spark.executor.memory 15g
spark.executor.cores 9
spark.executor.instances 24 

Please make sure that the Yarn RM shows these 24 containers running (excluding
the AM container).
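The sizing arithmetic above can be double-checked with a quick sketch (the 4
NMs, 102 GB and 55 cores per NM, and the 12-15 GB target are from this thread;
the rest is plain integer arithmetic):

```python
# Sanity-check the suggested executor sizing.
# Cluster figures from the thread: 4 NodeManagers, 102 GB / 55 cores each.
nm_count = 4
nm_memory_gb = 102
nm_cores = 55

executor_memory_gb = 15                                # 12-15 GB per executor is usually enough
executors_per_nm = nm_memory_gb // executor_memory_gb  # 102 // 15 = 6
total_executors = nm_count * executors_per_nm          # 4 * 6 = 24
cores_per_executor = nm_cores // executors_per_nm      # 55 // 6 = 9

print(f"spark.executor.memory    {executor_memory_gb}g")
print(f"spark.executor.cores     {cores_per_executor}")
print(f"spark.executor.instances {total_executors}")
```

This reproduces the spark.executor.memory 15g / cores 9 / instances 24
settings recommended above.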

*B. Table Optimization *
1. Out of the 5 tables, one table, yuan_yuan10_STORE_SALES, is a big table
with ~1.4 billion records. Are its columns high-cardinality columns? For
high-cardinality columns it is better to use DICTIONARY_EXCLUDE. You can
check the size of the Metadata folder in the carbon store location.

2. ss_sold_date_sk has a BETWEEN filter, so it is better to use an Int data
type for this column.
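For point 1, DICTIONARY_EXCLUDE is set in TBLPROPERTIES at table-creation
time. A minimal sketch, assembled here as a string for illustration - the
column list and the excluded columns are hypothetical, not taken from your
schema; only the DICTIONARY_EXCLUDE property is the relevant part:

```python
# Hypothetical CarbonData DDL: exclude assumed high-cardinality columns from
# dictionary encoding. Column names are illustrative, not from the thread.
high_card_cols = ["ss_ticket_number", "ss_item_sk"]  # assumed high-cardinality

ddl = (
    "CREATE TABLE yuan_yuan10_STORE_SALES_tuned (\n"
    "  ss_sold_date_sk INT,\n"
    "  ss_item_sk BIGINT,\n"
    "  ss_ticket_number BIGINT\n"
    ")\n"
    "STORED BY 'carbondata'\n"
    f"TBLPROPERTIES ('DICTIONARY_EXCLUDE'='{','.join(high_card_cols)}')"
)
print(ddl)
```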

*C.  Information For Next Analysis *

Please provide the details below:
1. Can you check the Spark UI to see how much time the CarbonScan RDD stage
took and how much time the aggregate stage took? You can check the DAG, or
send the Spark event files or Spark UI snapshots.
2. How many tasks are there in each stage?
3. In the driver, how much time is spent between parsing and the statement below?
  18/04/01 20:49:01 INFO CarbonScanRDD: 
 Identified no.of.blocks: 1,
 no.of.tasks: 1,
 no.of.nodes: 0,
 parallelism: 1
4. Configure enable.query.statistics=true in carbon.properties and
send/analyze the time taken by Carbon on the executor side (e.g., time spent
in IO / dictionary load).

For data loading: if the data is loaded with local sort, then your
configuration is correct (1 node, 1 executor).

Please try solution A first; it may solve the issue. If it still persists,
provide the information requested in point C.

