Hi An Lan,

For detailed logging you need to add a log4j configuration in spark-defaults.conf for both the driver and the executors:

* spark.driver.extraJavaOptions = -Dlog4j.configuration=file:/<filepath>
* spark.executor.extraJavaOptions = -Dlog4j.configuration=file:/<filepath>

Please also make sure *log4j.rootCategory* is set to *INFO* in *log4j.properties*.
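For reference, a minimal sketch of the two files together (the conf path and the console appender layout are only examples, not from this thread):

    # spark-defaults.conf -- point driver and executor JVMs at the same file
    spark.driver.extraJavaOptions    -Dlog4j.configuration=file:/etc/spark/conf/log4j.properties
    spark.executor.extraJavaOptions  -Dlog4j.configuration=file:/etc/spark/conf/log4j.properties

    # log4j.properties -- INFO level so the block distribution messages appear
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n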
-Regards
Kumar Vishal

On Wed, Nov 16, 2016 at 2:06 PM, An Lan <lanan...@gmail.com> wrote:

> Hi Kumar Vishal,
>
> Thanks for your suggestion.
> The driver log does not contain the block distribution log by default. How can I enable it?
> And how is the order of the dimensions decided?
>
> 2016-11-16 15:14 GMT+08:00 Kumar Vishal <kumarvishal1...@gmail.com>:
>
> > Hi An Lan,
> >
> > The data is already distributed. In this case one blocklet may be returning more rows and another fewer, which is why some tasks take more time.
> >
> > The block distribution log is not present in the driver log, so it is not clear whether it is going for block distribution or blocklet distribution. Can you please attach the detailed driver log?
> >
> > *Some suggestions:*
> >
> > 1. If a column is a String and you are not applying a filter on it, add it to the dictionary exclude list in the create statement; this avoids the dictionary lookup. If the number of String columns is small, it is better to add them all to the dictionary exclude list. (See the sketch after this list.)
> >
> > 2. If a column is numeric and you are not applying a filter on it, there is no need to add it to the dictionary include or exclude list; it will then be a measure column.
> >
> > 3. Currently in carbon, creating a measure column with the double data type gives better compression, as value compression mode is currently supported only for the double data type.
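> > A minimal DDL sketch of suggestions 1 and 2 (the table name Sample1 and the columns s1, s2, m1, m2 are hypothetical; d stands in for a filtered String dimension):
> >
> >     CREATE TABLE IF NOT EXISTS Sample1 (
> >       d String,    -- filtered String column: leave it as a default dictionary dimension
> >       s1 String,   -- unfiltered String columns: exclude from the dictionary
> >       s2 String,
> >       m1 Int,      -- unfiltered numerics: in neither list, so they stay measures
> >       m2 Double    -- double currently compresses best for measures
> >     )
> >     STORED BY 'org.apache.carbondata.format'
> >     TBLPROPERTIES ("DICTIONARY_EXCLUDE"="s1,s2")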
> > -Regards
> > Kumar Vishal
> >
> > On Wed, Nov 16, 2016 at 8:12 AM, An Lan <lanan...@gmail.com> wrote:
> >
> > > Hi Kumar Vishal,
> > >
> > > 1. I found that the number of rows left after the inverted-index filter is not uniform across tasks, and the difference is large. Some tasks may have 3k~4k rows after filtering, while the longer tasks may have 30k~40k. When most of the longer tasks land on the same node, that node's time cost is much higher than the others'. So, is there any way to balance the data so the rows distribute uniformly between tasks? The blocklet size is still 120k rows, with 3k+ blocklets in total and 6 blocklets per task. (See the note after this list.)
> > >
> > > 2. I uploaded some logs last time. Is there an access problem with them? Is there any suggestion or question about them?
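> > > (For reference, the 120k figure is the blocklet row-count knob in carbon.properties; this sketch assumes the 0.x-era property name carbon.blocklet.size with its 120,000-row default, which should be verified against your build:
> > >
> > >     # carbon.properties -- rows per blocklet; fewer rows per blocklet
> > >     # gives the scheduler smaller units to balance across tasks
> > >     carbon.blocklet.size=120000
> > > )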
> > >
> > > 2016-11-14 11:29 GMT+08:00 An Lan <lanan...@gmail.com>:
> > >
> > > > Hi Kumar Vishal,
> > > >
> > > > The driver log and some executor logs are attached. The same query was run five times.
> > > >
> > > > Time consumed by each query: 67068ms, 45758ms, 26497ms, 22619ms, 21504ms
> > > >
> > > > First stage of each query: first query -> starts from stage 0, second query -> 4, third query -> 8, fourth query -> 12, fifth query -> 16. See "first stage of queries.png" in the attachments.
> > > >
> > > > Longest task of each query (task-to-log mapping): stage 0, task 249 -> 1.log; stage 4, task 420 -> 2.log; stage 8, task 195 -> 3.log; stage 12, task 350 -> 4.log; stage 16, task 321 -> 5.log. See "longest task.png" in the attachments.
> > > >
> > > > longest tasks.png <https://drive.google.com/file/d/0B1XM6KeI1nB7dGY4cF9jRDk1QTg/view?usp=drive_web>
> > > > first stage of queries.png <https://drive.google.com/file/d/0B1XM6KeI1nB7SVZsNnR2VEw2X0k/view?usp=drive_web>
> > > > dirverlog <https://drive.google.com/file/d/0B1XM6KeI1nB7VllnVkNnenhyTTA/view?usp=drive_web>
> > > > 1.log <https://drive.google.com/file/d/0B1XM6KeI1nB7eUdUWmlia1J4bk0/view?usp=drive_web>
> > > > 2.log <https://drive.google.com/file/d/0B1XM6KeI1nB7ZUlGZHZVQ3phdmM/view?usp=drive_web>
> > > > 3.log <https://drive.google.com/file/d/0B1XM6KeI1nB7UHpBQzREX3N5aEk/view?usp=drive_web>
> > > > 4.log <https://drive.google.com/file/d/0B1XM6KeI1nB7NVctMmYwNldCVEk/view?usp=drive_web>
> > > > 5.log <https://drive.google.com/file/d/0B1XM6KeI1nB7ODhHNE5sSGNfOVE/view?usp=drive_web>
> > > >
> > > > 2016-11-14 11:22 GMT+08:00 An Lan <lanan...@gmail.com>:
> > > >
> > > > > Hi,
> > > > >
> > > > > The driver log and some executor logs are attached. The same query was run five times.
> > > > >
> > > > > Time consumed by each query: 67068ms, 45758ms, 26497ms, 22619ms, 21504ms
> > > > >
> > > > > First stage of each query: first query -> starts from stage 0, second query -> 4, third query -> 8, fourth query -> 12, fifth query -> 16.
> > > > >
> > > > > Longest task of each query (task-to-log mapping): stage 0, task 249 -> 1.log; stage 4, task 420 -> 2.log; stage 8, task 195 -> 3.log; stage 12, task 350 -> 4.log; stage 16, task 321 -> 5.log.
> > > > >
> > > > > 2016-11-11 20:25 GMT+08:00 Kumar Vishal <kumarvishal1...@gmail.com>:
> > > > >
> > > > > > What is the time difference between the first and second run of the query? The second time it reads from the OS cache, so I think there won't be any IO bottleneck.
> > > > > >
> > > > > > Can you provide the driver log and the executor logs for the tasks that take more time?
> > > > > >
> > > > > > -Regards
> > > > > > Kumar Vishal
> > > > > >
> > > > > > On Fri, Nov 11, 2016 at 3:21 PM, An Lan <lanan...@gmail.com> wrote:
> > > > > >
> > > > > > > 1. I have set --num-executors=100 in all tests. The image says 100 nodes, meaning one executor per node; when there are only 64 nodes, there may be two or three executors on one node. The total executor count is always 100.
> > > > > > >
> > > > > > > 2. I did not find the property enable.blocklet.distribution in 0.1.1; I found it in 0.2.0. I am testing on 0.1.1 and will try it later. If your concern is the locality level, the longest query does not seem to be caused by an IO problem. The statistics for the long task look like this:
> > > > > > >
> > > > > > >     +--------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+
> > > > > > >     |             task_id|load_blocks_time|load_dictionary_time|scan_blocks_time|scan_blocks_num|result_size|total_executor_time|
> > > > > > >     +--------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+
> > > > > > >     |4199754456147478_134|               1|                  32|            3306|              3|      42104|              13029|
> > > > > > >     +--------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+
> > > > > > >
> > > > > > > I have suffered from IO before, but after configuring speculation, most tasks with long IO are re-run.
> > > > > > >
> > > > > > > 3. Distinct values: g: 11, h: 3, f: 7, e: 3, d: 281
> > > > > > >
> > > > > > > 4. In the query, only e, f, g, h, d and A are used in filters; the others are not. So I think the others, used only in aggregation, do not need an index.
> > > > > > >
> > > > > > > 5. I already put e, f, g, h, d leftmost, right after "create table...".
> > > > > > >
> > > > > > > 2016-11-11 15:08 GMT+08:00 Kumar Vishal <kumarvishal1...@gmail.com>:
> > > > > > >
> > > > > > > > Hi An Lan,
> > > > > > > >
> > > > > > > > Please confirm the things below.
> > > > > > > >
> > > > > > > > 1. Is dynamic executor allocation enabled? If it is, can you disable it and try again (this is to check whether dynamic executors have any impact)? To disable it, set --num-executors=100 in spark-defaults.conf. (See the sketch after this list.)
> > > > > > > >
> > > > > > > > 2. Can you please set the property enable.blocklet.distribution=false and execute the query?
> > > > > > > >
> > > > > > > > 3. What is the cardinality of each column?
> > > > > > > >
> > > > > > > > 4. Any reason why you are setting "NO_INVERTED_INDEX"="a"?
> > > > > > > >
> > > > > > > > 5. Can you keep all the columns present in the filter on the left side? Fewer blocks will then be identified during block pruning, which will improve query performance.
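> > > > > > > > A sketch of the settings from points 1 and 2 (placing enable.blocklet.distribution in carbon.properties is an assumption; the Spark keys are standard):
> > > > > > > >
> > > > > > > >     # spark-defaults.conf -- a fixed executor count instead of dynamic allocation
> > > > > > > >     spark.dynamicAllocation.enabled  false
> > > > > > > >     spark.executor.instances         100
> > > > > > > >
> > > > > > > >     # carbon.properties
> > > > > > > >     enable.blocklet.distribution=false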
> > > > > > > >
> > > > > > > > -Regards
> > > > > > > > Kumar Vishal
> > > > > > > >
> > > > > > > > On Fri, Nov 11, 2016 at 11:59 AM, An Lan <lanan...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Kumar Vishal,
> > > > > > > > >
> > > > > > > > > 1. Create table DDL:
> > > > > > > > >
> > > > > > > > >     CREATE TABLE IF NOT EXISTS Table1 (
> > > > > > > > >       h Int, g Int, d String, f Int, e Int,
> > > > > > > > >       a Int, b Int, ...(nearly 300 more columns)
> > > > > > > > >     )
> > > > > > > > >     STORED BY 'org.apache.carbondata.format'
> > > > > > > > >     TBLPROPERTIES(
> > > > > > > > >       "NO_INVERTED_INDEX"="a",
> > > > > > > > >       "NO_INVERTED_INDEX"="b",
> > > > > > > > >       ...(nearly 300 more columns)
> > > > > > > > >       "DICTIONARY_INCLUDE"="a",
> > > > > > > > >       "DICTIONARY_INCLUDE"="b",
> > > > > > > > >       ...(nearly 300 more columns)
> > > > > > > > >     )
> > > > > > > > >
> > > > > > > > > 2./3. There are more than a hundred nodes in the cluster, but the cluster is shared with other applications. When enough nodes are free, we get 100 distinct nodes.
> > > > > > > > >
> > > > > > > > > 4. A statistic of task times during one query, with distinct nodes marked, is below:
> > > > > > > > >
> > > > > > > > > [inline image 1]
> > > > > > > > >
> > > > > > > > > 2016-11-10 23:52 GMT+08:00 Kumar Vishal <kumarvishal1...@gmail.com>:
> > > > > > > > >
> > > > > > > > > > Hi Anning Luo,
> > > > > > > > > >
> > > > > > > > > > Can you please provide the details below?
> > > > > > > > > >
> > > > > > > > > > 1. Create table DDL.
> > > > > > > > > > 2. Number of nodes in your cluster setup.
> > > > > > > > > > 3. Number of executors per node.
> > > > > > > > > > 4. Query statistics.
> > > > > > > > > >
> > > > > > > > > > Please find my comments in bold.
> > > > > > > > > >
> > > > > > > > > > Problem:
> > > > > > > > > > 1. GC problem. We suffer 20%~30% GC time for some tasks in the first stage after a lot of parameter tuning. We now use G1 GC on Java 8; GC time doubles with CMS. The main GC time is spent on young-generation GC, and almost half of the young generation is copied to the old generation. It seems many objects live longer than a GC period and their space is not reused (the concurrent GC releases it later). When we use a large Eden (>=1G, for example), a single GC takes seconds. With a small Eden (256M, for example), a single GC takes hundreds of milliseconds but runs more frequently, and the total is still seconds. Is there any way to lessen the GC time? (We don't consider the first and second queries in this case.)
> > > > > > > > > >
> > > > > > > > > > *How many nodes are present in your cluster setup? If nodes are few, please reduce the number of executors per node.*
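> > > > > > > > > > (The Eden experiments above map onto the executor JVM options; an illustrative sketch only -- the pause target and sizes are placeholders, and pinning the young generation with -Xmn disables G1's adaptive sizing, which is exactly the trade-off being discussed:
> > > > > > > > > >
> > > > > > > > > >     spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xmn256m -verbose:gc -XX:+PrintGCDetails
> > > > > > > > > > )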
> > > > > > > > > >
> > > > > > > > > > 2. Performance tuning problem. The row count after filtering is not uniform; some nodes are heavier and spend more time than the others. Task times range from 4s to 16s. Is there any method to even this out?
> > > > > > > > > >
> > > > > > > > > > 3. Too much time for the first and second query. I know the dictionary and some indexes need to be loaded the first time, but even after trying the query below to preheat them, the first queries still take a long time. How can I preheat correctly?
> > > > > > > > > >
> > > > > > > > > >     select Aarray, a, b, c… from Table1 where Aarray is not null and d = "sss" and e != 22 and f = 33 and g = 44 and h = 55
> > > > > > > > > >
> > > > > > > > > > *Currently we are working on first-time query improvement. For now you can run select count(*) or count(column), so all the blocks get loaded, and then you can run the actual query.*
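> > > > > > > > > > (A minimal warm-up along those lines, run once per table after startup:
> > > > > > > > > >
> > > > > > > > > >     -- touches every block so dictionaries and indexes get cached
> > > > > > > > > >     SELECT COUNT(*) FROM Table1;
> > > > > > > > > > )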
> > > > > > > > > >
> > > > > > > > > > 4. Any other suggestions to lessen the query time?
> > > > > > > > > >
> > > > > > > > > > Some suggestions: the log from class QueryStatisticsRecorder gives me a good means of finding the bottleneck, but it is not enough. There are still some metrics I think would be very useful:
> > > > > > > > > >
> > > > > > > > > > 1. Filter ratio, i.e. not only result_size but also the original size, so we know how much data was filtered out.
> > > > > > > > > >
> > > > > > > > > > 2. IO time. scan_blocks_time is not enough. If it is high, we know something is wrong, but not what caused the problem; the real IO time for the data is not provided. As there may be several files for one partition, knowing whether the slowness comes from the datanode or from the executor itself would give us intuition for locating the problem.
> > > > > > > > > >
> > > > > > > > > > 3. The TableBlockInfo for a task. I log it myself when debugging; it tells me how many blocklets are local. The Spark web monitor only gives a locality level, while perhaps only one blocklet is actually local.
> > > > > > > > > >
> > > > > > > > > > -Regards
> > > > > > > > > > Kumar Vishal
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 10, 2016 at 8:55 PM, An Lan <lanan...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > We are using carbondata to build our tables and run queries through CarbonContext. We have some performance problems while tuning the system.
> > > > > > > > > > >
> > > > > > > > > > > *Background*:
> > > > > > > > > > >
> > > > > > > > > > > *cluster*: 100 executors, 5 tasks/executor, 10G memory/executor
> > > > > > > > > > >
> > > > > > > > > > > *data*: 60+GB (per replica) in carbon data format, 600+MB/file * 100 files, 300+ columns, 300+ million rows
> > > > > > > > > > >
> > > > > > > > > > > *sql example:*
> > > > > > > > > > >
> > > > > > > > > > >     select A,
> > > > > > > > > > >            sum(a),
> > > > > > > > > > >            sum(b),
> > > > > > > > > > >            sum(c),
> > > > > > > > > > >            …(extra 100 aggregations like sum(column))
> > > > > > > > > > >     from Table1 LATERAL VIEW explode(split(Aarray, ';')) ATable AS A
> > > > > > > > > > >     where A is not null and d > "ab:c-10" and d < "h:0f3s" and e != 10 and f = 22 and g = 33 and h = 44
> > > > > > > > > > >     GROUP BY A
> > > > > > > > > > >
> > > > > > > > > > > *target query time*: <10s
> > > > > > > > > > >
> > > > > > > > > > > *current query time*: 15s ~ 25s
> > > > > > > > > > >
> > > > > > > > > > > *scene:* OLAP system, <100 queries per day. Concurrency is not high, and the CPU is idle most of the time, so this service will run alongside other programs. The service runs for a long time, so we cannot occupy very large memory for every executor.
> > > > > > > > > > >
> > > > > > > > > > > *refine*: I have built an index and dictionary on d, e, f, g, h, built dictionaries on all the other aggregation columns (i.e. a, b, c, … 100+ columns), and made sure there is one segment for the total data. I have enabled speculation (quantile=0.5, interval=250, multiplier=1.2); a sketch of those settings follows.
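> > > > > > > > > > > (Those speculation values expressed as the standard Spark properties -- a sketch, assuming they were set in spark-defaults.conf and that the interval is in milliseconds, as in Spark 1.x:
> > > > > > > > > > >
> > > > > > > > > > >     spark.speculation             true
> > > > > > > > > > >     spark.speculation.quantile    0.5
> > > > > > > > > > >     spark.speculation.interval    250
> > > > > > > > > > >     spark.speculation.multiplier  1.2
> > > > > > > > > > > )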
> > > > > > > > > > >
> > > > > > > > > > > Time is mainly spent on the first stage, before shuffling. As 95% of the data is filtered out, the shuffle itself takes little time. In the first stage most tasks complete in less than 10s, but there are still nearly 50 tasks longer than 10s, and the longest task in one query may take 12~16s.
> > > > > > > > > > >
> > > > > > > > > > > *Problem:*
> > > > > > > > > > >
> > > > > > > > > > > 1. GC problem. We suffer 20%~30% GC time for some tasks in the first stage after a lot of parameter tuning. We now use G1 GC on Java 8; GC time doubles with CMS. The main GC time is spent on young-generation GC, and almost half of the young generation is copied to the old generation. It seems many objects live longer than a GC period and their space is not reused (the concurrent GC releases it later). When we use a large Eden (>=1G, for example), a single GC takes seconds. With a small Eden (256M, for example), a single GC takes hundreds of milliseconds but runs more frequently, and the total is still seconds. Is there any way to lessen the GC time? (We don't consider the first and second queries in this case.)
> > > > > > > > > > >
> > > > > > > > > > > 2. Performance tuning problem. The row count after filtering is not uniform; some nodes are heavier and spend more time than the others. Task times range from 4s to 16s. Is there any method to even this out?
> > > > > > > > > > >
> > > > > > > > > > > 3. Too much time for the first and second query. I know the dictionary and some indexes need to be loaded the first time, but even after trying the query below to preheat them, the first queries still take a long time. How can I preheat correctly?
> > > > > > > > > > >
> > > > > > > > > > >     select Aarray, a, b, c… from Table1 where Aarray is not null and d = "sss" and e != 22 and f = 33 and g = 44 and h = 55
> > > > > > > > > > >
> > > > > > > > > > > 4. Any other suggestions to lessen the query time?
> > > > > > > > > > >
> > > > > > > > > > > Some suggestions: the log from class QueryStatisticsRecorder gives me a good means of finding the bottleneck, but it is not enough. There are still some metrics I think would be very useful:
> > > > > > > > > > >
> > > > > > > > > > > 1. Filter ratio, i.e. not only result_size but also the original size, so we know how much data was filtered out.
> > > > > > > > > > >
> > > > > > > > > > > 2. IO time. scan_blocks_time is not enough. If it is high, we know something is wrong, but not what caused the problem; the real IO time for the data is not provided. As there may be several files for one partition, knowing whether the slowness comes from the datanode or from the executor itself would give us intuition for locating the problem.
> > > > > > > > > > >
> > > > > > > > > > > 3. The TableBlockInfo for a task. I log it myself when debugging; it tells me how many blocklets are local. The Spark web monitor only gives a locality level, while perhaps only one blocklet is actually local.
> > > > > > > > > > >
> > > > > > > > > > > ---------------------
> > > > > > > > > > >
> > > > > > > > > > > Anning Luo
> > > > > > > > > > >
> > > > > > > > > > > *HULU*
> > > > > > > > > > >
> > > > > > > > > > > Email: anning....@hulu.com
> > > > > > > > > > >
> > > > > > > > > > > lanan...@gmail.com