[
https://issues.apache.org/jira/browse/TAJO-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mai Hai Thanh updated TAJO-1091:
--------------------------------
Description:
Accurate selectivity estimates are very useful for the query optimizer to
choose the best query plan. However, Tajo currently assumes that the
selectivity of every single predicate is 0.1 (DEFAULT_SELECTION_FACTOR),
which is far from correct in many cases. Therefore, I want to improve Tajo's
selectivity estimation capabilities.
There are several alternative methods for selectivity estimation, such as
histograms, wavelet transforms, singular value decomposition, discrete cosine
transforms, kernel estimators, and sampling. Among these methods, histograms
have been shown to be one of the most popular and effective ways to obtain
accurate selectivity estimates. Thus, I propose to implement and use histograms
in Tajo soon. Other methods can be added later if needed. (An ensemble method
that combines the results of several of the above methods to obtain the best
selectivity estimate is also attractive.)
More technically, adopting histograms involves the following tasks.
1. Implement a selectivity estimation interface
+ For adding new kinds of histograms
+ For using a histogram in the query optimizer
+ The design should accommodate different kinds of histograms as well as
different kinds of selectivity estimation methods
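As a rough sketch, such an interface might look like the following. All names here are hypothetical and do not exist in Tajo; only DEFAULT_SELECTION_FACTOR mirrors the existing constant mentioned above.

```java
// Hypothetical sketch of a selectivity estimation interface; these names do
// not exist in Tajo and only illustrate the proposed design.
interface SelectivityEstimator {
  // Estimated fraction of rows in [0.0, 1.0] satisfying "column <op> value".
  double estimateSelectivity(String column, String op, double value);
}

// Fallback that mirrors the current behavior: every predicate gets the
// constant DEFAULT_SELECTION_FACTOR, regardless of the data.
class DefaultSelectivityEstimator implements SelectivityEstimator {
  static final double DEFAULT_SELECTION_FACTOR = 0.1;

  @Override
  public double estimateSelectivity(String column, String op, double value) {
    return DEFAULT_SELECTION_FACTOR;
  }
}
```

A histogram-backed implementation of the same interface could then be plugged into the optimizer without changing its call sites.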
2. Implement some popular kinds of histograms
+ First candidates are EquiWidth and EquiDepth (i.e., EquiHeight)
+ Including: histogram construction, maintenance, and selectivity
estimation given a predicate (both simple and complex predicates)
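The steps above can be illustrated with a minimal equi-width sketch. This is not Tajo code; the class name is hypothetical, the sketch assumes a numeric column with at least two distinct values, and it uses the standard uniform-distribution assumption within each bucket.

```java
import java.util.Arrays;

// Illustrative equi-width histogram (hypothetical class, not existing Tajo
// code): fixed-width buckets over [min, max], with selectivity estimation
// for a "column < v" predicate under the uniform-within-bucket assumption.
class EquiWidthHistogram {
  final double min;
  final double width;
  final long[] counts;  // number of rows falling into each bucket
  final long total;     // total number of rows

  EquiWidthHistogram(double[] values, int numBuckets) {
    double lo = Arrays.stream(values).min().getAsDouble();
    double hi = Arrays.stream(values).max().getAsDouble();
    this.min = lo;
    this.width = (hi - lo) / numBuckets;
    this.counts = new long[numBuckets];
    for (double v : values) {
      // Clamp so the maximum value lands in the last bucket.
      int b = (int) Math.min((v - lo) / width, numBuckets - 1);
      counts[b]++;
    }
    this.total = values.length;
  }

  // Estimated selectivity of "column < v": buckets entirely below v count
  // fully; the bucket containing v contributes a linear fraction.
  double estimateLessThan(double v) {
    double matched = 0;
    for (int b = 0; b < counts.length; b++) {
      double bucketLo = min + b * width;
      double bucketHi = bucketLo + width;
      if (v >= bucketHi) {
        matched += counts[b];
      } else if (v > bucketLo) {
        matched += counts[b] * (v - bucketLo) / width;
      }
    }
    return matched / total;
  }
}
```

For a uniform column containing the values 0..99 split into 10 buckets, the estimate for "column < 50" comes out close to the true selectivity of 0.50.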
3. Implement the registration of histogram information in the Catalog Server
+ Histograms can be a sub-part of TableStats
+ The content of a histogram is stored in the Catalog Server
4. Implement a histogram creation and maintenance mechanism
+ Constructing a histogram for a column requires at least a column scan
(full or random partial scan), so it may take a substantial amount of time. By
default, histogram creation and maintenance can be triggered after the system
has had no query processing for a configurable period of time. The creation and
maintenance process can also be forced to run by a command, such as
"ANALYZE table_name"
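A minimal sketch of the idle-time trigger described above (hypothetical class, not Tajo code): maintenance runs once no query has arrived for the configured period, or immediately when forced, e.g. by an "ANALYZE table_name" command.

```java
// Hypothetical sketch of the idle-time maintenance trigger (not Tajo code):
// rebuild histograms once no query has run for a configurable period, or
// immediately when forced (e.g., by "ANALYZE table_name").
class HistogramMaintenanceTrigger {
  private final long idleThresholdMillis;
  private long lastQueryTimeMillis;

  HistogramMaintenanceTrigger(long idleThresholdMillis, long nowMillis) {
    this.idleThresholdMillis = idleThresholdMillis;
    this.lastQueryTimeMillis = nowMillis;
  }

  // Called whenever a query starts, postponing background maintenance.
  void onQuery(long nowMillis) {
    lastQueryTimeMillis = nowMillis;
  }

  // True when maintenance should run: either forced or the system is idle.
  boolean shouldRebuild(long nowMillis, boolean forced) {
    return forced || nowMillis - lastQueryTimeMillis >= idleThresholdMillis;
  }
}
```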
5. Improve the query optimizer to use histograms
+ If a histogram is available, use it instead of the default selectivity value
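The optimizer-side fallback can be sketched as follows (hypothetical class, not Tajo code): use a histogram-based estimate when one has been registered for a column, otherwise fall back to the current constant.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the optimizer-side lookup (not Tajo code): prefer a
// registered histogram-based estimate, otherwise fall back to the constant
// DEFAULT_SELECTION_FACTOR used today.
class SelectivityLookup {
  static final double DEFAULT_SELECTION_FACTOR = 0.1;
  private final Map<String, Double> histogramEstimates = new HashMap<>();

  // In a real system this would consult a histogram fetched from the
  // Catalog Server rather than a precomputed per-column estimate.
  void register(String column, double estimate) {
    histogramEstimates.put(column, estimate);
  }

  double selectivityOf(String column) {
    return histogramEstimates.getOrDefault(column, DEFAULT_SELECTION_FACTOR);
  }
}
```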
6. Further future improvements
6.1. A table may have many columns, and creating and maintaining histograms
for all columns of all tables is too costly. Hence, we should monitor which
columns are frequently accessed and create histograms only for those columns.
6.2. For simplicity, the construction and maintenance of a histogram are
independent of those of other histograms. For efficiency, however, these
processes should run at the same time for different histograms of the same
table, to exploit cache locality.
6.3. So far, we have considered only one-dimensional histograms (of which
EquiWidth and EquiDepth are representatives). These histograms are sufficient
if the value distributions of all attributes are independent of each other.
However, this is not always true in real-life data, so additional techniques
to detect and exploit dependencies among attributes will be useful.
6.4. The selectivity of past predicates (from past queries) should be used
in some way to improve future selectivity estimation.
6.5. For each column, we should find the list of Most Common Values (MCVs)
and compute their selectivities. This kind of information is very useful when
used together with histograms.
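The MCV idea can be sketched as follows (hypothetical class, not Tajo code): collect the k most frequent values of a column together with the selectivity of an equality predicate on each of them.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical MCV sketch (not Tajo code): find the k most common values of
// a column and the selectivity (frequency / row count) of an equality
// predicate on each, ordered from most to least frequent.
class MostCommonValues {
  static LinkedHashMap<Integer, Double> topK(int[] column, int k) {
    Map<Integer, Long> freq = new HashMap<>();
    for (int v : column) {
      freq.merge(v, 1L, Long::sum);
    }
    LinkedHashMap<Integer, Double> mcv = new LinkedHashMap<>();
    freq.entrySet().stream()
        .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
        .limit(k)
        .forEach(e -> mcv.put(e.getKey(), (double) e.getValue() / column.length));
    return mcv;
  }
}
```

MCVs capture heavy hitters that a bucket average would smear out; a histogram can then cover only the remaining, less skewed values.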
was:
Accurate selectivity estimations are very useful for the query optimizer to
choose the best query plan. However, Tajo currently assumes that the
selectivity of all single predicates is 0.1 (in DEFAULT_SELECTION_FACTOR),
which is far from correct in many cases. Therefore, I want to improve Tajo's
ability in selectivity estimation.
There are alternative methods for selectivity estimation, such as histogram,
wavelet transformation, singular value decomposition, discrete cosine
transform, kernel estimators, and sampling. Among these methods, histograms
have been shown to be one of the most popular and effective ways to obtain
accurate selectivity estimates. Thus, I propose to implement and use histograms
in Tajo soon. Other methods can be added later if needed. (An ensemble method
that combines the results of many methods among the above ones to get the best
selectivity estimate is also attractive)
For histograms, more technically, there are some tasks to do as follows.
1. Implement a selectivity estimation interface
+ For the adding of a new histogram
+ For the use of a histogram in the query optimizer
+ In doing this, should consider the use of different kinds of histograms
as well as different kinds of selectivity estimation methods
2. Implement some popular kinds of histograms
+ First candidates are EquiWidth and EquiDepth (i.e., EquiHeight)
+ Including: histogram construction, maintenance, and selectivity
estimation given a predicate (both simple and complex predicates)
3. Implement the registration of histogram information to Catalog Server
+ Histograms can be a sub-part of a TableStats
+ The content of a histogram is stored in a file (Question: should this
file be stored only in TajoMaster ? or replicated in all TajoWorkers ?)
4. Implement a histogram creation and maintenance mechanism
+ Constructing a histogram for a column requires at least a column scan
(full or random partial scan), so it may take a substantial amount of time. By
default, histogram creation and maintenance can be triggered after a default
amount of time that there is no query processing. On the other hand, the
creation and maintenance process can also be forced to run by a command through
the CLI
5. Improve the query optimizer to use histograms
+ If a histogram is available, use histogram instead of the default value
6. Further future improvements
6.1. A table may have many columns. Creating and maintaining histograms for
all columns of all tables is too costly. Hence, we should have a method to
monitor the frequently accessed columns and we create histograms for only these
special columns.
6.2. For simplicity, the construction and maintenance processes of a
histogram are independent of those processes of other histograms. However, for
efficiency, those processes of different histograms of the same table should be
done at the same time to exploit cache locality.
6.3. So far, we have considered only one-dimensional histograms (in which
EquiWidth and EquiDepth are representatives). These histograms are enough for
us if the value distributions of all attributes are independent of each other.
However, this is not always true in real-life data. So, additional techniques
to detect and exploit dependency of the attributes will be useful.
6.4. The selectivity of past predicates (of past queries) should be used in
some ways to improve the future selectivity estimation.
> (Umbrella) Improve selectivity estimation by histograms
> -------------------------------------------------------
>
> Key: TAJO-1091
> URL: https://issues.apache.org/jira/browse/TAJO-1091
> Project: Tajo
> Issue Type: Improvement
> Reporter: Mai Hai Thanh
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)