[ 
https://issues.apache.org/jira/browse/TAJO-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mai Hai Thanh updated TAJO-1091:
--------------------------------
    Description: 
Accurate selectivity estimates are very useful for the query optimizer to 
choose the best query plan. However, Tajo currently assumes that the 
selectivity of every single predicate is 0.1 (DEFAULT_SELECTION_FACTOR), 
which is far from correct in many cases. Therefore, I want to improve Tajo's 
selectivity estimation.

There are several alternative methods for selectivity estimation, such as 
histograms, wavelet transforms, singular value decomposition, the discrete 
cosine transform, kernel estimators, and sampling. Among these, histograms 
have been shown to be one of the most popular and effective ways to obtain 
accurate selectivity estimates. Thus, I propose to implement and use histograms 
in Tajo first; other methods can be added later if needed. (An ensemble method 
that combines the estimates of several of the above methods to get the best 
selectivity estimate is also attractive.)

More concretely, the histogram work involves the following tasks.

1. Implement a selectivity estimation interface
    + For adding new kinds of histograms
    + For using a histogram in the query optimizer
    + The design should accommodate different kinds of histograms as well as 
different kinds of selectivity estimation methods
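A minimal sketch of what such an interface might look like, with the trivial implementation mirroring today's constant-factor behavior (all names here are illustrative, not Tajo's actual API):

```java
// Illustrative sketch only: interface and class names are hypothetical,
// not Tajo's actual API.
interface SelectivityEstimator {
    /** Estimated fraction of rows (in [0, 1]) satisfying "column op value". */
    double estimateSelectivity(String column, String op, double value);
}

// The trivial estimator mirrors Tajo's current behavior: every predicate
// gets the same constant selectivity.
class ConstantEstimator implements SelectivityEstimator {
    static final double DEFAULT_SELECTION_FACTOR = 0.1;

    @Override
    public double estimateSelectivity(String column, String op, double value) {
        return DEFAULT_SELECTION_FACTOR;
    }
}
```

A histogram-backed implementation would then plug into the optimizer through the same interface.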

2. Implement some popular kinds of histograms
    + First candidates are EquiWidth and EquiDepth (i.e., EquiHeight)
    + Including histogram construction, maintenance, and selectivity 
estimation for a given predicate (both simple and complex predicates)
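As a sketch of the equi-width case (class and method names are illustrative; the code assumes numeric values with min < max): construction is a single pass over the column, and a range predicate is estimated by summing the full buckets below the constant plus a uniform-interpolation fraction of the boundary bucket:

```java
import java.util.Arrays;

// Illustrative equi-width histogram, not Tajo's actual implementation.
// Assumes numeric values with min < max.
class EquiWidthHistogram {
    private final double min;
    private final double width;
    private final long[] counts;
    private final long total;

    EquiWidthHistogram(double[] values, int numBuckets) {
        double lo = Arrays.stream(values).min().getAsDouble();
        double hi = Arrays.stream(values).max().getAsDouble();
        this.min = lo;
        this.width = (hi - lo) / numBuckets;
        this.counts = new long[numBuckets];
        for (double v : values) {
            int b = (int) ((v - lo) / width);
            if (b == numBuckets) b--;   // the max value falls into the last bucket
            counts[b]++;
        }
        this.total = values.length;
    }

    /** Estimated fraction of rows with value < v. */
    double selectivityLessThan(double v) {
        if (v <= min) return 0.0;
        if (v >= min + width * counts.length) return 1.0;
        int b = (int) ((v - min) / width);
        long below = 0;
        for (int i = 0; i < b; i++) below += counts[i];
        // Assume values are uniformly distributed inside the boundary bucket.
        double frac = (v - (min + b * width)) / width;
        return (below + frac * counts[b]) / total;
    }
}
```

EquiDepth differs only in how bucket boundaries are chosen (equal row counts per bucket rather than equal value ranges), so both fit behind the same estimation interface.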

3. Implement the registration of histogram information in the Catalog Server
    + Histograms can be a sub-part of TableStats
    + The content of a histogram is stored in the Catalog Server

4. Implement a histogram creation and maintenance mechanism
    + Constructing a histogram for a column requires at least a column scan 
(full or random partial scan), so it may take a substantial amount of time. By 
default, histogram creation and maintenance can be triggered after the system 
has seen no query processing for a configurable amount of time. The creation 
and maintenance process can also be forced to run by a command, such as 
"ANALYZE table_name"
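The idle-based trigger could be as simple as tracking the timestamp of the last query and comparing it against a threshold. A hypothetical sketch (the names and the wiring into a background scheduler are assumptions, not an existing Tajo mechanism):

```java
// Hypothetical sketch of the idle-based trigger; a background scheduler
// would poll shouldRefresh() and run histogram maintenance when it
// returns true.
class IdleTrigger {
    private final long idleThresholdMs;
    private long lastQueryMs;

    IdleTrigger(long idleThresholdMs, long startMs) {
        this.idleThresholdMs = idleThresholdMs;
        this.lastQueryMs = startMs;
    }

    /** Called whenever a query starts, postponing histogram maintenance. */
    void onQuery(long nowMs) {
        lastQueryMs = nowMs;
    }

    /** True once no query has run for at least idleThresholdMs. */
    boolean shouldRefresh(long nowMs) {
        return nowMs - lastQueryMs >= idleThresholdMs;
    }
}
```

An explicit "ANALYZE table_name" command would simply bypass this check and run maintenance immediately.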

5. Improve the query optimizer to use histograms
    + If a histogram is available, use the histogram-based estimate instead of 
the default value
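A sketch of this fallback rule (all names hypothetical): the planner asks a per-column registry for an estimator and falls back to the constant factor when the column has no histogram:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch: a per-column registry of histogram-backed
// estimators, falling back to the constant factor for columns that
// have no histogram registered.
class SelectivityResolver {
    static final double DEFAULT_SELECTION_FACTOR = 0.1;

    // column name -> estimator for "column < v", e.g. backed by a histogram
    private final Map<String, DoubleUnaryOperator> estimators = new HashMap<>();

    void register(String column, DoubleUnaryOperator lessThanEstimator) {
        estimators.put(column, lessThanEstimator);
    }

    double selectivityLessThan(String column, double v) {
        DoubleUnaryOperator est = estimators.get(column);
        return est != null ? est.applyAsDouble(v) : DEFAULT_SELECTION_FACTOR;
    }
}
```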

6. Further future improvements
    6.1. A table may have many columns. Creating and maintaining histograms for 
all columns of all tables is too costly. Hence, we should monitor which columns 
are frequently accessed and create histograms only for those columns.
    6.2. For simplicity, the construction and maintenance processes of a 
histogram are independent of those of other histograms. However, for 
efficiency, those processes for different histograms of the same table should 
be done at the same time to exploit cache locality.
    6.3. So far, we have considered only one-dimensional histograms (of which 
EquiWidth and EquiDepth are representatives). These histograms suffice if the 
value distributions of all attributes are independent of each other. However, 
this is not always true in real-life data, so additional techniques to detect 
and exploit dependencies among attributes would be useful.
    6.4. The selectivities of past predicates (from past queries) should be 
used in some way to improve future selectivity estimation.
    6.5. For each column, we should find the list of Most Common Values and 
compute their selectivities. This kind of information will be very useful when 
used together with histograms.
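For 6.5, a hedged sketch of how an MCV list could serve an equality predicate (similar in spirit to what other database systems do; the class name, the API, and the uniformity assumption for non-MCV values are all assumptions here): MCVs return their exact measured selectivity, and the remaining probability mass is spread uniformly over the remaining distinct values.

```java
import java.util.Map;

// Hypothetical sketch: equality-predicate estimation with a Most Common
// Values list. MCVs get their exact measured selectivity; the remaining
// probability mass is spread uniformly over the remaining distinct values.
class McvEstimator {
    private final Map<Double, Double> mcvSelectivities; // value -> selectivity
    private final long otherDistinctValues;             // distinct non-MCV values

    McvEstimator(Map<Double, Double> mcvSelectivities, long otherDistinctValues) {
        this.mcvSelectivities = mcvSelectivities;
        this.otherDistinctValues = otherDistinctValues;
    }

    /** Estimated fraction of rows with value == v. */
    double selectivityEquals(double v) {
        Double exact = mcvSelectivities.get(v);
        if (exact != null) return exact;
        double mcvMass = 0.0;
        for (double s : mcvSelectivities.values()) mcvMass += s;
        return otherDistinctValues == 0 ? 0.0
                : (1.0 - mcvMass) / otherDistinctValues;
    }
}
```

Range predicates would instead go to the histogram, so both structures complement each other.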


  was:
Accurate selectivity estimations are very useful for the query optimizer to 
choose the best query plan. However, Tajo currently assumes that the 
selectivity of all single predicates is 0.1 (in DEFAULT_SELECTION_FACTOR), 
which is far from correct in many cases. Therefore, I want to improve Tajo's 
ability in selectivity estimation.

There are alternative methods for selectivity estimation, such as histogram, 
wavelet transformation, singular value decomposition, discrete cosine 
transform, kernel estimators, and sampling. Among these methods, histograms 
have been shown to be one of the most popular and effective ways to obtain 
accurate selectivity estimates. Thus, I propose to implement and use histograms 
in Tajo soon. Other methods can be added later if needed. (An ensemble method 
that combines the results of many methods among the above ones to get the best 
selectivity estimate is also attractive)

For histograms, more technically, there are some tasks to do as follows.

1. Implement a selectivity estimation interface
    + For the adding of a new histogram
    + For the use of a histogram in the query optimizer
    + In doing this, should consider the use of different kinds of histograms 
as well as different kinds of selectivity estimation methods

2. Implement some popular kinds of histograms
    + First candidates are EquiWidth and EquiDepth (i.e., EquiHeight)
    + Including: histogram construction, maintenance, and selectivity 
estimation given a predicate (both simple and complex predicates)

3. Implement the registration of histogram information to Catalog Server
    + Histograms can be a sub-part of a TableStats
    + The content of a histogram is stored in a file (Question: should this 
file be stored only in TajoMaster ? or replicated in all TajoWorkers ?)

4. Implement a histogram creation and maintenance mechanism
    + Constructing a histogram for a column requires at least a column scan 
(full or random partial scan), so it may take a substantial amount of time. By 
default, histogram creation and maintenance can be triggered after a default 
amount of time that there is no query processing. On the other hand, the 
creation and maintenance process can also be forced to run by a command through 
the CLI

5. Improve the query optimizer to use histograms
    + If a histogram is available, use histogram instead of the default value

6. Further future improvements
    6.1. A table may have many columns. Creating and maintaining histograms for 
all columns of all tables is too costly. Hence, we should have a method to 
monitor the frequently accessed columns and we create histograms for only these 
special columns.
    6.2. For simplicity, the construction and maintenance processes of a 
histogram are independent of those processes of other histograms. However, for 
efficiency, those processes of different histograms of the same table should be 
done at the same time to exploit cache locality.
    6.3. So far, we have considered only one-dimensional histograms (in which 
EquiWidth and EquiDepth are representatives). These histograms are enough for 
us if the value distributions of all attributes are independent of each other. 
However, this is not always true in real-life data. So, additional techniques 
to detect and exploit dependency of the attributes will be useful.
    6.4. The selectivity of past predicates (of past queries) should be used in 
some ways to improve the future selectivity estimation.



> (Umbrella) Improve selectivity estimation by histograms
> -------------------------------------------------------
>
>                 Key: TAJO-1091
>                 URL: https://issues.apache.org/jira/browse/TAJO-1091
>             Project: Tajo
>          Issue Type: Improvement
>            Reporter: Mai Hai Thanh
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
