[ 
https://issues.apache.org/jira/browse/TAJO-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174524#comment-14174524
 ] 

Hyunsik Choi commented on TAJO-1120:
------------------------------------

Jihoon,

Thank you for your nice proposal. 

As you know, many kinds of statistics (e.g., histograms, the number of distinct 
values, min/max values, null density, and so on) are typically used in 
this field. Collecting them while storing a table may impose a heavy CPU and 
I/O burden. In my opinion, this feature needs more consideration.

Basically, each statistic type requires a different operation. For example:
- histogram - a sort (in-memory if the data set is small)
- number of distinct values - a sort (or hashing if the data set is small)
- min/max values - a full scan
- number of NULL values - a full scan
- number of rows - a full scan
- average length of text - a full scan
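
To make the cost difference concrete, here is a minimal sketch in Python of the operations listed above. It is not Tajo code: the names (ScanStats, scan_stats, distinct_count_hashed, distinct_count_sorted, equi_depth_bounds) are hypothetical, and the column is assumed to be an in-memory list of strings with None standing for NULL. It only illustrates that the scan-based statistics can share one pass, while distinct counts and histograms need a hash table or a sort.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ScanStats:
    """Statistics that a single pass over one column can produce."""
    num_rows: int = 0
    num_nulls: int = 0
    min_value: Optional[str] = None
    max_value: Optional[str] = None
    total_length: int = 0  # summed length of the non-null values

    def avg_length(self) -> float:
        non_null = self.num_rows - self.num_nulls
        return self.total_length / non_null if non_null else 0.0


def scan_stats(values: Iterable[Optional[str]]) -> ScanStats:
    """One full scan: row count, NULL count, min/max, and text length."""
    stats = ScanStats()
    for v in values:
        stats.num_rows += 1
        if v is None:
            stats.num_nulls += 1
            continue
        if stats.min_value is None or v < stats.min_value:
            stats.min_value = v
        if stats.max_value is None or v > stats.max_value:
            stats.max_value = v
        stats.total_length += len(v)
    return stats


def distinct_count_hashed(values: Iterable[Optional[str]]) -> int:
    """Hash-based distinct count; memory grows with the number of distinct values."""
    return len({v for v in values if v is not None})


def distinct_count_sorted(values: Iterable[Optional[str]]) -> int:
    """Sort-based distinct count: sort, then count the value changes."""
    ordered = sorted(v for v in values if v is not None)
    count = 0
    previous = None
    for v in ordered:
        if count == 0 or v != previous:
            count += 1
        previous = v
    return count


def equi_depth_bounds(values: Iterable[Optional[str]], num_buckets: int) -> List[str]:
    """Sort-based equi-depth histogram: the upper bound of each bucket."""
    ordered = sorted(v for v in values if v is not None)
    if not ordered or num_buckets <= 0:
        return []
    n = len(ordered)
    # Bucket i ends at position ceil(i * n / num_buckets) - 1 in sorted order.
    return [ordered[(i * n + num_buckets - 1) // num_buckets - 1]
            for i in range(1, num_buckets + 1)]


column = ["b", None, "a", "ccc", "a", None]
stats = scan_stats(column)
print(stats, stats.avg_length())
print(distinct_count_hashed(column), distinct_count_sorted(column))
print(equi_depth_bounds(column, 2))
```

In this sketch the four scan-based statistics add only constant work per row, so they could be folded into the write path cheaply. The sort-based ones are where the CPU and I/O concern above comes from once the data no longer fits in memory.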

First of all, we need to define the scope of this feature: for example, which 
statistics will be collected while a table is written. In addition, we can 
implement this feature in multiple steps, in which case we need a roadmap for 
it. Could you elaborate on your detailed plan and roadmap for this feature?

> Enable collecting column stats when storing a table if necessary
> ----------------------------------------------------------------
>
>                 Key: TAJO-1120
>                 URL: https://issues.apache.org/jira/browse/TAJO-1120
>             Project: Tajo
>          Issue Type: Improvement
>          Components: catalog
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>             Fix For: 0.9.1
>
>
> Currently, the number of null values and the max/min values of a column are 
> collected only in the shuffle stage.
> In addition, the number of distinct values of a column does not seem to be 
> collected anywhere. 
> However, some recent issues such as TAJO-838 and TAJO-1091 require these 
> statistics, and thus we need to collect them for tables that are newly stored 
> via CTAS or INSERT INTO statements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
