[ 
https://issues.apache.org/jira/browse/FLINK-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166006#comment-16166006
 ] 

ASF GitHub Bot commented on FLINK-6516:
---------------------------------------

Github user godfreyhe commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3860#discussion_r138843966
  
    --- Diff: flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/PhysicalTableSourceScan.scala ---
    @@ -70,4 +73,18 @@ abstract class PhysicalTableSourceScan(
     
       def copy(traitSet: RelTraitSet, tableSource: TableSource[_]): PhysicalTableSourceScan
     
    +  override def estimateRowCount(mq: RelMetadataQuery): Double = {
    +    val tableSourceTable = getTable.unwrap(classOf[TableSourceTable[_]])
    +
    +    if (tableSourceTable.getStatistic != FlinkStatistic.UNKNOWN) {
    --- End diff --
    
    Hi @fhueske, sorry for the late response. For a while my work was not
    focused on the Table API & SQL; I am now refocusing on it.
    
    There are mainly four cases:
    1. `DataSetTable` or `DataStreamTable`: default statistics are used
       (row count = 1000).
    2. `TableSourceTable` registered without a catalog: there are no statistics
       at the moment. (We can get them from the `TableSource`.)
    3. `TableSourceTable` registered in a catalog that contains statistics:
      3.1. If the `TableSource` is filterable (or partitionable), we may no
           longer be able to use the catalog's statistics and should use the
           `TableSource` statistics instead (for example, a Parquet table
           source).
      3.2. If the `TableSource` is non-filterable (or non-partitionable), we
           prefer the catalog's statistics because access is more efficient.
    4. `TableSourceTable` registered in a catalog that has no statistics: get
       the statistics from the `TableSource`.
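    
    As a rough illustration only, the four cases can be written as a small
    decision function. Everything here (`TableInfo`, `StatsOrigin`,
    `StatsCases.origin`) is a hypothetical name made up for this sketch, not an
    existing Flink class:
    
    ```scala
    // Hypothetical model of where the statistics would come from.
    sealed trait StatsOrigin
    case object DefaultRowCount extends StatsOrigin // case 1
    case object FromTableSource extends StatsOrigin // cases 2, 3.1 and 4
    case object FromCatalog extends StatsOrigin     // case 3.2

    // Hypothetical description of a registered table.
    final case class TableInfo(
        isTableSource: Boolean,
        inCatalog: Boolean,
        catalogHasStats: Boolean,
        filterableOrPartitionable: Boolean)

    object StatsCases {
      // Maps a table description to the statistics origin listed in the cases.
      def origin(t: TableInfo): StatsOrigin =
        if (!t.isTableSource) DefaultRowCount
        else if (t.inCatalog && t.catalogHasStats && !t.filterableOrPartitionable)
          FromCatalog
        else FromTableSource
    }
    ```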
    
    Furthermore, the statistics in the catalog may be wrong, in which case a
    user may want the statistics from the `TableSource` instead.
    
    So it is too difficult to let the framework choose the statistics source. I
    would prefer to let the user choose it. There are two approaches:
    1. A simple way: add a config option (`var preferCatalogStats: Boolean`) to
    `TableConfig`, so the user can choose the preferred statistics source
    through the table config. If `preferCatalogStats` is true, the framework
    uses the catalog statistics first and falls back to the `TableSource`
    statistics if they are null (or unknown). If `preferCatalogStats` is false,
    the order is reversed.
    
    2. A more complex way: let the user decide the statistics source when
    registering the table. We can change the `registerTable` and
    `registerTableSource` methods in `TableEnvironment`:
    
    ```
    // Register a table with a statistic; the framework will always use the
    // given statistic.
    def registerTable(
        name: String,
        table: Table,
        statistic: FlinkStatistic = FlinkStatistic.of(TableStats(1000L))): Unit
    
    // Register a table source with the user's preferred statistics source.
    def registerTableSource(
        name: String,
        tableSource: TableSource[_],
        preferCatalogStats: Boolean = false): Unit
    
    // Register a table source with a statistic; the framework will always use
    // the given statistic.
    def registerTableSource(
        name: String,
        tableSource: TableSource[_],
        statistic: FlinkStatistic): Unit
    ```
    This way, the user can choose the statistics source and supply more
    accurate statistics for each table.
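    
    To make the intended lookup order concrete, here is a minimal,
    self-contained sketch of both ideas. It is not Flink code: `Stats`,
    `StatsPolicy`, `StatsLookup` and `StatsRegistry` are hypothetical names,
    and `None` stands in for a null or unknown statistic.
    
    ```scala
    import scala.collection.mutable

    // Hypothetical stand-in for TableStats.
    final case class Stats(rowCount: Long)

    // Per-table choice made at registration time (approach 2).
    sealed trait StatsPolicy
    case object PreferCatalog extends StatsPolicy
    case object PreferSource extends StatsPolicy
    final case class Fixed(stats: Stats) extends StatsPolicy

    object StatsLookup {
      // Approach 1: try the preferred source first, then fall back to the other.
      def resolve(
          preferCatalogStats: Boolean,
          catalog: Option[Stats],
          source: Option[Stats]): Option[Stats] =
        if (preferCatalogStats) catalog.orElse(source)
        else source.orElse(catalog)
    }

    class StatsRegistry {
      private val policies = mutable.Map.empty[String, StatsPolicy]

      // Defaults to PreferSource, mirroring `preferCatalogStats = false`.
      def register(name: String, policy: StatsPolicy = PreferSource): Unit =
        policies(name) = policy

      // Returns None for unregistered tables or when no statistic is available.
      def statsFor(
          name: String,
          catalog: Option[Stats],
          source: Option[Stats]): Option[Stats] =
        policies.get(name).flatMap {
          case Fixed(stats)  => Some(stats) // a given statistic always wins
          case PreferCatalog => StatsLookup.resolve(true, catalog, source)
          case PreferSource  => StatsLookup.resolve(false, catalog, source)
        }
    }
    ```
    
    With this shape, a table registered with `Fixed(Stats(1000L))` always
    reports 1000 rows, while the other two policies only differ in which source
    is consulted first.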
    
    Looking forward to your advice, many thanks!


> using real row count instead of dummy row count when optimizing plan
> --------------------------------------------------------------------
>
>                 Key: FLINK-6516
>                 URL: https://issues.apache.org/jira/browse/FLINK-6516
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table API & SQL
>            Reporter: godfrey he
>            Assignee: godfrey he
>
> Currently, the statistic of {{TableSourceTable}} is mostly {{UNKNOWN}}, and
> the statistic from {{ExternalCatalog}} may also be null. In fact, only
> each {{TableSource}} knows its statistics exactly, especially
> {{FilterableTableSource}} and {{PartitionableTableSource}}. So we can add a
> {{getTableStats}} method to {{TableSource}} and use it in TableSourceScan's
> estimateRowCount method to get the real row count.


