[jira] [Updated] (IMPALA-10083) Improve row count estimates when stats are not available

Sahil Takiar (Jira) Thu, 13 Aug 2020 10:14:22 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-10083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sahil Takiar updated IMPALA-10083:
----------------------------------
    Description: 
There are various improvements that we can make to estimate row count stats 
even if stats are not available for a table.

There are various factors to consider here:
 * Handling for partitioned vs. non-partitioned tables
 ** Handling for partitioned tables can be a bit tricky if the table is in a 
mixed state - some partitions have row counts while others don't
 * Interoperability with other systems such as Hive and Spark
 * Users can run alter table statements to manually set the value of the row 
count
 * Handling of corrupt stats vs. missing stats
 ** Corrupt stats are defined as stats values less than -1, or values of 0 when 
the underlying table has nonempty files
 ** Missing stats are stats that have just not been computed, and are marked as 
such with the value -1
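
As an illustration of the corrupt vs. missing distinction above, here is a 
minimal sketch in Python. The names (StatsState, classify_row_count, 
total_file_bytes) are hypothetical and are not taken from the Impala code 
base; the sketch only encodes the two definitions given in the list.

```python
from enum import Enum


class StatsState(Enum):
    VALID = "valid"
    MISSING = "missing"
    CORRUPT = "corrupt"


def classify_row_count(num_rows: int, total_file_bytes: int) -> StatsState:
    """Classify a stored row count using the definitions in this description.

    num_rows: the stored row count stat (-1 means "not computed").
    total_file_bytes: hypothetical input, the total size of the underlying
    files, used only to tell whether the table has nonempty files.
    """
    if num_rows == -1:
        # Stats have simply not been computed.
        return StatsState.MISSING
    if num_rows < -1:
        # No legitimate stat is below -1.
        return StatsState.CORRUPT
    if num_rows == 0 and total_file_bytes > 0:
        # A zero row count contradicts nonempty files.
        return StatsState.CORRUPT
    return StatsState.VALID
```

For example, classify_row_count(0, 4096) is classified as corrupt, while 
classify_row_count(0, 0) is treated as a valid count for an empty table.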

This JIRA will be used to track the various improvements via sub-tasks.
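
To make the mixed state case for partitioned tables more concrete, the 
following Python sketch shows one possible heuristic: extrapolate the row 
count of partitions without usable stats from the rows-per-byte ratio of the 
partitions that do have them. This is an assumption made for illustration 
only, not a description of what Impala does or of what the sub-tasks will 
implement; estimate_table_rows and its input layout are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def estimate_table_rows(partitions: Sequence[Tuple[int, int]]) -> Optional[int]:
    """Estimate a table row count from (num_rows, file_bytes) per partition.

    A partition's stat is treated as usable when num_rows > 0, or when
    num_rows == 0 and the partition has no data. Everything else (-1,
    values below -1, or 0 with nonempty files) is treated as unknown.
    Returns None when there is nothing to extrapolate from.
    """
    known_rows = 0
    known_bytes = 0
    unknown_bytes = 0
    for num_rows, file_bytes in partitions:
        usable = num_rows > 0 or (num_rows == 0 and file_bytes == 0)
        if usable:
            known_rows += num_rows
            known_bytes += file_bytes
        else:
            unknown_bytes += file_bytes

    if unknown_bytes == 0:
        # Every partition with data has a usable row count.
        return known_rows
    if known_rows == 0 or known_bytes == 0:
        # No partition gives a rows-per-byte ratio to extrapolate from.
        return None

    rows_per_byte = known_rows / known_bytes
    return known_rows + round(unknown_bytes * rows_per_byte)
```

The sketch assumes that rows are roughly the same size across partitions, 
which will not hold for every table or file format.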

  was:
There are various improvements that we can make to estimate row count stats 
even if stats are not available for a table.

There are various factors to consider here:
 * Handling for partitioned vs. non-partitioned tables
 ** Handling for partitioned tables can be a bit tricky if the table is in a 
mixed state - some partitions have row counts while other don't
 * Interoperability with other systems such as Hive and Spark
 * Users can run alter table statements to manually set the value of the row 
count

The JIRA will be used to track the various improvements via sub-tasks.


> Improve row count estimates when stats are not available
> --------------------------------------------------------
>
>                 Key: IMPALA-10083
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10083
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Sahil Takiar
>            Priority: Major



