[
https://issues.apache.org/jira/browse/IMPALA-10083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sahil Takiar updated IMPALA-10083:
----------------------------------
Description:
There are various improvements that we can make to estimate row count stats
even if stats are not available for a table.
There are various factors to consider here:
* Handling for partitioned vs. non-partitioned tables
** Handling for partitioned tables can be a bit tricky if the table is in a
mixed state - some partitions have row counts while others don't
* Interoperability with other systems such as Hive and Spark
* Users can run ALTER TABLE statements to manually set the value of the row
count
* Handling of corrupt stats vs. missing stats
** Corrupt stats are defined as a stats value less than -1, or a value of 0
when the underlying table has nonempty files
** Missing stats are stats that have just not been computed, and are marked as
such with the value -1
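
The corrupt vs. missing distinction above can be sketched as a small
classifier. This is only an illustration of the definitions in this
description, not Impala's actual frontend code; the names RowCountState,
classify_row_count, and has_nonempty_files are hypothetical.

```python
from enum import Enum


class RowCountState(Enum):
    """Hypothetical labels for the state of a row count statistic."""
    VALID = "valid"
    MISSING = "missing"
    CORRUPT = "corrupt"


def classify_row_count(num_rows: int, has_nonempty_files: bool) -> RowCountState:
    """Classify a row count value using the definitions in this description.

    num_rows: the stored row count (-1 marks stats that were never computed).
    has_nonempty_files: assumed flag saying whether the table (or partition)
        has at least one file with data in it.
    """
    if num_rows == -1:
        # Stats have simply not been computed.
        return RowCountState.MISSING
    if num_rows < -1:
        # No legitimate marker is below -1, so the value is corrupt.
        return RowCountState.CORRUPT
    if num_rows == 0 and has_nonempty_files:
        # A zero row count contradicts the presence of nonempty files.
        return RowCountState.CORRUPT
    return RowCountState.VALID
```

A row count of 0 on a table with no nonempty files is treated as valid
here, since the description only calls 0 corrupt when nonempty files exist.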
The JIRA will be used to track the various improvements via sub-tasks.
was:
There are various improvements that we can make to estimate row count stats
even if stats are not available for a table.
There are various factors to consider here:
* Handling for partitioned vs. non-partitioned tables
** Handling for partitioned tables can be a bit tricky if the table is in a
mixed state - some partitions have row counts while others don't
* Interoperability with other systems such as Hive and Spark
* Users can run ALTER TABLE statements to manually set the value of the row
count
The JIRA will be used to track the various improvements via sub-tasks.
> Improve row count estimates when stats are not available
> --------------------------------------------------------
>
> Key: IMPALA-10083
> URL: https://issues.apache.org/jira/browse/IMPALA-10083
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Sahil Takiar
> Priority: Major