Sergey Shelukhin commented on HIVE-19418:
Let me fix in an addendum commit on branch-3.
We've increased the timeouts for TestStatsUpdaterThread; however, one of the
tests may need its timeout increased further.
> add background stats updater similar to compactor
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Fix For: 3.1.0, 4.0.0
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch,
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch,
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch,
> HIVE-19418.07.patch, HIVE-19418.patch
> There's a JIRA, HIVE-19416, to add a snapshot version to stats for MM/ACID
> tables to make them usable in a transaction without breaking ACID (for
> metadata-only optimization). However, stats for ACID tables can still become
> unusable if, e.g., two parallel inserts run: neither sees the data written by
> the other, so after both finish, the snapshot on either set of stats won't
> match the current snapshot and the stats will be unusable.
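The parallel-insert problem can be illustrated with a toy model. This is only a sketch: it assumes a snapshot can be represented as the set of committed write IDs, which is a simplification, and `stats_usable` and the variable names are hypothetical rather than Hive APIs.

```python
# Toy model (hypothetical names): stats are usable only if the snapshot they
# were computed against equals the table's current snapshot.

def stats_usable(stats_snapshot: frozenset, current_snapshot: frozenset) -> bool:
    """Stats are trusted only when they cover exactly the committed writes."""
    return stats_snapshot == current_snapshot

base = frozenset({1})  # write 1 is already committed

# Writes 2 and 3 run in parallel; neither sees the data written by the other.
stats_from_insert_2 = base | {2}
stats_from_insert_3 = base | {3}

# After both inserts commit, the table's snapshot contains both writes.
current = base | {2, 3}

usable_2 = stats_usable(stats_from_insert_2, current)
usable_3 = stats_usable(stats_from_insert_3, current)
```

In this model both `usable_2` and `usable_3` are false, so neither set of stats can serve a metadata-only query until something recomputes them.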
> Additionally, for ACID and non-ACID tables alike, most stats, with some
> exceptions like numRows, cannot be aggregated (e.g. you cannot combine NDVs
> from two inserts), and for ACID even less can be aggregated (you cannot
> derive min/max if some rows are deleted but you don't scan the rest of the
> data).
> Therefore we will add background logic to the metastore (similar to, and
> partially inside, the ACID compactor) to update stats.
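A small worked example of the aggregation point made above, using made-up column values and exact distinct counts (an assumption for illustration; it does not model how Hive stores column stats):

```python
# Column values written by two separate inserts (made-up data).
insert_a = [1, 2, 3, 4]
insert_b = [3, 4, 5]

# numRows is additive: the per-insert counts can simply be summed.
num_rows = len(insert_a) + len(insert_b)

# Exact distinct counts are not additive: the values 3 and 4 appear in both.
ndv_a = len(set(insert_a))
ndv_b = len(set(insert_b))
true_ndv = len(set(insert_a + insert_b))

# With deletes, a stored max can go stale: once the row holding it is deleted,
# the new max is only known after scanning the remaining rows.
stored_max = max(insert_a + insert_b)
remaining = [v for v in insert_a + insert_b if v != stored_max]
true_max = max(remaining)
```

Here `ndv_a + ndv_b` overstates the real number of distinct values, and `stored_max` no longer matches `true_max` after the delete.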
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can
> be expensive, so if the user is only analyzing a subset of tables, the updater
> should be able to update only that subset). We can simply look at existing
> stats and only analyze the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update.
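The three modes and the per-table opt-out can be sketched as a small decision function. Everything here is an assumption made for illustration: the enum values, the `skip.stats.autoupdate` parameter name, and the "fresh"/"stale"/"missing" column states are hypothetical, not taken from the patch.

```python
from enum import Enum
from typing import Dict, List


class StatsUpdateMode(Enum):
    OFF = "off"            # mode 1: do nothing
    EXISTING = "existing"  # mode 2: refresh only stats that exist but are stale
    ALL = "all"            # mode 3: mode 2 plus create missing stats


# Hypothetical table parameter name used to opt a table out of updates.
SKIP_PARAM = "skip.stats.autoupdate"


def columns_to_analyze(mode: StatsUpdateMode,
                       table_params: Dict[str, str],
                       column_state: Dict[str, str]) -> List[str]:
    """Return the columns whose stats the updater should (re)compute.

    column_state maps a column name to "fresh", "stale", or "missing".
    """
    if mode is StatsUpdateMode.OFF:
        return []
    if table_params.get(SKIP_PARAM, "false").lower() == "true":
        return []
    wanted = {"stale"} if mode is StatsUpdateMode.EXISTING else {"stale", "missing"}
    return [col for col, state in column_state.items() if state in wanted]


state = {"id": "fresh", "name": "stale", "price": "missing"}
existing_only = columns_to_analyze(StatsUpdateMode.EXISTING, {}, state)
everything = columns_to_analyze(StatsUpdateMode.ALL, {}, state)
skipped = columns_to_analyze(StatsUpdateMode.ALL, {SKIP_PARAM: "true"}, state)
```

Keeping the opt-out check separate from the mode check means a table can be excluded even when the global mode is the most aggressive one.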
> In phase 1, the process will operate outside of the compactor and run the
> analyze command on the table. The analyze command will automatically save the
> stats with ACID snapshot information if needed, based on HIVE-19416, so we
> don't need to do any special state management, and this will work for all
> table types. However, it's also more expensive.
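As a rough sketch of phase 1, the updater could assemble an `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` statement for the relevant partition and columns. The helper `build_analyze_command` is hypothetical; it does no identifier quoting or escaping, and whether an explicit column list is accepted after `FOR COLUMNS` depends on the Hive version.

```python
from typing import Dict, List, Optional


def build_analyze_command(db: str,
                          table: str,
                          partition: Optional[Dict[str, str]] = None,
                          columns: Optional[List[str]] = None) -> str:
    """Build an analyze statement string (illustrative only, no escaping)."""
    cmd = f"ANALYZE TABLE {db}.{table}"
    if partition:
        # Restrict the scan to one partition, e.g. PARTITION (ds='2018-05-01').
        spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
        cmd += f" PARTITION ({spec})"
    cmd += " COMPUTE STATISTICS FOR COLUMNS"
    if columns:
        # Restrict the work to the columns whose stats need updating.
        cmd += " " + ", ".join(columns)
    return cmd


whole_table = build_analyze_command("default", "sales")
one_partition = build_analyze_command(
    "default", "sales", partition={"ds": "2018-05-01"}, columns=["id", "price"]
)
```

Limiting the statement to specific partitions and columns is what keeps mode 2 cheaper than a full re-analysis of the table.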
> In phase 2, we can explore adding stats collection during MM compaction that
> uses a temp table. If we don't have open writers during major compaction (so
> we overwrite all of the data), the temp table stats can simply be copied over
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor
> that is not query based, the same way as we'd do for (2). Alternatively we
> can wait for ACID compactor to become query based and just reuse (2).