[
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526097#comment-16526097
]
Peter Vary commented on HIVE-19418:
-----------------------------------
[~sershe], [~alangates]: As per our discussion on the hive-dev list (see the
[Apache dev list
archive|http://mail-archives.apache.org/mod_mbox/hive-dev/201712.mbox/%3CCDF09DF1-746E-4A4F-8644-8B441F386937%40cloudera.com%3E]),
I think the consensus was that we would like to keep the original exceptions
for the MetaStore Thrift API.
Shall I create a new patch that reverts the part of this change where
{{HiveMetaStoreClient.createTable}} started to throw
{{InvalidOperationException}} instead of the original {{MetaException}}? Or do
we see enough reason to reopen the discussion?
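To illustrate the compatibility concern: a caller written against the original
contract typically handles {{MetaException}} explicitly, so a different
exception type falls through to its generic handler. The sketch below is a
minimal, self-contained illustration only. The nested exception classes and
the {{TableCreator}} interface are hypothetical stand-ins, not the real
Thrift-generated Hive classes or the real client signature.

```java
public class ExceptionCompatSketch {
    // Hypothetical stand-ins for the Thrift-generated exception types.
    // They are NOT the real Hive classes; they only model two unrelated types.
    static class MetaException extends Exception {
        MetaException(String msg) { super(msg); }
    }

    static class InvalidOperationException extends Exception {
        InvalidOperationException(String msg) { super(msg); }
    }

    // Hypothetical abstraction over "something that creates a table".
    interface TableCreator {
        void createTable(String name) throws Exception;
    }

    // A caller written against the original contract: it has a dedicated
    // handler for MetaException and treats anything else as unexpected.
    static String callLegacyClient(TableCreator creator, String name) {
        try {
            creator.createTable(name);
            return "created";
        } catch (MetaException e) {
            return "handled: " + e.getMessage();
        } catch (Exception e) {
            return "unhandled: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Behaviour before the change: the failure surfaces as MetaException.
        TableCreator before = name -> {
            throw new MetaException("invalid table " + name);
        };
        // Behaviour after the change: same failure, different exception type.
        TableCreator after = name -> {
            throw new InvalidOperationException("invalid table " + name);
        };

        System.out.println(callLegacyClient(before, "t1"));
        System.out.println(callLegacyClient(after, "t1"));
    }
}
```

With the second creator the dedicated handler is skipped, which is the kind of
behaviour change existing API users would see.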
Thanks,
Peter
> add background stats updater similar to compactor
> -------------------------------------------------
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Fix For: 3.1.0, 4.0.0
>
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch,
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch,
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch,
> HIVE-19418.07.patch, HIVE-19418.patch
>
>
> There's a JIRA, HIVE-19416, to add a snapshot version to stats for MM/ACID
> tables to make them usable in a transaction without breaking ACID (for
> metadata-only optimization). However, stats for ACID tables can still become
> unusable if, e.g., two parallel inserts run: neither sees the data written by
> the other, so after both finish, the snapshot on either set of stats won't
> match the current snapshot and the stats will be unusable.
> Additionally, for ACID and non-ACID tables alike, a lot of the stats, with
> some exceptions like numRows, cannot be aggregated (e.g. you cannot combine
> NDVs from two inserts), and for ACID even less can be aggregated (you cannot
> derive min/max if some rows are deleted but you don't scan the rest of the
> dataset).
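A small worked example of the aggregation point above. This is only an
illustrative sketch: the class and method names are made up, and exact sets
stand in for whatever NDV estimator is actually used. Row counts from two
inserts add up exactly, but per-insert NDVs do not, because values shared by
both inserts are counted twice.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StatsMergeSketch {
    // Exact number of distinct values in one batch of column values.
    static long ndv(List<Integer> values) {
        return new HashSet<>(values).size();
    }

    // numRows is additive: the combined count is the sum of the parts.
    static long mergedRowCount(List<Integer> a, List<Integer> b) {
        return a.size() + b.size();
    }

    // Naive "aggregation" of NDV by summing the per-insert NDVs.
    static long summedNdv(List<Integer> a, List<Integer> b) {
        return ndv(a) + ndv(b);
    }

    // The true NDV requires looking at the values of both inserts together.
    static long trueNdv(List<Integer> a, List<Integer> b) {
        Set<Integer> all = new HashSet<>(a);
        all.addAll(b);
        return all.size();
    }

    public static void main(String[] args) {
        List<Integer> insertA = Arrays.asList(1, 2, 3);
        List<Integer> insertB = Arrays.asList(3, 4);

        System.out.println("numRows   = " + mergedRowCount(insertA, insertB)); // 5
        System.out.println("summedNdv = " + summedNdv(insertA, insertB));      // 5
        System.out.println("trueNdv   = " + trueNdv(insertA, insertB));        // 4
    }
}
```

The value 3 appears in both inserts, so the summed NDV (5) overstates the true
NDV (4), while the summed row count (5) is exact.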
> Therefore, we will add background logic to the metastore (similar to, and
> partially inside, the ACID compactor) to update stats.
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can
> be expensive, so if the user is only analyzing a subset of tables, they should
> be able to update only that subset). We can simply look at existing stats and
> only analyze the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update.
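The three modes and the per-table skip parameter could be sketched roughly as
follows. Everything here is hypothetical: the {{Mode}} enum, the method name
and the map-based representation of "existing stats" are assumptions made for
illustration, not the actual metastore implementation or its configuration
names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsUpdaterModeSketch {
    // Hypothetical names for the three modes described above.
    enum Mode { OFF, EXISTING, ALL }

    /**
     * Decide which columns of one table should be (re)analyzed.
     *
     * @param mode           updater mode
     * @param skipTableParam true if the table is marked to skip stats updates
     * @param allColumns     all columns of the table
     * @param statsUpToDate  column -> whether its existing stats are current;
     *                       a column absent from the map has no stats at all
     */
    static List<String> columnsToAnalyze(Mode mode, boolean skipTableParam,
                                         List<String> allColumns,
                                         Map<String, Boolean> statsUpToDate) {
        List<String> result = new ArrayList<>();
        if (mode == Mode.OFF || skipTableParam) {
            return result; // mode 1, or the table opted out
        }
        for (String col : allColumns) {
            Boolean upToDate = statsUpToDate.get(col);
            if (upToDate == null) {
                // No stats yet: only mode 3 creates them.
                if (mode == Mode.ALL) {
                    result.add(col);
                }
            } else if (!upToDate) {
                // Stats exist but are stale: refreshed in modes 2 and 3.
                result.add(col);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("a", "b", "c");
        Map<String, Boolean> stats = new HashMap<>();
        stats.put("a", true);   // current stats
        stats.put("b", false);  // stale stats
                                // "c" has no stats

        System.out.println(columnsToAnalyze(Mode.OFF, false, columns, stats));
        System.out.println(columnsToAnalyze(Mode.EXISTING, false, columns, stats));
        System.out.println(columnsToAnalyze(Mode.ALL, false, columns, stats));
        System.out.println(columnsToAnalyze(Mode.ALL, true, columns, stats));
    }
}
```

In this sketch, mode 2 only refreshes column b, mode 3 additionally picks up
column c, and the skip parameter overrides both.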
> In phase 1, the process will operate outside of the compactor and run the
> analyze command on the table. The analyze command will automatically save the
> stats with ACID snapshot information if needed, based on HIVE-19416, so we
> don't need to do any special state management and this will work for all
> table types. However, it's also more expensive.
> In phase 2, we can explore adding stats collection during MM compaction that
> uses a temp table. If we don't have open writers during major compaction (so
> we overwrite all of the data), the temp table stats can simply be copied over
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor
> that is not query based, the same way as we'd do for (2). Alternatively we
> can wait for ACID compactor to become query based and just reuse (2).