[
https://issues.apache.org/jira/browse/HIVE-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Garima Dosi updated HIVE-15352:
-------------------------------
Attachment: Hive MVCC - Requirement & Design.pdf
> MVCC (Multi Versioned Concurrency Control) in Hive
> --------------------------------------------------
>
> Key: HIVE-15352
> URL: https://issues.apache.org/jira/browse/HIVE-15352
> Project: Hive
> Issue Type: New Feature
> Reporter: Garima Dosi
> Attachments: Hive MVCC - Requirement & Design.pdf
>
>
> Use Case
> While working with providing solutions for various applications, we see that
> there is at times, a need to provide multi version concurrency support for
> certain datasets. The requirement of multi versioned concurrency is mainly
> due to two reasons –
> • Simultaneous querying and loading from tables or datasets, which requires
> maintaining versions for reading and writing (Locking is not the right option
> here)
> • Maintaining historical load of tables/datasets upto some extent
> Both of these requirements are seen in data management systems (warehouses
> etc).
> What happens without MVCC in Hive?
> In cases, where MVCC had to be done, design similar to this -
> https://dzone.com/articles/zookeeper-a-real-world-example-of-how-to-use-it
> was followed to make it work. Zookeeper was used to maintain versions and
> provide MVCC support. However, this design poses a limitation if a normal
> user would like to query a hive table because he will not be aware of the
> current version to be queried. The additional layer to match versions in
> zookeeper with the dataset to be queried introduces a bit of an overhead for
> normal users and hence, the request to make this feature available in Hive.
> Hive Design for Support of MVCC
> The hive design for MVCC support can be as described below (It would somewhat
> follow the article mentioned in the previous section) –
> 1. The first thing should be the ability for the user to specify that this is
> a MVCC table. So, a DDL something like this –
> create table <table_name> ( <column_specs>) MULTI_VERSIONED ON [sequence,
> time]
> Internally this DDL can be translated to a partitioned table either on a
> sequence number (auto-generated by Hive) or a timestamp. The metastore would
> keep this information.
> 2. DMLs related to inserting or loading data to the table would remain the
> same for an end user. However, internally Hive would automatically detect
> that a table is a multi-versioned table and write the new data to a new
> partition with a new version of the dataset. The Hive Metastore would also be
> updated with the current version.
> 3. DMLs related to querying data from the table would remain the same for a
> user. However, internally Hive would use the latest version for queries.
> Latest version is always stored in the metastore.
> Management of obsolete versions
> The obsolete versions can be deleted based on the following –
> 1.Either a setting which simply says delete the version which is older than a
> threshold and is not active, OR
> 2.By tracking the count of queries running on older versions and deleting the
> ones which are not the latest and are not being used by any query. This would
> require some sort of a background thread monitoring the table for obsolete
> versions. As shown in the article mentioned above, this would also require
> incrementing version count whenever a version is queried and decrement it
> once the query is done.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)