Alexander Kolbasov created HIVE-19719:
-----------------------------------------

             Summary: Adding metastore batch API for partitions
                 Key: HIVE-19719
                 URL: https://issues.apache.org/jira/browse/HIVE-19719
             Project: Hive
          Issue Type: Improvement
          Components: Metastore
    Affects Versions: 3.1.0, 4.0.0
            Reporter: Alexander Kolbasov


Hive Metastore provides APIs for fetching collections of objects (usually 
tables or partitions). These APIs fetch all available objects, so the size of 
the response is O(N), where N is the number of objects. These calls have 
several problems:

* All objects (and there may be thousands or even millions) must be fetched 
from the database, serialized into a Java list of Thrift objects, and then 
serialized into a byte array for sending over the network. This creates huge 
spikes of memory pressure, especially since in some cases multiple copies of 
the same data are present in memory (e.g. the unserialized and serialized 
versions).
* Even though HMS tries to avoid string duplication by interning strings in 
Java, the duplicated strings must still be serialized into the output array.
* Java has a 2 GB limit on the maximum size of a byte array and crashes with 
an Out of Memory exception if this size is exceeded.
* Fetching a huge number of objects blows up DB caches and memory caches in 
the system. Receiving such huge messages also creates memory pressure on the 
receiver side (usually HS2), which can crash with an Out of Memory exception 
as well.
* Such requests have very high latency, since the server must collect all 
objects, serialize them, and send them all over the network before the client 
can do anything with the result.

To prevent Out of Memory exceptions, the server now has a configurable limit 
on the maximum number of objects returned. This helps avoid crashes, but 
doesn't allow for correct query execution, since the result is a random, 
incomplete subset of the objects.

Currently this is addressed on the client side by simulating batching: the 
client first fetches the list of table or partition names and then requests 
object information for slices of this list (a sketch of this workaround is 
shown below). Still, the list of names can be big as well, and this method 
requires locking to ensure that objects are not added or removed between the 
calls, especially if this is done outside of HS2.
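
For reference, a minimal sketch of this existing workaround using the current 
{{IMetaStoreClient}} calls; the batch size and error handling are illustrative 
only:

{code:java}
import java.util.List;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class NameBatchingWorkaround {
  // Fetch all partitions of a table in fixed-size slices of names.
  public static void fetchAll(IMetaStoreClient client, String db, String table)
      throws Exception {
    List<String> names = client.listPartitionNames(db, table, (short) -1);
    final int batchSize = 1000; // illustrative value
    for (int i = 0; i < names.size(); i += batchSize) {
      List<String> slice = names.subList(i, Math.min(i + batchSize, names.size()));
      List<Partition> batch = client.getPartitionsByNames(db, table, slice);
      // process(batch); partitions dropped or added between the two calls
      // may be missed, which is why locking is needed for consistency
    }
  }
}
{code}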

Instead, we can make a simple modification to the existing APIs which allows 
for batched, iterator-style operations without keeping any server-side state. 
The main idea is to have a unique, incrementing ID for each object. The IDs 
only need to be unique within their container (e.g. table IDs should be unique 
within a database and partition IDs should be unique within a table). 
Such an ID can easily be generated using the database auto-increment mechanism, 
or we can simply reuse the existing ID column that is already maintained by 
DataNucleus.
The request is then modified to include:

* the starting ID (i0)
* the batch size (B)
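
To make the shape concrete, a hypothetical request/response pair is sketched 
below; these are not existing Thrift structs, and all names are invented for 
illustration:

{code:java}
// Hypothetical shape of the batched call; in practice this would be a
// Thrift struct pair in hive_metastore.thrift. All names are illustrative.
public class PartitionBatchRequest {
  public String dbName;
  public String tableName;
  public long startId;  // i0: return objects with ID >= startId
  public int batchSize; // B: return at most this many objects
}

public class PartitionBatchResponse {
  public java.util.List<org.apache.hadoop.hive.metastore.api.Partition> partitions;
  public long lastId;   // the next request should start at lastId + 1
}
{code}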

The server fetches up to B objects starting from i0, serializes them, and 
sends them to the client. The client then requests the next batch using the ID 
of the last received object plus one. It is possible to construct an SQL query 
(either using DataNucleus JDOQL or in DirectSQL code) which selects only the 
needed objects, avoiding big reads from the database; a sketch of such a query 
follows. The client then iterates until it has fetched all the objects, and 
the memory size of each request is bounded by the batch size.
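
A minimal sketch of the server-side fetch, assuming the standard metastore 
PARTITIONS table with its PART_ID and TBL_ID columns; the quoting and LIMIT 
syntax vary by backing database, so treat this as illustrative rather than 
the actual DirectSQL code:

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BatchFetchSketch {
  // Fetches up to batchSize partitions with PART_ID >= startId and returns
  // the last ID seen, or -1 if the table is exhausted. The caller requests
  // the next batch starting from the returned ID plus one.
  static long fetchBatch(Connection conn, long tblId, long startId, int batchSize)
      throws Exception {
    String sql = "SELECT \"PART_ID\", \"PART_NAME\" FROM \"PARTITIONS\""
        + " WHERE \"TBL_ID\" = ? AND \"PART_ID\" >= ?"
        + " ORDER BY \"PART_ID\" LIMIT ?";
    long lastId = -1;
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setLong(1, tblId);
      ps.setLong(2, startId);
      ps.setInt(3, batchSize);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          lastId = rs.getLong(1);
          // materialize and serialize the partition row here
        }
      }
    }
    return lastId;
  }
}
{code}

The client-side loop is then just: start at 0, call the API, and repeat with 
lastId + 1 until an empty batch comes back.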
If we extend the API a little bit, providing a way to get the minimum 
and maximum ID values (either via a separate call or piggybacked on the normal 
reply), clients can request such batches concurrently, thus also reducing the 
latency; see the sketch after this paragraph. Clients can easily estimate the 
number of batches by knowing the total range of IDs. While this isn't a 
precise method, it is good enough to divide the work.
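
A sketch of dividing the ID space among concurrent workers, assuming the 
server exposes [minId, maxId]; the actual batched call is left as a stub:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelScanSketch {
  // Stub standing in for one batched HMS request over [startId, endId];
  // returns the last ID received, or -1 when the sub-range is exhausted.
  static long fetchBatch(long startId, long endId, int batchSize) {
    return -1; // real code would issue the request here
  }

  public static void scan(long minId, long maxId, int workers)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    long span = Math.max(1, (maxId - minId + workers) / workers); // ~ceil
    for (long start = minId; start <= maxId; start += span) {
      final long s = start;
      final long e = Math.min(start + span - 1, maxId);
      pool.submit(() -> {
        long next = s;
        while (next <= e) {
          long lastId = fetchBatch(next, e, 1000);
          if (lastId < 0) break; // this worker's sub-range is done
          next = lastId + 1;
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}
{code}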

It is also possible to wrap this in a way similar to {{PartitionIterator}} and 
async-fetch the next batch while we are processing the current one, as in the 
sketch below.
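
A minimal sketch of such a wrapper; {{Fetcher}} and {{Batch}} are illustrative 
stand-ins for the batched call described above, not existing Hive classes:

{code:java}
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PrefetchingIterator<T> implements Iterator<T> {
  /** One reply: the objects plus the last ID, so the next request starts at lastId + 1. */
  public static final class Batch<T> {
    final List<T> items;
    final long lastId;
    public Batch(List<T> items, long lastId) { this.items = items; this.lastId = lastId; }
  }

  public interface Fetcher<T> { Batch<T> fetch(long startId); }

  private final Fetcher<T> fetcher;
  private Iterator<T> current = Collections.emptyIterator();
  private CompletableFuture<Batch<T>> pending; // the batch being prefetched

  public PrefetchingIterator(Fetcher<T> fetcher, long startId) {
    this.fetcher = fetcher;
    this.pending = CompletableFuture.supplyAsync(() -> fetcher.fetch(startId));
  }

  @Override public boolean hasNext() {
    while (!current.hasNext()) {
      if (pending == null) return false;
      Batch<T> batch = pending.join();               // wait for prefetched batch
      if (batch.items.isEmpty()) { pending = null; return false; }
      final long nextStart = batch.lastId + 1;
      pending = CompletableFuture.supplyAsync(() -> fetcher.fetch(nextStart));
      current = batch.items.iterator();              // consume while next batch loads
    }
    return true;
  }

  @Override public T next() { return current.next(); }
}
{code}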

*Consistency considerations*
HMS only provides consistency guarantees for a single call. The set of objects 
that should be returned may change while we are iterating over it. In some 
cases this is not an issue, since HS2 may use ZooKeeper locks on the table to 
prevent modifications, but in other cases it may be (for example, for calls 
that originate from external systems). We should consider additions and 
removals separately.

* New objects are added during iteration. New objects are always added at the 
'end' of the ID space, so they will always be picked up by the iterator (we 
assume that IDs are always incrementing).
* Some objects are removed during iteration. Removal of objects that have not 
yet been consumed is not a problem. It is possible, though, that some objects 
which were already consumed have since been removed, so they still appear in 
the result. Although this results in an inconsistent list of objects, the 
situation is indistinguishable from one where these objects were removed 
immediately after we got all of them in one atomic call, so it doesn't seem 
to be a practical issue.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
