[ https://issues.apache.org/jira/browse/HIVE-19719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Kolbasov updated HIVE-19719:
--------------------------------------
    Description: 
Hive Metastore provides APIs for fetching collections of objects (usually
tables or partitions). These APIs fetch all available objects at once, so the
size of the response is O(N), where N is the number of objects. Such calls
have several problems:

* All objects (and there may be thousands or even millions) must be fetched
from the database, serialized into a Java list of Thrift objects, and then
serialized into a byte array for sending over the network. This creates huge
spikes of memory pressure, especially since in some cases multiple copies of
the same data are present in memory (e.g. the unserialized and serialized
versions).
* Even though HMS tries to avoid string duplication by interning strings in
Java, duplicated strings must still be serialized into the output array.
* Java has a 2 GB limit on the maximum size of a byte array, and the server
fails with an Out of Memory exception if this size is exceeded.
* Fetching a huge number of objects blows up DB caches and memory caches in
the system. Receiving such huge messages also creates memory pressure on the
receiver side (usually HS2), which can cause it to crash with an Out of Memory
exception as well.
* Such requests have very high latency, since the server must collect all
objects, serialize them, and send them over the network before the client can
do anything with the result.

To prevent Out of Memory exceptions, the server now has a configurable limit
on the maximum number of objects returned. This helps to avoid crashes, but
doesn't allow for correct query execution, since the result will be an
arbitrary, incomplete set of at most K objects, where K is the configured
limit.

Currently this is addressed on the client side by simulating batching: the
client first fetches the list of table or partition names and then requests
the full objects for slices of that list. Still, the list of names can be big
as well, and this method requires locking to ensure that objects are not added
or removed between the calls, especially if this is done outside of HS2.
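
For reference, a minimal sketch of this client-side workaround using the
existing {{IMetaStoreClient}} calls (error handling and batch sizing
simplified):

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.thrift.TException;

class ClientSideBatching {
  // Fetch all partition names first, then request the actual objects
  // in fixed-size slices. Nothing prevents partitions from being added
  // or dropped between the two calls.
  static List<Partition> fetchAllPartitions(IMetaStoreClient client,
      String db, String table, int batchSize) throws TException {
    List<String> names = client.listPartitionNames(db, table, (short) -1);
    List<Partition> result = new ArrayList<>(names.size());
    for (int i = 0; i < names.size(); i += batchSize) {
      List<String> slice =
          names.subList(i, Math.min(i + batchSize, names.size()));
      result.addAll(client.getPartitionsByNames(db, table, slice));
    }
    return result;
  }
}
{code}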

Instead, we can make a simple modification to the existing APIs that allows
for batch, iterator-style operations without keeping any server-side state.
The main idea is to have a unique, incrementing ID for each object. The IDs
only need to be unique within their container (e.g. table IDs should be unique
within a database and partition IDs should be unique within a table).
Such an ID can easily be generated using the database auto-increment
mechanism, or we can simply reuse the existing ID column that is already
maintained by DataNucleus.
The request is then modified to include:

* The starting ID (i0)
* The batch size (B)

The server fetches up to B objects with IDs starting from i0, serializes them
and sends them to the client. The client then requests the next batch using
the ID of the last received object plus one. It is possible to construct an
SQL query (either via DataNucleus JDOQL or in DirectSQL code) that selects
only the needed objects, avoiding big reads from the database. The client
iterates until it has fetched all the objects, and the memory required by each
request is bounded by the batch size.
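
A minimal sketch of the client-side loop, assuming a hypothetical
{{getPartitionsBatch}} call that returns objects ordered by ID together with
the largest ID in the batch (the names and response shape below are
illustrative, not an existing API):

{code:java}
import java.util.List;

// Hypothetical API mirroring the proposal; none of these names exist yet.
interface BatchingClient {
  // Returns up to batchSize partitions with id >= startId, ordered by id.
  // Server-side this maps to a bounded query such as
  //   SELECT ... FROM PARTITIONS WHERE TBL_ID = ? AND PART_ID >= ?
  //   ORDER BY PART_ID LIMIT ?
  // so only B rows are ever read per call.
  PartitionBatch getPartitionsBatch(String db, String table,
      long startId, int batchSize) throws Exception;
}

class PartitionBatch {
  List<Object> partitions;  // stand-in for List<Partition>
  long maxId;               // largest ID present in this batch
}

class BatchScanner {
  // Iterate over all partitions while holding at most one batch in
  // memory; the server keeps no state between calls.
  static void scan(BatchingClient client, String db, String table,
      int batchSize) throws Exception {
    long nextId = 0;  // assume IDs are non-negative and incrementing
    while (true) {
      PartitionBatch batch =
          client.getPartitionsBatch(db, table, nextId, batchSize);
      // ... process batch.partitions here ...
      if (batch.partitions.size() < batchSize) {
        break;  // a short batch means the end of the ID space
      }
      nextId = batch.maxId + 1;  // resume right after the last seen ID
    }
  }
}
{code}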

If we extend the API a little, providing a way to get the minimum and maximum
ID values (either via a separate call or piggybacked onto the normal reply),
clients can request such batches concurrently, which also reduces latency.
Clients can easily estimate the number of batches by knowing the total range
of IDs. While this isn't a precise method, it is good enough to divide the
work.
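
A sketch of the concurrent variant, reusing the hypothetical
{{BatchingClient}} above and assuming the min/max IDs have already been
obtained:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelScan {
  // Split [minId, maxId] into one contiguous ID range per thread and
  // scan the ranges concurrently. Gaps in the ID space make the split
  // imprecise, but it is good enough to divide the work.
  static void scan(BatchingClient client, String db, String table,
      long minId, long maxId, int threads, int batchSize) throws Exception {
    long span = (maxId - minId) / threads + 1;
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<Void>> futures = new ArrayList<>();
    for (int t = 0; t < threads; t++) {
      final long lo = minId + t * span;
      final long hi = Math.min(lo + span - 1, maxId);
      futures.add(pool.submit((Callable<Void>) () -> {
        long nextId = lo;
        while (nextId <= hi) {
          PartitionBatch batch =
              client.getPartitionsBatch(db, table, nextId, batchSize);
          // ... process only the partitions with id <= hi; the last
          // batch of a range may spill past hi and must be trimmed ...
          if (batch.partitions.size() < batchSize || batch.maxId >= hi) {
            break;  // range exhausted
          }
          nextId = batch.maxId + 1;
        }
        return null;
      }));
    }
    for (Future<Void> f : futures) {
      f.get();  // propagate any worker failure
    }
    pool.shutdown();
  }
}
{code}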

It is also possible to wrap this in a way similar to {{PartitionIterator}} and
asynchronously fetch the next batch while the current batch is being
processed.
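
A sketch of such a wrapper, again using the hypothetical {{BatchingClient}}:
it keeps one in-flight request for batch N+1 while the caller consumes
batch N.

{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;

// Iterator that fetches the next batch in the background while the
// caller consumes the current one, hiding most of the per-batch
// round-trip latency.
class PrefetchingPartitionIterator implements Iterator<Object> {
  private final BatchingClient client;
  private final String db, table;
  private final int batchSize;
  private CompletableFuture<PartitionBatch> pending;
  private List<Object> current;
  private int pos;
  private boolean exhausted;

  PrefetchingPartitionIterator(BatchingClient client, String db,
      String table, int batchSize) {
    this.client = client;
    this.db = db;
    this.table = table;
    this.batchSize = batchSize;
    this.pending = fetch(0);  // kick off the first batch immediately
  }

  private CompletableFuture<PartitionBatch> fetch(long startId) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        return client.getPartitionsBatch(db, table, startId, batchSize);
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
  }

  @Override
  public boolean hasNext() {
    while (!exhausted && (current == null || pos >= current.size())) {
      PartitionBatch batch = pending.join();  // wait for prefetched batch
      current = batch.partitions;
      pos = 0;
      if (current.size() < batchSize) {
        exhausted = true;  // short batch: end of the ID space
      } else {
        pending = fetch(batch.maxId + 1);  // prefetch the next batch
      }
    }
    return pos < current.size();
  }

  @Override
  public Object next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.get(pos++);
  }
}
{code}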

*Consistency considerations*

* HMS only provides consistency guarantees for a single call. The set of
objects that should be returned may change while we are iterating over it. In
some cases this is not an issue, since HS2 may use ZooKeeper locks on the
table to prevent modifications, but in other cases it may be (for example, for
calls that originate from external systems). We should consider additions and
removals separately.
* New objects are added during iteration. All new objects are always added at
the 'end' of the ID space, so they will always be picked up by the iterator.
We assume that IDs are always incrementing.
* Some objects are removed during iteration. Removal of objects that have not
yet been consumed is not a problem. However, objects that were already
consumed may be removed, so they still appear in the result. Although this
results in an inconsistent list of objects, the situation is indistinguishable
from one where these objects were removed immediately after we got all objects
in one atomic call, so it doesn't seem to be a practical issue.


> Adding metastore batch API for partitions
> -----------------------------------------
>
>                 Key: HIVE-19719
>                 URL: https://issues.apache.org/jira/browse/HIVE-19719
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 3.1.0, 4.0.0
>            Reporter: Alexander Kolbasov
>            Priority: Major
>


