Kadir Ozdemir created PHOENIX-7018:
--------------------------------------

             Summary: Server side index maintainer caching for read, write, and 
replication
                 Key: PHOENIX-7018
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7018
             Project: Phoenix
          Issue Type: Improvement
            Reporter: Kadir Ozdemir


The relationship between a data table and its index is somewhat involved. Phoenix 
needs to transform a data table row into the corresponding index row, extract a 
data table row key (i.e., a primary key) from an index table row key (a 
secondary key), and map data table columns to index table included columns. The 
metadata for these operations, and the operations themselves, are encapsulated 
in the IndexMaintainer class. Phoenix creates a separate IndexMaintainer object 
in memory on the client side for each index table. IndexMaintainer objects are 
then serialized using the protobuf library and sent to servers along with the 
mutations on the data tables and the scans on the index tables. The Phoenix 
server code (more precisely, Phoenix coprocessors) then uses IndexMaintainer 
objects to update indexes and to leverage indexes for queries.
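These transformations can be illustrated with a small, self-contained sketch. The 
class below is purely hypothetical and ignores everything the real IndexMaintainer 
handles (column encoding, salting, covered column families, and so on); it only 
shows how a secondary key can be built from the indexed column value plus the 
primary key, and how the primary key can be extracted back out of it.

{code:java}
import java.util.Map;
import java.util.TreeMap;

public class IndexTransformSketch {

    // Hypothetical data table: primary key = id, columns = {v1, v2}; index on v1
    // including v2. The secondary key is the indexed column value followed by the
    // data table row key, so the data row key can always be recovered from it.
    static String indexRowKey(String dataRowKey, String v1) {
        return v1 + "\u0000" + dataRowKey;
    }

    static String dataRowKeyFromIndexRowKey(String indexRowKey) {
        return indexRowKey.substring(indexRowKey.indexOf('\u0000') + 1);
    }

    public static void main(String[] args) {
        // One data row: id = "row1", v1 = "apple", v2 = "red".
        Map<String, String[]> dataTable = new TreeMap<>();
        dataTable.put("row1", new String[] {"apple", "red"});

        // "Maintain" the index: map each data row to its index row; the included
        // column v2 is copied into the index row.
        Map<String, String> indexTable = new TreeMap<>();
        for (Map.Entry<String, String[]> e : dataTable.entrySet()) {
            indexTable.put(indexRowKey(e.getKey(), e.getValue()[0]), e.getValue()[1]);
        }

        System.out.println(indexTable);
        // Extract the primary key back out of the secondary key.
        System.out.println(dataRowKeyFromIndexRowKey(indexRowKey("row1", "apple")));
    }
}
{code}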

Phoenix coprocessors use the IndexMaintainer objects associated with a given 
batch of mutations or scan operation only once (i.e., for that batch or scan), 
yet the Phoenix client sends these objects along with every batch of mutations 
and every scan.

Secondary indexes are used to improve the performance of queries on the 
secondary index columns. Secondary indexes are required to be consistent with 
their data tables. Consistency here means that, regardless of whether a query 
is served from the data table or an index table, the same result is returned. 
This consistency promise cannot be kept when data table rows and their index 
table rows are replicated independently, which is what happens with HBase-level 
replication.

HBase replicates the WALs (Write Ahead Logs) of region servers and replays 
these WALs at the destination cluster. A given data table row and the 
corresponding index table row are likely served by different region servers, 
and the WALs of these region servers are replicated independently. This means 
these rows can arrive at different times, which makes the data and index tables 
inconsistent at the destination cluster.

Replicating global indexes also leads to inefficient use of the replication 
bandwidth, because it replicates data that can be derived from data that has 
already been replicated. When one considers that an index table is essentially 
a copy of its data table minus the columns that are not included in the index, 
and that a given data table can have multiple indexes, it is easy to see that 
replicating indexes can easily double the replication bandwidth requirement for 
a given data table.

A solution for eliminating index table replication is to add just enough 
metadata to the WAL records for mutations on data tables with indexes, and to 
have a replication endpoint and a coprocessor endpoint generate index mutations 
from these records; see PHOENIX-5315. This document extends that solution not 
only to eliminate replicating index tables but also to improve the read and 
write paths for index tables by maintaining a consistent server side cache of 
index maintainers.

The idea behind the proposed solution is to cache the index maintainers on the 
server side and thus eliminate transferring index maintainers during reads and 
writes as well as replication. The coprocessors that currently require the 
index maintainers are IndexRegionObserver on the write path and some other 
coprocessors, including GlobalIndexChecker for read repair, on the read path.

The proposed solution leverages the existing capability of adding 
IndexMaintainer objects to the server side cache implemented by 
ServerCachingEndpointImpl. The design eliminates global index table replication 
and also eliminates updating the server side cache with IndexMaintainer objects 
for each batch write.

IndexRegionObserver (the coprocessor that generates index mutations from data 
table mutations) needs access to the IndexMaintainer objects for the indexes on 
a table or view. Metadata transferred as a mutation attribute will be used to 
identify the table or view to which a mutation applies. The metadata will 
include the tenant id, table schema, and table name. The cache key for the 
array of index maintainers for this table or view will be formed from this 
metadata. When IndexRegionObserver intercepts a mutation on an HBase table 
(using the preBatchMutate coprocessor hook), it will form the cache key for the 
array of index maintainers and retrieve the array from the server cache.
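The sketch below shows, under hypothetical names, what this lookup could look like 
inside the preBatchMutate hook. The attribute name, the placeholder maintainer type, 
and the loader are illustrative stand-ins, and a plain concurrent map stands in for 
the ServerCachingEndpointImpl-backed cache; none of this is the actual Phoenix 
implementation.

{code:java}
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.RegionObserver;
import org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexMaintainerCacheSketch implements RegionCoprocessor, RegionObserver {

    // Hypothetical mutation attribute carrying "tenantId \0 schema \0 tableName".
    private static final String TABLE_IDENTITY_ATTR = "PHOENIX_TABLE_IDENTITY";

    // Placeholder for the real org.apache.phoenix.index.IndexMaintainer.
    static final class PlaceholderIndexMaintainer { }

    // Region-server-wide cache: table/view identity -> index maintainers for it.
    private static final ConcurrentMap<String, List<PlaceholderIndexMaintainer>> CACHE =
            new ConcurrentHashMap<>();

    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void preBatchMutate(ObserverContext<RegionCoprocessorEnvironment> ctx,
                               MiniBatchOperationInProgress<Mutation> miniBatchOp)
            throws IOException {
        for (int i = 0; i < miniBatchOp.size(); i++) {
            Mutation m = miniBatchOp.getOperation(i);
            byte[] identity = m.getAttribute(TABLE_IDENTITY_ATTR);
            if (identity == null) {
                continue; // not a Phoenix data table mutation with indexes
            }
            String cacheKey = Bytes.toString(identity);
            // On a miss the maintainers would be rebuilt from metadata; only the
            // lookup is shown here.
            List<PlaceholderIndexMaintainer> maintainers = CACHE.computeIfAbsent(
                    cacheKey, IndexMaintainerCacheSketch::loadIndexMaintainers);
            // ... generate index mutations from 'm' using 'maintainers' ...
        }
    }

    // Hypothetical miss handler; the proposal rebuilds the array from metadata
    // (via MetaDataEndpointImpl / the Phoenix client library) on a cache miss.
    private static List<PlaceholderIndexMaintainer> loadIndexMaintainers(String cacheKey) {
        return Collections.singletonList(new PlaceholderIndexMaintainer());
    }
}
{code}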

This design requires maintaining metadata caches at the region servers. These 
caches need to be consistent, that is, they should not hold stale metadata. To 
ensure this, when MetaDataEndpointImpl updates index metadata, it will first 
invalidate the index maintainer caches. If the invalidation fails, then the 
metadata operation fails and a failure response is returned to the client.
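The invalidate-first rule can be summarized with the following sketch, in which the 
per-region-server cache handle and the error handling are hypothetical stand-ins:

{code:java}
import java.io.IOException;
import java.util.List;

public class MetadataUpdateSketch {

    // Hypothetical handle used to invalidate one region server's index maintainer cache.
    interface IndexMaintainerCacheClient {
        boolean invalidate(String cacheKey) throws IOException;
    }

    // Invalidate first; only if every relevant cache was invalidated is the metadata
    // change applied. Otherwise the operation fails and the client gets a failure
    // response, so no server can keep serving stale maintainers.
    static void updateIndexMetadata(List<IndexMaintainerCacheClient> regionServers,
                                    String cacheKey,
                                    Runnable applyMetadataChange) throws IOException {
        for (IndexMaintainerCacheClient rs : regionServers) {
            if (!rs.invalidate(cacheKey)) {
                throw new IOException(
                        "Failed to invalidate index maintainer cache for " + cacheKey);
            }
        }
        applyMetadataChange.run();
    }
}
{code}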

It is important to note that MetaDataEndpointImpl needs to use locking to 
serialize read and write operations on the metadata. After the caches are 
invalidated, the Phoenix coprocessors will attempt to retrieve the 
IndexMaintainer objects from MetaDataEndpointImpl for the next mutation or scan 
operation. This retrieval operation has to wait for the ongoing metadata update 
transaction to complete.
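A minimal sketch of this serialization is shown below, assuming a 
ReentrantReadWriteLock as a stand-in for whatever locking MetaDataEndpointImpl 
actually uses: metadata updates hold the write lock, and rebuilding the index 
maintainers takes the read lock and therefore waits for an in-flight update.

{code:java}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class MetadataLockSketch {

    private final ReadWriteLock metadataLock = new ReentrantReadWriteLock();

    // Metadata (DDL) change: holds the write lock while the caches are invalidated
    // and the metadata is updated.
    public void updateMetadata(Runnable invalidateCachesThenApply) {
        metadataLock.writeLock().lock();
        try {
            invalidateCachesThenApply.run();
        } finally {
            metadataLock.writeLock().unlock();
        }
    }

    // Rebuilding the index maintainers after an invalidation takes the read lock, so
    // it blocks here until any in-flight metadata update completes.
    public <T> T readMetadata(Supplier<T> buildIndexMaintainers) {
        metadataLock.readLock().lock();
        try {
            return buildIndexMaintainers.get();
        } finally {
            metadataLock.readLock().unlock();
        }
    }
}
{code}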

For every query, including point lookup queries, the Phoenix client currently 
serializes an IndexMaintainer object, attaches it to the scan object as a scan 
attribute, and Phoenix coprocessors then deserialize it from the scan object. 
To eliminate this serialization/deserialization and save the network bandwidth 
spent on IndexMaintainer objects, the Phoenix client can instead pass the cache 
key for the array of IndexMaintainer objects and the name of the index (rather 
than the IndexMaintainer object itself), and the coprocessor can retrieve the 
array of IndexMaintainer objects from its server cache and identify the one for 
the given index. If the array of IndexMaintainer objects is not in the cache, 
the coprocessor, using the Phoenix client library, can construct the array of 
IndexMaintainer objects and populate the server cache with it.
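The proposed read path could look like the sketch below, again with hypothetical 
attribute names, a placeholder maintainer type, and a plain map standing in for the 
server cache:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanAttributeSketch {

    // Placeholder for the real org.apache.phoenix.index.IndexMaintainer.
    static final class PlaceholderIndexMaintainer {
        final String indexName;
        PlaceholderIndexMaintainer(String indexName) { this.indexName = indexName; }
    }

    // Hypothetical scan attributes set by the Phoenix client instead of the
    // serialized IndexMaintainer itself.
    static final String CACHE_KEY_ATTR = "PHOENIX_IM_CACHE_KEY";
    static final String INDEX_NAME_ATTR = "PHOENIX_INDEX_NAME";

    // Region-server-side cache of index maintainer arrays, keyed by table identity.
    static final Map<String, List<PlaceholderIndexMaintainer>> CACHE =
            new ConcurrentHashMap<>();

    // Client side: tag the scan with the cache key and the index name only.
    static Scan tagScan(Scan scan, String cacheKey, String indexName) {
        scan.setAttribute(CACHE_KEY_ATTR, Bytes.toBytes(cacheKey));
        scan.setAttribute(INDEX_NAME_ATTR, Bytes.toBytes(indexName));
        return scan;
    }

    // Server side (e.g., in GlobalIndexChecker): resolve the maintainer for the index.
    static PlaceholderIndexMaintainer resolve(Scan scan) {
        String cacheKey = Bytes.toString(scan.getAttribute(CACHE_KEY_ATTR));
        String indexName = Bytes.toString(scan.getAttribute(INDEX_NAME_ATTR));
        List<PlaceholderIndexMaintainer> maintainers =
                CACHE.computeIfAbsent(cacheKey, ScanAttributeSketch::rebuildFromMetadata);
        return maintainers.stream()
                .filter(m -> m.indexName.equals(indexName))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("Unknown index " + indexName));
    }

    // Hypothetical miss handler: rebuild the array using the Phoenix client library
    // and repopulate the cache.
    static List<PlaceholderIndexMaintainer> rebuildFromMetadata(String cacheKey) {
        return Collections.singletonList(new PlaceholderIndexMaintainer("EXAMPLE_INDEX"));
    }
}
{code}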

A mutation or batch of mutations on a data table requires updating all the 
indexes on that data table. For the cache to be efficient, we need the 
IndexMaintainer objects for all of these indexes. This is why this design 
caches the array of IndexMaintainer objects for a given table or view instead 
of caching individual IndexMaintainer objects.

This design achieves cache coherency, which impacts the availability of 
metadata operations. For a metadata operation to succeed, MetaDataEndpointImpl 
must be able to invalidate the index maintainer caches on the region servers 
first, so that Phoenix coprocessors retrieve the most recent metadata from 
MetaDataEndpointImpl the next time they need it. Depending on how table regions 
are distributed over region servers, a subset or all of the server caches may 
need to be invalidated for a given piece of metadata.

It is important to note that overall cluster availability is not impacted 
significantly, as metadata operations are rare compared to the read/write 
operations on user data, both in number and in frequency.


