Kadir Ozdemir created PHOENIX-7018:
--------------------------------------
Summary: Server side index maintainer caching for read, write, and
replication
Key: PHOENIX-7018
URL: https://issues.apache.org/jira/browse/PHOENIX-7018
Project: Phoenix
Issue Type: Improvement
Reporter: Kadir Ozdemir
The relationship between a data table and index is somewhat involved. Phoenix
needs to transform a data table row to the corresponding index row, extract a
data table row key (i.e., a primary key) from an index table row key (a
secondary key), and map data table columns to index table included columns. The
metadata for these operations, and the operations themselves, are encapsulated
in the IndexMaintainer class. Phoenix creates a separate in-memory
IndexMaintainer object on the client side for each index table. IndexMaintainer
objects are then
serialized using the protobuf library and sent to servers along with the
mutations on the data tables and scans on the index tables. The Phoenix server
code (more accurately Phoenix coprocessors) then uses IndexMaintainer objects
to update indexes and leverage indexes for queries.
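For illustration, the current pattern looks roughly like the sketch below. It
only uses the generic HBase client API; the attribute name and the serialized
payload are placeholders, not the actual Phoenix identifiers.
{code:java}
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the current per-request transfer of index metadata. The attribute
// name and payload are placeholders; the real names live in the Phoenix client.
public class CurrentIndexMetadataTransferSketch {

    private static final String INDEX_MD_ATTRIB = "IndexMaintainers"; // assumed name

    // The client attaches the serialized IndexMaintainer list to every
    // data table mutation it sends to the server.
    static Put attachToMutation(Put dataTablePut, byte[] serializedMaintainers) {
        dataTablePut.setAttribute(INDEX_MD_ATTRIB, serializedMaintainers);
        return dataTablePut;
    }

    // The client also attaches the serialized IndexMaintainer to every scan
    // over the index table so coprocessors can deserialize it server side.
    static Scan attachToScan(Scan indexScan, byte[] serializedMaintainer) {
        indexScan.setAttribute(INDEX_MD_ATTRIB, serializedMaintainer);
        return indexScan;
    }

    public static void main(String[] args) {
        byte[] fakeProtobufPayload = Bytes.toBytes("serialized-index-maintainers");
        attachToMutation(new Put(Bytes.toBytes("row1")), fakeProtobufPayload);
        attachToScan(new Scan(), fakeProtobufPayload);
    }
}
{code}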
Phoenix coprocessors use the IndexMaintainer objects associated with a given
batch of mutations or scan operation only once (i.e., for that batch or scan),
yet the Phoenix client sends these objects along with every batch of mutations
and every scan.
Secondary indexes are used to improve the performance of queries on the
secondary index columns, and they are required to be consistent with their data
tables. Consistency here means that the same result is returned regardless of
whether a query is served from the data table or the index table. This
consistency promise cannot be kept when data table rows and their index table
rows are replicated independently, which is what happens with HBase-level
replication.
HBase replicates the WALs (Write Ahead Logs) of region servers and replays
these WALs at the destination cluster. A given data table row and the
corresponding
index table row are likely served by different region servers and the WALs of
these region servers are replicated independently. This means these rows can
arrive at different times which makes data and index tables inconsistent at the
destination cluster.
Replicating global indexes leads to inefficient use of replication bandwidth
because of the overhead of replicating data that can be derived from data that
has already been replicated. An index table is essentially a copy of its data
table without the columns that are not included in the index, and a given data
table can have multiple indexes, so replicating indexes can easily double the
replication bandwidth requirement for a given data table.
A solution for eliminating index table replication is to add just enough
metadata to the WAL records for mutations on data tables with indexes, and to
have a replication endpoint and a coprocessor endpoint generate index mutations
from these records (see PHOENIX-5315). This document extends that solution not
only to eliminate index table replication but also to improve the read and
write paths for index tables by maintaining consistent server-side caching of
index maintainers.
The idea behind the proposed solution is to cache the index maintainers on the
server side and thus eliminate transferring index maintainers during reads and
writes as well as replication. The coprocessors that currently require the
index maintainers are IndexRegionObserver on the write path and, on the read
path, coprocessors such as GlobalIndexChecker for read repair.
The proposed solution leverages the existing capability of adding
IndexMaintainer objects to the server-side cache implemented by
ServerCachingEndpointImpl. The design eliminates global index table replication
and also eliminates updating the server-side cache with IndexMaintainer objects
for each batch write.
IndexRegionObserver (the coprocessor that generates index mutations from data
table mutations) needs to access the IndexMaintainer objects for the indexes on
a table or view. The metadata transferred as a mutation attribute will be used
to identify the table or view to which a mutation belongs. This metadata will
include the tenant ID, table schema, and table name. The cache key for the
array of index maintainers for this table or view will be formed from this
metadata. When IndexRegionObserver intercepts a mutation on an HBase table
(using the preBatchMutate coprocessor hook), it will form the cache key for the
array of index maintainers and retrieve the array from the server cache.
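A minimal sketch of this lookup is shown below, assuming a plain in-memory map
in place of the real region server cache and a placeholder IndexMaintainer
type; the actual key encoding and the cache behind ServerCachingEndpointImpl
may differ.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the proposed server-side lookup keyed by (tenant ID, schema, table).
public class IndexMaintainerCacheSketch {

    static class IndexMaintainer { /* placeholder for the real Phoenix class */ }

    // Stand-in for the per-region-server cache of index maintainer arrays.
    private final ConcurrentMap<String, List<IndexMaintainer>> cache =
        new ConcurrentHashMap<>();

    // Cache key formed from the metadata carried as a mutation attribute.
    static String cacheKey(String tenantId, String schemaName, String tableName) {
        return (tenantId == null ? "" : tenantId) + '\0' + schemaName + '\0' + tableName;
    }

    // Called from a hook like preBatchMutate: resolve the maintainers for the
    // table or view the mutation belongs to, or null on a cache miss.
    List<IndexMaintainer> lookup(byte[] tenantId, byte[] schema, byte[] table) {
        String key = cacheKey(
            tenantId == null ? null : new String(tenantId, StandardCharsets.UTF_8),
            new String(schema, StandardCharsets.UTF_8),
            new String(table, StandardCharsets.UTF_8));
        return cache.get(key);
    }
}
{code}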
This design requires maintaining metadata caches at region servers. These
caches need to be consistent, that is, they must not hold stale metadata. To
ensure this, when MetaDataEndpointImpl updates index metadata, it will first
invalidate the index maintainer caches. If the invalidation fails, then the
metadata operation fails and the failure response is returned to the client.
It is important to note that MetaDataEndpointImpl needs to use locking to
serialize read and write operations on the metadata. After the caches are
invalidated, the Phoenix coprocessors will attempt to retrieve the
IndexMaintainer objects from MetaDataEndpointImpl for the next mutation or scan
operation. This retrieval has to wait for the ongoing metadata update
transaction to complete.
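A minimal sketch of this invalidate-then-update ordering and of readers waiting
on the metadata lock is shown below; the cross-server invalidation call is
hypothetical and the real locking in MetaDataEndpointImpl is more involved.
{code:java}
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of invalidating the index maintainer caches before a metadata update.
public class MetadataUpdateSketch {

    private final ConcurrentMap<String, List<Object>> maintainerCache =
        new ConcurrentHashMap<>();
    private final ReadWriteLock metadataLock = new ReentrantReadWriteLock();

    // Write side: invalidate the caches first; if invalidation fails, fail the
    // metadata operation and return the failure to the client.
    boolean updateIndexMetadata(String cacheKey, Runnable catalogUpdate) {
        metadataLock.writeLock().lock();
        try {
            if (!invalidateOnAllRegionServers(cacheKey)) {
                return false; // metadata operation fails
            }
            catalogUpdate.run(); // apply the actual metadata change
            return true;
        } finally {
            metadataLock.writeLock().unlock();
        }
    }

    // Read side: a coprocessor refilling its cache blocks here until any
    // in-flight metadata update transaction has completed.
    List<Object> readMaintainers(String cacheKey) {
        metadataLock.readLock().lock();
        try {
            return maintainerCache.get(cacheKey);
        } finally {
            metadataLock.readLock().unlock();
        }
    }

    // Hypothetical stand-in for the RPC that invalidates every region server cache.
    private boolean invalidateOnAllRegionServers(String cacheKey) {
        maintainerCache.remove(cacheKey);
        return true;
    }
}
{code}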
For every query, including point lookup queries, the Phoenix client currently
serializes an IndexMaintainer object, attaches it to the scan object as a scan
attribute, and Phoenix coprocessors then deserialize it from the scan object.
To eliminate this serialization/deserialization and save the network bandwidth
spent on IndexMaintainer, the Phoenix client can instead pass the cache key for
the array of IndexMaintainer objects and the name of the index (rather than the
IndexMaintainer object itself), and the coprocessor can retrieve the array of
IndexMaintainer objects from its server cache and identify the one for the
given index. If the array of IndexMaintainer objects is not in the cache, the
coprocessor, using the Phoenix client library, can construct the array of
IndexMaintainer objects and populate the server cache with it.
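A minimal sketch of this read-path lookup with rebuild-on-miss is shown below;
the scan attribute names and the rebuild hook that would use the Phoenix client
library are hypothetical.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of resolving a single IndexMaintainer from the cached array using the
// cache key and index name passed as scan attributes.
public class ReadPathLookupSketch {

    static class IndexMaintainer {
        final String indexName;
        IndexMaintainer(String indexName) { this.indexName = indexName; }
    }

    private final Map<String, List<IndexMaintainer>> cache = new ConcurrentHashMap<>();

    IndexMaintainer resolveForScan(String cacheKeyAttr, String indexNameAttr,
            Function<String, List<IndexMaintainer>> rebuildFromCatalog) {
        // On a miss, rebuild the whole array for the table or view (e.g. via the
        // Phoenix client library) and populate the server cache with it.
        List<IndexMaintainer> maintainers =
            cache.computeIfAbsent(cacheKeyAttr, rebuildFromCatalog);
        // Pick the maintainer for the index the scan actually targets.
        return maintainers.stream()
            .filter(m -> m.indexName.equals(indexNameAttr))
            .findFirst()
            .orElse(null);
    }
}
{code}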
A mutation or batch of mutations on a data table requires updating all the
indexes on that data table. For the cache to be efficient, we need to have the
IndexMaintainer objects for all of these indexes. This is the reason this
design chooses to cache the array of IndexMaintainer objects (for a given table
or view) instead of caching individual IndexMaintainer objects.
This design achieves cache coherency, which impacts the availability of
metadata operations. For a metadata operation to succeed, MetaDataEndpointImpl
should first be able to invalidate the index maintainer caches on the region
servers, so that Phoenix coprocessors retrieve the most recent metadata from
MetaDataEndpointImpl when they next need it. Depending on how table regions are
distributed over region servers, a subset or all of the server caches may need
to be invalidated for a given metadata operation.
It is important to note that overall cluster availability is not impacted
significantly, as metadata operations are rare compared to the read/write
operations on user data, both in number and in frequency.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)