[ https://issues.apache.org/jira/browse/PHOENIX-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kadir Ozdemir reassigned PHOENIX-7018:
--------------------------------------

    Assignee: Viraj Jasani

> Server side index maintainer caching for read, write, and replication
> ---------------------------------------------------------------------
>
>                 Key: PHOENIX-7018
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7018
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Kadir Ozdemir
>            Assignee: Viraj Jasani
>            Priority: Major
>
> The relationship between a data table and its index is somewhat involved. 
> Phoenix needs to transform a data table row into the corresponding index row, 
> extract a data table row key (i.e., the primary key) from an index table row 
> key (a secondary key), and map data table columns to the index table's 
> included columns. The metadata for these operations, and the operations 
> themselves, are encapsulated in the IndexMaintainer class. Phoenix creates a 
> separate IndexMaintainer object in memory on the client side for each index 
> table. IndexMaintainer objects are then serialized using the protobuf library 
> and sent to servers along with the mutations on the data tables and the scans 
> on the index tables. The Phoenix server code (more accurately, Phoenix 
> coprocessors) then uses these IndexMaintainer objects to update indexes and 
> to leverage indexes for queries. Phoenix coprocessors use the IndexMaintainer 
> objects associated with a given batch of mutations or scan only once (i.e., 
> for that batch or scan), yet the Phoenix client sends these objects along 
> with every batch of mutations and every scan.
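>
> As a minimal illustration of the current pattern (the attribute name and the 
> helper below are hypothetical; the real constants and serialization logic 
> live in the Phoenix client and in IndexMaintainer), the serialized 
> maintainers ride along as an operation attribute on every write and every 
> scan:
> {code:java}
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Scan;
>
> public class IndexMetadataAttachmentSketch {
>     // Hypothetical attribute key; Phoenix defines its own constant for this.
>     private static final String INDEX_MD_ATTR = "_IndexMaintainers";
>
>     // The client attaches the protobuf-serialized maintainers to every
>     // mutation (write path) and every scan (read path).
>     public static void attach(Put put, Scan scan, byte[] serializedMaintainers) {
>         put.setAttribute(INDEX_MD_ATTR, serializedMaintainers);
>         scan.setAttribute(INDEX_MD_ATTR, serializedMaintainers);
>     }
> }
> {code}
>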
> Secondary indexes are used to improve the performance of queries on the 
> secondary index columns, and they are required to be consistent with their 
> data tables. Consistency here means that the same result is returned 
> regardless of whether a query is served from the data table or from an index 
> table. This consistency promise cannot be kept when data table rows and their 
> index table rows are replicated independently, which is what happens with 
> HBase-level replication. 
>
> HBase replicates the WALs (Write-Ahead Logs) of region servers and replays 
> these WALs at the destination cluster. A given data table row and the 
> corresponding index table row are likely served by different region servers, 
> and the WALs of these region servers are replicated independently. This means 
> the rows can arrive at different times, which makes the data and index tables 
> inconsistent at the destination cluster.
>
> Replicating global indexes also leads to inefficient use of the replication 
> bandwidth, because of the overhead of replicating data that can be derived 
> from data that has already been replicated. Considering that an index table 
> is essentially a copy of its data table without the columns that are not 
> included in the index, and that a given data table can have multiple indexes, 
> it is easy to see that replicating indexes can double the replication 
> bandwidth requirement for a given data table.
>
> A solution for eliminating index table replication is to add just enough 
> metadata to the WAL records for mutations on data tables with indexes, and to 
> have a replication endpoint and a coprocessor endpoint generate index 
> mutations from these records; please see PHOENIX-5315. This document extends 
> that solution not only to eliminate replicating index tables but also to 
> improve the read and write paths for index tables by maintaining a consistent 
> server-side cache of index maintainers.
>
> The idea behind the proposed solution is to cache the index maintainers on 
> the server side and thus eliminate transferring them during reads, writes, 
> and replication. The coprocessors that currently require index maintainers 
> are IndexRegionObserver on the write path and several coprocessors on the 
> read path, including GlobalIndexChecker for read repair. The proposed 
> solution leverages the existing capability of adding IndexMaintainer objects 
> to the server-side cache implemented by ServerCachingEndpointImpl. The design 
> eliminates global index table replication and also eliminates updating the 
> server-side cache with IndexMaintainer objects for each batch write.
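>
> A minimal sketch of what such a per-region-server cache could look like (the 
> class and method names are hypothetical; the actual implementation would 
> build on the existing cache behind ServerCachingEndpointImpl):
> {code:java}
> import java.nio.ByteBuffer;
> import java.util.List;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical per-region-server cache of index maintainers, keyed by the
> // metadata-derived cache key (tenant id + schema name + table name).
> public class IndexMaintainerCacheSketch<M> {
>     private final ConcurrentHashMap<ByteBuffer, List<M>> cache = new ConcurrentHashMap<>();
>
>     public List<M> get(byte[] cacheKey) {
>         return cache.get(ByteBuffer.wrap(cacheKey));
>     }
>
>     public void put(byte[] cacheKey, List<M> maintainers) {
>         cache.put(ByteBuffer.wrap(cacheKey), maintainers);
>     }
>
>     // Called when MetaDataEndpointImpl changes the index metadata for a table or view.
>     public void invalidate(byte[] cacheKey) {
>         cache.remove(ByteBuffer.wrap(cacheKey));
>     }
> }
> {code}
>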
> IndexRegionObserver (the coprocessor that generates index mutations from data 
> table mutations) needs access to the IndexMaintainer objects for the indexes 
> on a table or view. Metadata transferred as a mutation attribute will be used 
> to identify the table or view that a mutation belongs to; this metadata will 
> include the tenant id, table schema, and table name. The cache key for the 
> array of index maintainers for this table or view will be formed from this 
> metadata. When IndexRegionObserver intercepts a mutation on an HBase table 
> (using the preBatchMutate coprocessor hook), it will form the cache key for 
> the array of index maintainers and retrieve the array from the server cache.
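>
> As a sketch of the key derivation (the separator and the null-tenant handling 
> below are assumptions, not the actual Phoenix encoding), the key is simply a 
> deterministic concatenation of the metadata carried on the mutation:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> // Hypothetical cache key derivation from the mutation-attribute metadata.
> public class IndexMaintainerCacheKeySketch {
>     public static byte[] cacheKey(String tenantId, String schemaName, String tableName) {
>         // A null tenant id (global tables) is mapped to the empty string.
>         String tenant = (tenantId == null) ? "" : tenantId;
>         // '#' is used here only as a placeholder separator for the sketch.
>         return String.join("#", tenant, schemaName, tableName)
>                 .getBytes(StandardCharsets.UTF_8);
>     }
> }
> {code}
> In preBatchMutate, IndexRegionObserver would read the tenant id, schema, and 
> table name from the mutation attribute, build this key, and look up the array 
> of index maintainers in the server cache.
>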
> This design requires maintaining metadata caches at the region servers. These 
> caches need to be consistent, that is, they should not hold stale metadata. 
> To ensure this, when MetaDataEndpointImpl updates index metadata, it will 
> first invalidate the index maintainer caches. If the invalidation fails, the 
> metadata operation fails and the failure response is returned to the client.
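>
> A sketch of this invalidate-before-update rule (the interface names are 
> hypothetical): the metadata write proceeds only if every affected region 
> server acknowledged invalidating its index maintainer cache.
> {code:java}
> import java.io.IOException;
>
> public class MetadataUpdateSketch {
>     /** Returns true only if all affected region servers confirmed the invalidation. */
>     interface CacheInvalidator {
>         boolean invalidateIndexMaintainers(byte[] cacheKey);
>     }
>
>     /** Applies the metadata change to SYSTEM.CATALOG. */
>     interface MetadataWriter {
>         void applyUpdate() throws IOException;
>     }
>
>     public static void updateIndexMetadata(CacheInvalidator invalidator,
>                                            MetadataWriter writer,
>                                            byte[] cacheKey) throws IOException {
>         if (!invalidator.invalidateIndexMaintainers(cacheKey)) {
>             // Invalidation failed: fail the metadata operation and return the
>             // failure to the client instead of applying the update.
>             throw new IOException("Failed to invalidate index maintainer caches");
>         }
>         writer.applyUpdate();  // Only now is the metadata change applied.
>     }
> }
> {code}
>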
> It is important to note that MetaDataEndpointImpl needs to use locking to 
> serialize read and write operations on the metadata. After the caches are 
> invalidated, the Phoenix coprocessors will attempt to retrieve the 
> IndexMaintainer objects from MetaDataEndpointImpl for the next mutation or 
> scan operation. This retrieval has to wait for the ongoing metadata update 
> transaction to complete.
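>
> A sketch of that serialization using a read/write lock (the locking 
> granularity and the names are assumptions; Phoenix has its own locking 
> machinery inside MetaDataEndpointImpl):
> {code:java}
> import java.util.List;
> import java.util.concurrent.locks.ReadWriteLock;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
> import java.util.function.Supplier;
>
> // Hypothetical sketch: a coprocessor re-populating the index maintainer cache
> // must wait until an in-flight metadata update (which already invalidated the
> // cache) has completed, so it never observes stale metadata.
> public class MetadataLockSketch<M> {
>     private final ReadWriteLock lock = new ReentrantReadWriteLock();
>
>     public void runMetadataUpdate(Runnable invalidateThenUpdate) {
>         lock.writeLock().lock();
>         try {
>             invalidateThenUpdate.run();  // invalidate caches, then write metadata
>         } finally {
>             lock.writeLock().unlock();
>         }
>     }
>
>     public List<M> loadIndexMaintainers(Supplier<List<M>> loader) {
>         lock.readLock().lock();          // blocks while an update is in progress
>         try {
>             return loader.get();         // rebuild from the latest metadata
>         } finally {
>             lock.readLock().unlock();
>         }
>     }
> }
> {code}
>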
> For every query, including point lookup queries, the Phoenix client currently 
> serializes an IndexMaintainer object, attaches it to the scan object as a 
> scan attribute, and the Phoenix coprocessors then deserialize it from the 
> scan object. To eliminate this serialization/deserialization and save the 
> network bandwidth spent on IndexMaintainer, the Phoenix client can instead 
> pass the cache key for the array of IndexMaintainer objects and the name of 
> the index, and the coprocessor can retrieve the array of IndexMaintainer 
> objects from its server cache and identify the one for the given index. If 
> the array of IndexMaintainer objects is not in the cache, the coprocessor, 
> using the Phoenix client library, can construct the array and populate the 
> server cache with it.
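>
> A sketch of this read-path flow (the attribute names and the cache-miss 
> loader are hypothetical): the client attaches only the cache key and the 
> index name to the scan, and the coprocessor resolves the IndexMaintainer 
> array from its cache, constructing it on a miss.
> {code:java}
> import java.nio.ByteBuffer;
> import java.util.List;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.function.Function;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ScanSideLookupSketch<M> {
>     // Hypothetical attribute keys; Phoenix would define its own constants.
>     private static final String CACHE_KEY_ATTR = "_IndexMaintainerCacheKey";
>     private static final String INDEX_NAME_ATTR = "_IndexName";
>
>     private final ConcurrentHashMap<ByteBuffer, List<M>> cache = new ConcurrentHashMap<>();
>
>     // Client side: no serialized IndexMaintainer travels with the scan any more.
>     public static void attach(Scan scan, byte[] cacheKey, String indexName) {
>         scan.setAttribute(CACHE_KEY_ATTR, cacheKey);
>         scan.setAttribute(INDEX_NAME_ATTR, Bytes.toBytes(indexName));
>     }
>
>     // Server side: resolve from the cache, rebuilding the array via the
>     // Phoenix client library (represented here by the loader) on a miss.
>     public List<M> resolve(Scan scan, Function<byte[], List<M>> buildFromMetadata) {
>         byte[] cacheKey = scan.getAttribute(CACHE_KEY_ATTR);
>         return cache.computeIfAbsent(ByteBuffer.wrap(cacheKey),
>                 k -> buildFromMetadata.apply(cacheKey));
>     }
> }
> {code}
>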
> A mutation or batch of mutations on a data table requires updating all the 
> indexes on that data table. For the cache to be efficient, we need to have 
> the IndexMaintainer objects for all of these indexes. This is the reason this 
> design chooses to cache the array of IndexMaintainer objects (for a given 
> table or view) instead of caching individual IndexMaintainer objects. 
>
> This design achieves cache coherency, which impacts the availability of 
> metadata operations: for a metadata operation to succeed, MetaDataEndpointImpl 
> must first be able to invalidate the index maintainer caches on the region 
> servers, so that Phoenix coprocessors retrieve the most recent metadata from 
> MetaDataEndpointImpl when they need it. Depending on how table regions are 
> distributed over the region servers, a subset or all of the server caches may 
> need to be invalidated for a given piece of metadata. 
>
> It is important to note that overall cluster availability is not impacted 
> significantly, since metadata operations are rare compared to the read/write 
> operations on user data, both in number and in frequency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
