harish876 opened a new issue, #180: URL: https://github.com/apache/incubator-resilientdb/issues/180
# Background

Currently, secondary indexing is offloaded to an external datastore by exporting all chain data to an external source. This introduces several moving parts, including a utility like [Python Cache](https://github.com/apache/incubator-resilientdb-resilient-python-cache). Although the chain is [in-memory](https://github.com/apache/incubator-resilientdb/blob/master/chain/state/chain_state.h#L38), every client issues a [GetAllBlocks](https://github.com/apache/incubator-resilientdb-graphql/blob/main/service/http_server/crow_service.cpp#L394) request, which retrieves the entire blockchain state, pulling *all* the data. While modifying this structure could improve overall blockchain state retrieval, that is not our current focus.

# Goal

Our goal is instead to improve read latencies for queries based on secondary-attribute lookups from the [storage layer](https://github.com/apache/incubator-resilientdb/blob/master/chain/storage/storage.h), which is the most common use case for applications. These queries come from applications that use the storage engine as a document store (i.e., storing values as JSON objects) or as a simple key-value store. A few applications that do this, and that are part of this release, are:

1. [ResLens](https://github.com/apache/incubator-resilientdb-ResLens) - uses ResDB as a simple KV store
2. [ResCanvas](https://github.com/ResilientApp/ResCanvas) - uses ResDB as a document store
3. [Coinsensus](https://github.com/ResilientApp/Coinsensus-Backend) - uses ResDB as a document store

# Problem Statement

Currently, if an application has to perform a lookup based on a secondary attribute, it must:

- hit the [GetAllValues](https://github.com/apache/incubator-resilientdb-graphql/blob/main/service/http_server/crow_service.cpp#L67) endpoint, and then
- manually apply filtering logic in memory at the application layer.
- The alternative is to export all the data to an external datastore, keep it constantly in sync, and apply the filtering logic there.

# Proposed Solution

Composite keys are a well-established way to add indexing support on non-primary attributes. They allow indexing by a single field or by multiple fields, and support covering indexes for different workloads. They are widely used for this purpose, for example in MySQL's [MyRocks](https://github.com/facebook/mysql-5.6/wiki/MyRocks-record-format#secondary-index-c) storage engine, and by the [Hyperledger](https://hyperledger-fabric.readthedocs.io/en/release-2.5/) blockchain as a lightweight indexing mechanism atop a key-value store ([composite keys in Hyperledger](https://pkg.go.dev/github.com/hyperledger/fabric/core/chaincode/shim#ChaincodeStub.CreateCompositeKey)). We can leverage LevelDB or [RocksDB's BlobDB](https://github.com/facebook/rocksdb/wiki/BlobDB); the latter reduces write amplification on large values, which are common in applications that use ResilientDB as a document store.

# Technical Details

A PR with the header files is to be attached; the general idea is:

1. The API layer adds two new endpoints: `CreateCompositeKey` and `GetByCompositeKey`.
2. The storage engine adds these two calls to its interface and implements them.
3. The proto files are extended with these two calls so that the API can talk to the ResilientDB process.

# Advantages of our Solution

- **Improved Read Latency**: Enables faster lookups by indexing secondary attributes directly in the storage layer.
- **Reduced Application Complexity**: Eliminates the need for client-side filtering logic or full data scans.
- **No External Sync Required**: Removes the dependency on external databases and continuous export/sync pipelines.
- **Lightweight Implementation**: Composite keys require minimal overhead and can be implemented without significant architectural changes.
- **Supports Richer Queries**: Enables filtering and retrieval by multiple fields or field combinations (covering indexes).
- **Built on Proven Techniques**: Uses battle-tested patterns from systems like MyRocks (MySQL) and Hyperledger Fabric.
- **Document Store Friendly**: Optimized for use cases where values are stored as JSON or large blobs, especially with RocksDB's BlobDB.
- **Scalable Design**: Can handle high write and read throughput while preserving query efficiency.
- **Expands Use Cases**: Unlocks new classes of applications such as dashboards, real-time analytics, and search-backed services.
- **Easy Adoption**: Existing applications do not need to change their code. These features are add-ons and can be enabled without deleting, modifying, or migrating data.

# Disadvantages of our Solution

- **Increased Write Amplification**: Any additional index needs extra space; this is the classic space-vs-time trade-off. We can limit the number of composite keys an application may create. The keys themselves take up very little space: they are effectively references to the primary record and are, by default, non-clustered indexes.
- **Increased App Complexity**: We do not create composite keys automatically; instead, the developer builds them explicitly, which works just like inserting another key-value pair. The write therefore becomes non-atomic, but the application can handle this by wrapping the data insert and the index creation in a single transaction.
