harish876 opened a new issue, #180:
URL: https://github.com/apache/incubator-resilientdb/issues/180

   # Background
   
   Currently, secondary indexing is offloaded by exporting all chain data to an external datastore. This introduces several moving parts, including utilities like [Python Cache](https://github.com/apache/incubator-resilientdb-resilient-python-cache).
   
   Although the chain is 
[in-memory](https://github.com/apache/incubator-resilientdb/blob/master/chain/state/chain_state.h#L38),
 every client issues a 
[GetAllBlocks](https://github.com/apache/incubator-resilientdb-graphql/blob/main/service/http_server/crow_service.cpp#L394)
 request. This retrieves the entire blockchain state, pulling *all* the data. 
While modifying this structure could improve overall blockchain state 
retrieval, that is not our current focus.
   
   # Goal
   
   Instead, our goal is to improve read latencies for queries based on secondary-attribute lookups from the [storage layer](https://github.com/apache/incubator-resilientdb/blob/master/chain/storage/storage.h), which is the most common use case for applications. These would be applications using the storage engine as a document store, i.e., storing values as JSON objects, or otherwise as a simple key-value store. A few applications that do this and are part of this release are:
   
   1. [ResLens](https://github.com/apache/incubator-resilientdb-ResLens) - Uses 
ResDB as a simple KV store
   2. [ResCanvas](https://github.com/ResilientApp/ResCanvas) - Uses ResDB as a 
document store
   3. [Coinsensus](https://github.com/ResilientApp/Coinsensus-Backend) - Uses ResDB as a document store
   
   # Problem Statement
   
   Currently, if an application has to perform a lookup based on a secondary attribute, the following steps are required:
    - Hit the [GetAllValues](https://github.com/apache/incubator-resilientdb-graphql/blob/main/service/http_server/crow_service.cpp#L67) endpoint.
    - Manually apply filtering logic in memory at the application layer.
    - Alternatively, export all the data to an external datastore, sync it constantly, and apply the filtering logic there.
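   To make the cost concrete, here is a minimal sketch of what the status quo forces onto the application: pull every value and filter in memory. Documents are modeled as flat attribute maps to keep the sketch dependency-free (the real responses are JSON strings); all names here are illustrative, not from the actual codebase.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A document is modeled as a flat attribute map for this sketch.
using Doc = std::map<std::string, std::string>;

// Returns the primary keys of all documents whose `attr` equals `wanted`.
// Note the O(N) scan over the entire store -- the cost this proposal
// aims to avoid.
std::vector<std::string> FilterByAttribute(
    const std::map<std::string, Doc>& all_values,  // key -> document
    const std::string& attr, const std::string& wanted) {
  std::vector<std::string> matches;
  for (const auto& [key, doc] : all_values) {
    auto it = doc.find(attr);
    if (it != doc.end() && it->second == wanted) matches.push_back(key);
  }
  return matches;
}
```

   Every query pays for the full dataset, regardless of how selective the predicate is.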
   
   # Proposed Solution
   
   Composite keys are a great way to add indexing support on non-primary attributes. They allow indexing by a single field, by multiple fields, and building covering indexes for different workloads. They are widely used for this purpose, for example in MySQL's [MyRocks](https://github.com/facebook/mysql-5.6/wiki/MyRocks-record-format#secondary-index-c) storage engine.
   
   They are also used by the [Hyperledger](https://hyperledger-fabric.readthedocs.io/en/release-2.5/) blockchain as a lightweight indexing mechanism atop a key-value store; see [Composite keys in Hyperledger](https://pkg.go.dev/github.com/hyperledger/fabric/core/chaincode/shim#ChaincodeStub.CreateCompositeKey).
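   To illustrate the idea, here is one possible key encoding (the delimiter and layout are assumptions for this sketch, not the final format): the index name, the indexed attribute value, and the primary key are concatenated with a low-sorting separator, so that all entries for the same (index, value) pair are adjacent in the ordered key space of LevelDB/RocksDB and can be retrieved with a single prefix scan.

```cpp
#include <cassert>
#include <string>

// Encode a composite key as index \x01 value \x01 primary_key.
// The 0x01 separator sorts below printable characters, so entries
// group first by index, then by value, then by primary key.
std::string MakeCompositeKey(const std::string& index,
                             const std::string& value,
                             const std::string& primary_key) {
  const char kSep = '\x01';
  return index + kSep + value + kSep + primary_key;
}
```

   Because the store keeps keys sorted, a lookup on a secondary attribute becomes a cheap iterator seek instead of a full scan.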
   
   We can leverage LevelDB or [RocksDB's BlobDB](https://github.com/facebook/rocksdb/wiki/BlobDB); the latter reduces write amplification for large values, which are common in applications that use ResilientDB as a document store.
   
   # Technical Details
    - A PR with the relevant header files will be attached.
    - The general idea is outlined below.
   
    1. The API layer adds two new endpoints:
          - CreateCompositeKey
          - GetByCompositeKey
    
    2. The storage engine adds these two calls to its interface and implements them.
    3. The proto files need to be extended with these two calls so that the API can talk to the ResilientDB process.
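   The steps above could look roughly like the following sketch. The call names match the proposed endpoints, but the signatures and the in-memory backend are assumptions for illustration, not the actual PR; `MemoryStorage` stands in for a LevelDB/RocksDB-backed implementation, using `std::map`'s ordering in place of an iterator prefix seek.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical extension of the storage interface with the two new calls.
class Storage {
 public:
  virtual ~Storage() = default;
  virtual bool CreateCompositeKey(const std::string& index,
                                  const std::string& value,
                                  const std::string& primary_key) = 0;
  virtual std::vector<std::string> GetByCompositeKey(
      const std::string& index, const std::string& value) = 0;
};

// In-memory stand-in for a LevelDB/RocksDB-backed implementation.
class MemoryStorage : public Storage {
 public:
  bool CreateCompositeKey(const std::string& index, const std::string& value,
                          const std::string& primary_key) override {
    // The index entry's value is empty: the key alone carries the mapping.
    entries_[index + kSep + value + kSep + primary_key] = "";
    return true;
  }

  std::vector<std::string> GetByCompositeKey(
      const std::string& index, const std::string& value) override {
    const std::string prefix = index + kSep + value + kSep;
    std::vector<std::string> out;
    // Seek to the first key with the prefix, then scan while it matches.
    for (auto it = entries_.lower_bound(prefix);
         it != entries_.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
      out.push_back(it->first.substr(prefix.size()));  // the primary key
    }
    return out;
  }

 private:
  static constexpr char kSep = '\x01';
  std::map<std::string, std::string> entries_;
};
```

   The proto changes in step 3 would simply mirror these two calls as request/response messages so the API layer can forward them to the ResilientDB process.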
   
   
   # Advantages of our Solution
    - **Improved Read Latency**: Enables faster lookups by indexing secondary 
attributes directly in the storage layer.
   
   - **Reduces Application Complexity**: Eliminates the need for client-side 
filtering logic or full data scans.
   
   - **No External Sync Required**: Removes dependency on external databases 
and continuous export/sync pipelines.
   
   - **Lightweight Implementation**: Composite keys require minimal overhead 
and can be implemented without significant architectural changes.
   
   - **Supports Richer Queries**: Enables filtering and retrieval by multiple fields or field combinations (covering indexes).
   
   - **Built on Proven Techniques**: Uses battle-tested patterns from systems 
like MyRocks (MySQL) and Hyperledger Fabric.
   
   - **Document Store Friendly**: Optimized for use cases where values are 
stored as JSON or large blobs, especially with RocksDB’s BlobDB.
   
   - **Scalable Design**: Can handle high write and read throughput while 
preserving query efficiency.
   
   - **Expands Use Cases**: Unlocks new classes of applications like 
dashboards, real-time analytics, and search-backed services.
   
   - **Easy Adoption**: Existing applications do not need to change their code. All these features are add-ons and can be enabled without deleting, modifying, or migrating data.
   
   # Disadvantages of our Solution
   - **Increased Write Amplification**: Any secondary index needs extra space and extra writes; this is the classic space-versus-time trade-off. We can limit the number of composite keys an application is allowed to create. The index entries themselves take up very little space: by default they are non-clustered, storing only a reference to the primary record rather than a copy of the value.
   
   - **Increased App Complexity**: We do not create composite keys automatically; rather, we ask the developer to build them. This works just like inserting another key-value pair. The write then becomes non-atomic, but the application can handle this by wrapping the data insert and the index creation in a single transaction.
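   The mitigation above can be sketched as staging the document write and its index write together and applying them as one unit, the way a LevelDB/RocksDB `WriteBatch` would. All names in this sketch are illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a storage-engine write batch: puts are staged
// and later applied together.
struct WriteBatch {
  std::vector<std::pair<std::string, std::string>> puts;
  void Put(const std::string& key, const std::string& value) {
    puts.emplace_back(key, value);
  }
};

void ApplyAtomically(std::map<std::string, std::string>& kv,
                     const WriteBatch& batch) {
  // A real engine commits the batch under a single WAL record, so
  // either both the document and its index entry land, or neither does.
  for (const auto& [k, v] : batch.puts) kv[k] = v;
}
```

   With this pattern, a reader can never observe an index entry pointing at a document that was not written, or vice versa.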


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
