Jayanth Vennamreddy created AIRAVATA-3970:
---------------------------------------------

             Summary: [GSoC] Adaptive Metadata Indexing and ATLAS Integration in Airavata Data Catalog
                 Key: AIRAVATA-3970
                 URL: https://issues.apache.org/jira/browse/AIRAVATA-3970
             Project: Airavata
          Issue Type: Improvement
            Reporter: Jayanth Vennamreddy


Background:

Apache Airavata’s data catalog currently provides structured metadata storage 
but does not support workload-aware indexing or adaptive optimization 
strategies. With the integration of external scientific datasets such as ATLAS 
(molecular dynamics simulations of ~1,900 proteins with rich structural and 
domain metadata), the catalog must support advanced retrieval patterns 
including:

- Key-based lookups by PDB chain
- Multi-field metadata filtering (organism, domain classification, resolution, 
MD metrics)
- Batch metadata retrieval following similarity searches
- Potential cross-system integration with sequence/structure search tools
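
As a rough illustration of the first three patterns, here is a minimal in-memory sketch. The ProteinRecord type, its field names, and all sample values are hypothetical placeholders, not the Airavata data catalog schema or real ATLAS entries.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    pdb_chain: str
    organism: str
    resolution: float  # angstroms
    # Multi-value classification: scheme name -> list of labels.
    domains: dict = field(default_factory=dict)


def lookup(catalog, pdb_chain):
    """Key-based lookup by PDB chain; returns None if the key is absent."""
    return catalog.get(pdb_chain)


def filter_records(catalog, organism=None, max_resolution=None, domain=None):
    """Multi-field filter; `domain` is a (scheme, label) pair."""
    matches = []
    for rec in catalog.values():
        if organism is not None and rec.organism != organism:
            continue
        if max_resolution is not None and rec.resolution > max_resolution:
            continue
        if domain is not None and domain[1] not in rec.domains.get(domain[0], []):
            continue
        matches.append(rec)
    return matches


def batch_lookup(catalog, pdb_chains):
    """Batch retrieval (e.g. for similarity-search hits); keeps input order
    and silently skips unknown keys."""
    return [catalog[key] for key in pdb_chains if key in catalog]


# Made-up sample data, for demonstration only.
records = [
    ProteinRecord("1abc_A", "Homo sapiens", 1.8, {"CATH": ["demo-1"]}),
    ProteinRecord("2xyz_B", "Escherichia coli", 2.5, {"CATH": ["demo-2"], "SCOP": ["demo-3"]}),
    ProteinRecord("3def_A", "Homo sapiens", 3.1, {"ECOD": ["demo-4"]}),
]
catalog = {rec.pdb_chain: rec for rec in records}

human_high_res = filter_records(catalog, organism="Homo sapiens", max_resolution=2.0)
hits = batch_lookup(catalog, ["3def_A", "missing", "1abc_A"])
```

The linear scan in filter_records is the cost that secondary indexes would be meant to avoid as the record count grows.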

Problem:

Static indexing strategies scale poorly for scientific metadata workloads, 
where query patterns vary significantly between research users: an index set 
chosen up front serves some users' filters and leaves others to full scans. 
Additionally, current retrieval APIs are optimized for single-key access and 
do not support efficient bulk or filtered access.
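
To make the bulk-access gap concrete, the sketch below fetches many keys with one query per chunk instead of one query per key. It uses SQLite from the Python standard library purely as a stand-in; the protein_metadata table, its columns, and the sample rows are assumptions for illustration and say nothing about the catalog's actual storage backend.

```python
import sqlite3


def fetch_batch(conn, pdb_chains, chunk_size=500):
    """Fetch rows for many keys using one IN (...) query per chunk.

    Chunking keeps the number of bound parameters per statement bounded.
    Results follow the input order; unknown keys are skipped.
    """
    keys = list(dict.fromkeys(pdb_chains))  # de-duplicate, keep order
    found = {}
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        sql = (
            "SELECT pdb_chain, organism, resolution "
            f"FROM protein_metadata WHERE pdb_chain IN ({placeholders})"
        )
        for row in conn.execute(sql, chunk):
            found[row[0]] = row
    return [found[key] for key in keys if key in found]


# Hypothetical table and made-up rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein_metadata ("
    "pdb_chain TEXT PRIMARY KEY, organism TEXT, resolution REAL)"
)
conn.executemany(
    "INSERT INTO protein_metadata VALUES (?, ?, ?)",
    [
        ("1abc_A", "Homo sapiens", 1.8),
        ("2xyz_B", "Escherichia coli", 2.5),
        ("3def_A", "Homo sapiens", 3.1),
    ],
)

batch = fetch_batch(conn, ["3def_A", "1abc_A", "nope"], chunk_size=2)
```

A real batch API would also need to decide how to report missing keys; skipping them silently is just the simplest choice for a sketch.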

Proposed Work:

1. Extend Airavata’s data catalog schema to support protein-scale metadata 
(including multi-value domain classification fields such as ECOD, CATH, SCOP).
2. Implement query telemetry to capture metadata access patterns.
3. Design and implement an adaptive indexing engine that:
   - Tracks frequently filtered fields
   - Dynamically creates or adjusts secondary indexes
   - Optimizes composite index creation based on workload statistics
4. Implement batch metadata retrieval APIs optimized for 
similarity-search-driven workflows.
5. Evaluate performance on synthetically scaled datasets (10k–100k protein 
metadata records).
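
Items 2 and 3 could be prototyped along the following lines. This is only a sketch under stated assumptions: the IndexAdvisor class, its thresholds, the INDEXABLE column whitelist, and the protein_metadata table are all hypothetical, and SQLite stands in for whatever database the catalog actually uses.

```python
import sqlite3
from collections import Counter
from itertools import combinations

# Hypothetical whitelist of indexable columns. It also keeps arbitrary
# identifiers out of the dynamically built CREATE INDEX statements.
INDEXABLE = {"organism", "resolution", "cath_class"}


class IndexAdvisor:
    """Counts filtered fields per query and recommends indexes."""

    def __init__(self, single_threshold=3, pair_threshold=2):
        self.single = Counter()  # field -> number of queries filtering on it
        self.pairs = Counter()   # (field, field) -> co-occurrence count
        self.single_threshold = single_threshold
        self.pair_threshold = pair_threshold

    def record(self, filtered_fields):
        """Telemetry hook: call once per query with the fields it filtered on."""
        fields = sorted(set(filtered_fields) & INDEXABLE)
        self.single.update(fields)
        self.pairs.update(combinations(fields, 2))

    def recommend(self):
        """Return column tuples to index: composites first, then singles."""
        composite = sorted(
            p for p, n in self.pairs.items() if n >= self.pair_threshold
        )
        # The leading column of a composite index can also serve
        # single-column filters, so skip a separate index for it.
        covered = {p[0] for p in composite}
        singles = sorted(
            (f,) for f, n in self.single.items()
            if n >= self.single_threshold and f not in covered
        )
        return composite + singles


def apply_indexes(conn, recommendations, table="protein_metadata"):
    """Create the recommended indexes if they do not exist yet."""
    created = []
    for cols in recommendations:
        name = "idx_" + "_".join(cols)
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS {name} ON {table} ({', '.join(cols)})"
        )
        created.append(name)
    return created


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein_metadata (pdb_chain TEXT PRIMARY KEY, "
    "organism TEXT, resolution REAL, cath_class TEXT)"
)

advisor = IndexAdvisor()
observed_queries = [  # made-up workload
    ["organism", "resolution"], ["organism", "resolution"], ["organism"],
    ["cath_class"], ["cath_class"], ["cath_class"],
]
for fields in observed_queries:
    advisor.record(fields)

created = apply_indexes(conn, advisor.recommend())
```

Composite column order here is simply alphabetical; a real engine would more likely order columns by selectivity or by equality-versus-range usage, and would also need a policy for dropping indexes that stop being used.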

Expected Outcomes:

- Lower metadata retrieval latency
- Reduced scan overhead for filter-heavy queries
- Better support for scientific discovery workflows
- A foundation for scalable ATLAS-style dataset integration

This issue will serve as the tracking issue for the GSoC 2026 proposal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
