Jayanth Vennamreddy created AIRAVATA-3970:
---------------------------------------------
Summary: [GSoC] Adaptive Metadata Indexing and ATLAS Integration
in Airavata Data Catalog
Key: AIRAVATA-3970
URL: https://issues.apache.org/jira/browse/AIRAVATA-3970
Project: Airavata
Issue Type: Improvement
Reporter: Jayanth Vennamreddy
Background:
Apache Airavata’s data catalog currently provides structured metadata storage
but does not support workload-aware indexing or adaptive optimization
strategies. With the integration of external scientific datasets such as ATLAS
(molecular dynamics simulations of ~1,900 proteins with rich structural and
domain metadata), the catalog must support advanced retrieval patterns
including:
- Key-based lookups by PDB chain
- Multi-field metadata filtering (organism, domain classification, resolution,
MD metrics)
- Batch metadata retrieval following similarity searches
- Potential cross-system integration with sequence/structure search tools
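To make the first three retrieval patterns concrete, here is a minimal in-memory sketch in Python. It is illustrative only: ProteinRecord, InMemoryCatalog, and their field names are hypothetical and do not reflect the existing Airavata data catalog API, and the sample records are made-up placeholders rather than real ATLAS entries.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinRecord:
    """Hypothetical metadata record; field names are assumptions."""
    pdb_chain: str
    organism: str
    resolution: float  # angstroms
    # Multi-value domain classifications, e.g. {"CATH": ["3.40.50"]}
    domains: dict = field(default_factory=dict)


class InMemoryCatalog:
    """Toy stand-in for the catalog, keyed by PDB chain."""

    def __init__(self, records):
        self._by_chain = {r.pdb_chain: r for r in records}

    def get(self, pdb_chain):
        # Key-based lookup by PDB chain.
        return self._by_chain.get(pdb_chain)

    def get_batch(self, pdb_chains):
        # Batch retrieval, e.g. for hits returned by a similarity search.
        # Results keep the input order; unknown chains are skipped.
        return [self._by_chain[c] for c in pdb_chains if c in self._by_chain]

    def filter(self, organism=None, max_resolution=None, domain=None):
        # Multi-field filtering; `domain` is a (scheme, label) pair.
        matches = []
        for rec in self._by_chain.values():
            if organism is not None and rec.organism != organism:
                continue
            if max_resolution is not None and rec.resolution > max_resolution:
                continue
            if domain is not None and domain[1] not in rec.domains.get(domain[0], []):
                continue
            matches.append(rec)
        return matches


# Placeholder records, not real ATLAS data.
catalog = InMemoryCatalog([
    ProteinRecord("1abc_A", "Homo sapiens", 1.8, {"CATH": ["3.40.50"]}),
    ProteinRecord("2xyz_B", "Escherichia coli", 2.5, {"CATH": ["1.10.10"]}),
    ProteinRecord("3def_A", "Homo sapiens", 2.9, {"CATH": ["3.40.50", "1.10.10"]}),
])

single = catalog.get("1abc_A")
batch = catalog.get_batch(["3def_A", "missing", "1abc_A"])
filtered = catalog.filter(organism="Homo sapiens", max_resolution=2.0)
by_domain = catalog.filter(domain=("CATH", "1.10.10"))
```

In this sketch every filter call scans all records, which is the overhead that secondary indexes on frequently filtered fields are meant to avoid.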
Problem:
A fixed set of indexes chosen up front does not scale well for scientific
metadata workloads, where the fields being queried vary significantly between
research users. Additionally, the current retrieval APIs are optimized for
single-key access and do not support efficient bulk or filtered access.
Proposed Work:
1. Extend Airavata’s data catalog schema to support protein-scale metadata
(including multi-value domain classification fields such as ECOD, CATH, SCOP).
2. Implement query telemetry to capture metadata access patterns.
3. Design and implement an adaptive indexing engine that:
   - Tracks frequently filtered fields
   - Dynamically creates or adjusts secondary indexes
   - Optimizes composite index creation based on workload statistics
4. Implement batch metadata retrieval APIs optimized for
similarity-search-driven workflows.
5. Evaluate performance under synthetic scaling (10k–100k protein metadata
records).
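As a rough illustration of how items 2 and 3 could fit together, the Python sketch below counts which field combinations appear in query filters and creates a secondary (or composite) index once a combination crosses a threshold. It uses an in-memory SQLite database purely as a stand-in; the table name, column names, threshold, and the AdaptiveIndexer class are all assumptions for illustration, not the proposed design or the catalog's actual schema.

```python
import sqlite3
from collections import Counter

# Hypothetical filterable columns; the real schema is not defined in this issue.
INDEXABLE = {"organism", "cath_class", "resolution"}


class AdaptiveIndexer:
    """Counts filtered field combinations and indexes the frequent ones."""

    def __init__(self, conn, table="protein_metadata", threshold=3):
        self.conn = conn
        self.table = table
        self.threshold = threshold
        self.combo_counts = Counter()  # telemetry: field tuple -> query count
        self.created = set()

    def record_query(self, filter_fields):
        # Normalize so that field order in the query does not matter.
        fields = tuple(sorted(f for f in filter_fields if f in INDEXABLE))
        if not fields:
            return
        self.combo_counts[fields] += 1
        if self.combo_counts[fields] >= self.threshold and fields not in self.created:
            self._create_index(fields)

    def _create_index(self, fields):
        # Names come only from the INDEXABLE whitelist, so interpolating
        # them into the statement is safe here.
        name = "idx_" + "_".join(fields)
        cols = ", ".join(fields)
        self.conn.execute(
            f"CREATE INDEX IF NOT EXISTS {name} ON {self.table} ({cols})"
        )
        self.created.add(fields)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein_metadata ("
    "pdb_chain TEXT PRIMARY KEY, organism TEXT, cath_class TEXT, resolution REAL)"
)

indexer = AdaptiveIndexer(conn, threshold=2)
indexer.record_query(["organism"])
indexer.record_query(["organism", "resolution"])
indexer.record_query(["resolution", "organism"])  # same combination, second hit

index_names = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx_%'"
    )
)
```

The sketch simplifies several things a real engine would need to handle: composite column order is alphabetical rather than selectivity-based, indexes are never dropped when a pattern goes cold, and index creation happens inline instead of in a background task.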
Expected Outcomes:
- Lower metadata retrieval latency for key-based, filtered, and batch queries
- Reduced scan overhead for filter-heavy queries
- Better support for scientific discovery workflows that combine similarity
search with metadata lookup
- A foundation for scalable ATLAS-style dataset integration
This issue will serve as the tracking issue for the GSoC 2026 proposal.