[ https://issues.apache.org/jira/browse/AIRAVATA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jayanth Vennamreddy updated AIRAVATA-3970:
------------------------------------------
Description:
Background:
Apache Airavata’s data catalog provides structured metadata storage but
currently lacks workload-aware optimization strategies and efficient support
for filter-heavy scientific metadata queries.
With the potential integration of external scientific datasets such as ATLAS (a
molecular dynamics database containing ~1,900 protein simulations with rich
structural, domain, and MD-metric metadata), the catalog must support more
advanced retrieval patterns, including:
- Key-based lookups by PDB chain
- Multi-field metadata filtering (organism, domain classification, resolution,
MD metrics)
- Batch metadata retrieval following similarity searches
- Scalable performance under increasing dataset size (10k–100k+ records)
Current metadata retrieval mechanisms are built around structured storage; they
do not yet use workload-aware indexing or filter-efficient retrieval for
domain-rich scientific data.
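As a rough illustration of the retrieval patterns listed above (key-based
lookup, multi-field filtering, batch fetch), here is a minimal sketch using
SQLite from the Python standard library. Everything in it is an assumption made
for illustration: the table and column names (protein, domain_class, pdb_chain,
resolution_a), the helper functions, and the placeholder rows are hypothetical
and do not reflect Airavata's actual catalog schema or real ATLAS records.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Airavata's data catalog.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    pdb_chain    TEXT PRIMARY KEY,
    organism     TEXT NOT NULL,
    resolution_a REAL
);
-- One row per (chain, scheme, label) keeps multi-value classifications normalized.
CREATE TABLE domain_class (
    pdb_chain TEXT NOT NULL REFERENCES protein(pdb_chain),
    scheme    TEXT NOT NULL CHECK (scheme IN ('ECOD', 'CATH', 'SCOP')),
    label     TEXT NOT NULL,
    PRIMARY KEY (pdb_chain, scheme, label)
);
""")

# Placeholder rows, not real ATLAS data.
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
    ("xxxx_A", "organism_a", 1.8),
    ("yyyy_B", "organism_a", 2.5),
    ("zzzz_C", "organism_b", 1.5),
])
conn.executemany("INSERT INTO domain_class VALUES (?, ?, ?)", [
    ("xxxx_A", "CATH", "cath_label_1"),
    ("xxxx_A", "ECOD", "ecod_label_1"),
    ("yyyy_B", "CATH", "cath_label_1"),
    ("zzzz_C", "SCOP", "scop_label_1"),
])


def get_by_chain(conn, pdb_chain):
    """Key-based lookup: return one row tuple, or None if the chain is unknown."""
    return conn.execute(
        "SELECT pdb_chain, organism, resolution_a FROM protein WHERE pdb_chain = ?",
        (pdb_chain,),
    ).fetchone()


def filter_proteins(conn, organism, max_resolution, scheme, label):
    """Multi-field filter: organism, resolution cutoff, and one domain label."""
    rows = conn.execute(
        """
        SELECT p.pdb_chain
        FROM protein p
        JOIN domain_class d ON d.pdb_chain = p.pdb_chain
        WHERE p.organism = ? AND p.resolution_a <= ?
          AND d.scheme = ? AND d.label = ?
        ORDER BY p.pdb_chain
        """,
        (organism, max_resolution, scheme, label),
    ).fetchall()
    return [row[0] for row in rows]


def get_batch(conn, chains):
    """Batch fetch: one round trip for many keys, e.g. similarity-search hits."""
    if not chains:
        return []
    placeholders = ",".join("?" * len(chains))
    return conn.execute(
        "SELECT pdb_chain, organism, resolution_a FROM protein "
        f"WHERE pdb_chain IN ({placeholders}) ORDER BY pdb_chain",
        list(chains),
    ).fetchall()
```

Keys that are absent from the table are simply missing from the get_batch
result, so a caller that needs to detect them has to compare against its input
list.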
Problem:
Static indexing strategies do not scale effectively for scientific metadata
workloads where query patterns vary across research users. Additionally,
current APIs are optimized primarily for single-key access and do not support
efficient bulk or filter-driven retrieval.
As Airavata evolves to support protein-scale metadata and
similarity-search-driven workflows, improvements in schema design, indexing
strategy, and retrieval efficiency become necessary.
Proposed Work:
1. Design and implement a normalized schema extension in Airavata’s data
catalog to represent ATLAS-style protein metadata, including support for
multi-value domain classification fields (ECOD, CATH, SCOP).
2. Implement a metadata ingestion pipeline to import ATLAS protein records
(~1,900 entries) into the catalog and validate correctness.
3. Add query telemetry instrumentation to metadata APIs to capture:
- Filter predicates used
- Query latency
- Result set size
- Field access frequency
4. Based on observed workload patterns, implement an index optimization module
that:
- Identifies high-frequency filter fields
- Recommends and creates appropriate secondary or composite indexes
- Benchmarks performance before and after index creation
5. Implement a batch metadata retrieval API optimized for
similarity-search-driven workflows, enabling efficient bulk fetch of protein
metadata records.
6. Evaluate performance under synthetic scaling (10k–100k records) to measure
query latency improvements and indexing overhead.
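Steps 3 and 4 above could be prototyped along the following lines. This is only
a sketch under stated assumptions: the QueryEvent record, the recommend_indexes
function, and the min_share threshold are hypothetical names introduced here,
and the rule (recommend any field or field pair that appears in at least a given
share of observed queries) is one simple heuristic, not the design the proposal
commits to.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class QueryEvent:
    """One telemetry record per metadata query (hypothetical shape)."""
    filter_fields: tuple  # field names used as filter predicates
    latency_ms: float     # observed query latency
    result_size: int      # number of rows returned


def recommend_indexes(events, min_share=0.3):
    """Return candidate indexes as sorted tuples of field names.

    A single field yields a secondary-index candidate and a field pair yields
    a composite-index candidate when it appears in at least `min_share` of
    all recorded queries.
    """
    total = len(events)
    if total == 0:
        return []
    single, pairs = Counter(), Counter()
    for event in events:
        fields = sorted(set(event.filter_fields))
        single.update(fields)
        pairs.update(combinations(fields, 2))
    candidates = [(f,) for f, n in single.items() if n / total >= min_share]
    candidates += [p for p, n in pairs.items() if n / total >= min_share]
    return sorted(candidates)


# Illustrative events only; the latencies and sizes are made-up values.
events = [
    QueryEvent(("organism", "resolution"), 12.0, 40),
    QueryEvent(("organism",), 3.0, 5),
    QueryEvent(("resolution", "organism"), 15.0, 38),
    QueryEvent(("cath",), 9.0, 120),
]
candidates = recommend_indexes(events)
```

A fuller version would presumably also weigh latency and result size, so that a
rarely used but very slow filter can still earn an index.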
Expected Outcomes:
- ATLAS metadata fully represented in Airavata’s data catalog with proper
schema support.
- Telemetry-driven index optimization workflow implemented.
- Demonstrated reduction in metadata query latency for filter-heavy queries.
- Support for bulk metadata retrieval following similarity searches.
- Benchmarks and documentation demonstrating scalability improvements.
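One way the before/after benchmarks mentioned above might be structured is
sketched here, again with SQLite standing in for the real backend. The table
layout, the index name idx_protein_organism, the helper functions, and the
synthetic rows are all assumptions for illustration; actual measurements would
have to run against Airavata's own storage layer.

```python
import sqlite3
import statistics
import time

QUERY = "SELECT pdb_chain FROM protein WHERE organism = ?"
PARAMS = ("organism_7",)


def build_catalog(n_rows):
    """Create an in-memory table filled with synthetic placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (pdb_chain TEXT PRIMARY KEY, organism TEXT NOT NULL)"
    )
    rows = ((f"chain_{i}", f"organism_{i % 50}") for i in range(n_rows))
    conn.executemany("INSERT INTO protein VALUES (?, ?)", rows)
    conn.commit()
    return conn


def query_plan(conn):
    """Return SQLite's query plan text for the benchmark query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    return " ".join(row[3] for row in rows)


def median_latency_ms(conn, repeats=20):
    """Median wall-clock latency of the benchmark query, in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(QUERY, PARAMS).fetchall()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


conn = build_catalog(10_000)
plan_before, ms_before = query_plan(conn), median_latency_ms(conn)

conn.execute("CREATE INDEX idx_protein_organism ON protein (organism)")
plan_after, ms_after = query_plan(conn), median_latency_ms(conn)

print(f"before: {ms_before:.3f} ms  plan: {plan_before}")
print(f"after:  {ms_after:.3f} ms  plan: {plan_after}")
```

Checking the query plan alongside the timing helps confirm that a latency change
really comes from the new index rather than from caching or measurement noise.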
This issue will serve as the tracking issue for the GSoC 2026 proposal and
scope discussion.
was:
Background:
Apache Airavata’s data catalog currently provides structured metadata storage
but does not support workload-aware indexing or adaptive optimization
strategies. With the integration of external scientific datasets such as ATLAS
(molecular dynamics simulations of ~1,900 proteins with rich structural and
domain metadata), the catalog must support advanced retrieval patterns
including:
- Key-based lookups by PDB chain
- Multi-field metadata filtering (organism, domain classification, resolution,
MD metrics)
- Batch metadata retrieval following similarity searches
- Potential cross-system integration with sequence/structure search tools
Problem:
Static indexing strategies do not scale well for scientific metadata workloads
where query patterns vary significantly between research users. Additionally,
current retrieval APIs are optimized for single-key access and do not support
efficient bulk or filtered access.
Proposed Work:
1. Extend Airavata’s data catalog schema to support protein-scale metadata
(including multi-value domain classification fields such as ECOD, CATH, SCOP).
2. Implement query telemetry to capture metadata access patterns.
3. Design and implement an adaptive indexing engine that:
- Tracks frequently filtered fields
- Dynamically creates or adjusts secondary indexes
- Optimizes composite index creation based on workload statistics
4. Implement batch metadata retrieval APIs optimized for
similarity-search-driven workflows.
5. Evaluate performance under synthetic scaling (10k–100k protein metadata
records).
Expected Outcomes:
- Improved metadata retrieval latency
- Reduced scan overhead for filter-heavy queries
- Better support for scientific discovery workflows
- A foundation for scalable ATLAS-style dataset integration
This issue will serve as the tracking issue for the GSoC 2026 proposal.
> [GSoC] Adaptive Metadata Indexing and ATLAS Integration in Airavata Data
> Catalog
> --------------------------------------------------------------------------------
>
> Key: AIRAVATA-3970
> URL: https://issues.apache.org/jira/browse/AIRAVATA-3970
> Project: Airavata
> Issue Type: Improvement
> Reporter: Jayanth Vennamreddy
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)