[ 
https://issues.apache.org/jira/browse/AIRAVATA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayanth Vennamreddy updated AIRAVATA-3970:
------------------------------------------
    Description: 
Background:

Apache Airavata’s data catalog provides structured metadata storage but 
currently lacks workload-aware optimization strategies and efficient support 
for filter-heavy scientific metadata queries. 

With potential integration of external scientific datasets such as ATLAS (a 
molecular dynamics database containing ~1,900 protein simulations with rich 
structural, domain, and MD metrics metadata), the catalog must support more 
advanced retrieval patterns, including:

- Key-based lookups by PDB chain
- Multi-field metadata filtering (organism, domain classification, resolution, 
MD metrics)
- Batch metadata retrieval following similarity searches
- Scalable performance under increasing dataset size (10k–100k+ records)

Current metadata retrieval mechanisms are built around structured storage and 
incorporate neither workload-aware indexing nor filter-efficient retrieval 
for domain-rich scientific data.
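
As a rough illustration of the retrieval patterns listed above, the sketch 
below uses Python's standard sqlite3 module with a normalized two-table layout. 
All table names, column names, and sample rows are hypothetical placeholders; 
they are not Airavata's actual catalog schema, nor real ATLAS records.

```python
import sqlite3

# Hypothetical schema: one row per protein chain, plus a child table so a
# chain can carry several classification labels (multi-value field).
SCHEMA = """
CREATE TABLE protein (
    pdb_chain    TEXT PRIMARY KEY,
    organism     TEXT,
    resolution_a REAL              -- resolution in angstroms
);
CREATE TABLE domain_classification (
    pdb_chain TEXT NOT NULL REFERENCES protein(pdb_chain),
    scheme    TEXT NOT NULL,       -- 'ECOD', 'CATH' or 'SCOP'
    label     TEXT NOT NULL,
    PRIMARY KEY (pdb_chain, scheme, label)
);
"""


def build_demo_catalog():
    """Create an in-memory catalog filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
        ("demo1_A", "Homo sapiens", 1.8),
        ("demo2_B", "Escherichia coli", 2.5),
        ("demo3_A", "Homo sapiens", 3.1),
    ])
    conn.executemany("INSERT INTO domain_classification VALUES (?, ?, ?)", [
        ("demo1_A", "CATH", "fold-1"),
        ("demo1_A", "ECOD", "group-7"),
        ("demo2_B", "CATH", "fold-1"),
        ("demo3_A", "CATH", "fold-2"),
    ])
    return conn


def lookup(conn, pdb_chain):
    """Key-based lookup by PDB chain; returns one row or None."""
    return conn.execute(
        "SELECT pdb_chain, organism, resolution_a FROM protein "
        "WHERE pdb_chain = ?", (pdb_chain,)).fetchone()


def filter_proteins(conn, organism, scheme, label, max_resolution):
    """Multi-field filter across the protein and classification tables."""
    rows = conn.execute(
        "SELECT p.pdb_chain FROM protein p "
        "JOIN domain_classification d ON d.pdb_chain = p.pdb_chain "
        "WHERE p.organism = ? AND d.scheme = ? AND d.label = ? "
        "AND p.resolution_a <= ? ORDER BY p.pdb_chain",
        (organism, scheme, label, max_resolution)).fetchall()
    return [row[0] for row in rows]
```

Keeping classification labels in a child table is one possible way to model 
multi-value fields; a production design could equally choose array or JSON 
columns, depending on the backing database.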

Problem:

Static indexing strategies do not scale effectively for scientific metadata 
workloads where query patterns vary across research users. Additionally, 
current APIs are optimized primarily for single-key access and do not support 
efficient bulk or filter-driven retrieval.

As Airavata evolves to support protein-scale metadata and 
similarity-search-driven workflows, improvements in schema design, indexing 
strategy, and retrieval efficiency become necessary.
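
To make the bulk-retrieval gap concrete: a similarity search typically returns 
a list of chain identifiers, and fetching them one key at a time costs one 
round trip per hit. A minimal sketch of a chunked batch fetch follows, again 
using sqlite3 with a hypothetical protein table. The function name, the chunk 
size, and the table layout are assumptions made for illustration only.

```python
import sqlite3

# Chunk size kept below the bound-parameter limit of older SQLite builds (999).
CHUNK = 500


def fetch_batch(conn, pdb_chains):
    """Return {pdb_chain: organism} using one IN (...) query per chunk."""
    found = {}
    ids = list(dict.fromkeys(pdb_chains))  # drop duplicates, keep order
    for start in range(0, len(ids), CHUNK):
        chunk = ids[start:start + CHUNK]
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            "SELECT pdb_chain, organism FROM protein "
            f"WHERE pdb_chain IN ({placeholders})", chunk)
        found.update(rows)  # each row is a (pdb_chain, organism) pair
    return found


def demo_catalog(n):
    """In-memory table with n synthetic placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (pdb_chain TEXT PRIMARY KEY, organism TEXT)")
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [(f"demo{i}_A", "synthetic") for i in range(n)])
    return conn
```

Identifiers that are not present in the catalog are simply absent from the 
returned mapping, so the caller can detect misses without extra queries.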

Proposed Work:

1. Design and implement a normalized schema extension in Airavata’s data 
catalog to represent ATLAS-style protein metadata, including support for 
multi-value domain classification fields (ECOD, CATH, SCOP).

2. Implement a metadata ingestion pipeline to import ATLAS protein records 
(~1,900 entries) into the catalog and validate correctness.

3. Add query telemetry instrumentation to metadata APIs to capture:
   - Filter predicates used
   - Query latency
   - Result set size
   - Field access frequency

4. Based on observed workload patterns, implement an index optimization module 
that:
   - Identifies high-frequency filter fields
   - Recommends and creates appropriate secondary or composite indexes
   - Benchmarks performance before and after index creation

5. Implement a batch metadata retrieval API optimized for 
similarity-search-driven workflows, enabling efficient bulk fetch of protein 
metadata records.

6. Evaluate performance under synthetic scaling (10k–100k records) to measure 
query latency improvements and indexing overhead.
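
Steps 3 and 4 above could be prototyped along the following lines. This is 
only a sketch under stated assumptions: the QueryTelemetry class, the 
filtered_query and apply_recommendations helpers, and the protein table are 
all hypothetical, SQLite stands in for whatever database backs the catalog, 
and only single-column indexes are considered (composite indexes are left 
out).

```python
import sqlite3
import time
from collections import Counter

# Columns that callers may filter on; also guards the dynamically built SQL.
ALLOWED_FIELDS = {"organism", "resolution_a"}


class QueryTelemetry:
    """Collects filter fields, latency and result size per query."""

    def __init__(self):
        self.field_counts = Counter()
        self.records = []

    def record(self, fields, latency_s, result_size):
        fields = tuple(fields)
        self.field_counts.update(fields)
        self.records.append(
            {"fields": fields, "latency_s": latency_s, "rows": result_size})

    def hot_fields(self, min_count):
        """Fields used as filters at least min_count times, most used first."""
        return [f for f, n in self.field_counts.most_common() if n >= min_count]


def filtered_query(conn, telemetry, filters):
    """Run an equality-filter query and record telemetry for it."""
    unknown = set(filters) - ALLOWED_FIELDS
    if not filters or unknown:
        raise ValueError(f"unsupported filter fields: {sorted(unknown)}")
    where = " AND ".join(f"{col} = ?" for col in filters)
    start = time.perf_counter()
    rows = conn.execute(
        f"SELECT pdb_chain FROM protein WHERE {where}",
        tuple(filters.values())).fetchall()
    telemetry.record(filters.keys(), time.perf_counter() - start, len(rows))
    return rows


def apply_recommendations(conn, telemetry, min_count):
    """Create a single-column index for every frequently filtered field."""
    created = []
    for field in telemetry.hot_fields(min_count):
        name = f"idx_protein_{field}"
        conn.execute(f"CREATE INDEX IF NOT EXISTS {name} ON protein({field})")
        created.append(name)
    return created


def plan(conn, sql, params=()):
    """Return SQLite's query plan text, for before/after comparison."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " ".join(row[3] for row in rows)
```

Comparing the output of plan() before and after apply_recommendations() gives 
a cheap first check that a new index is actually used; real benchmarking as 
described in step 6 would still need latency measurements on scaled data.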

Expected Outcomes:

- ATLAS metadata fully represented in Airavata’s data catalog with proper 
schema support.
- Telemetry-driven index optimization workflow implemented.
- Demonstrated reduction in metadata query latency for filter-heavy queries.
- Support for bulk metadata retrieval following similarity searches.
- Benchmarks and documentation demonstrating scalability improvements.

This issue will serve as the tracking issue for the GSoC 2026 proposal and 
scope discussion.

  was:
Background:

Apache Airavata’s data catalog currently provides structured metadata storage 
but does not support workload-aware indexing or adaptive optimization 
strategies. With the integration of external scientific datasets such as ATLAS 
(molecular dynamics simulations of ~1,900 proteins with rich structural and 
domain metadata), the catalog must support advanced retrieval patterns 
including:

- Key-based lookups by PDB chain
- Multi-field metadata filtering (organism, domain classification, resolution, 
MD metrics)
- Batch metadata retrieval following similarity searches
- Potential cross-system integration with sequence/structure search tools

Problem:

Static indexing strategies do not scale well for scientific metadata workloads 
where query patterns vary significantly between research users. Additionally, 
current retrieval APIs are optimized for single-key access and do not support 
efficient bulk or filtered access.

Proposed Work:

1. Extend Airavata’s data catalog schema to support protein-scale metadata 
(including multi-value domain classification fields such as ECOD, CATH, SCOP).
2. Implement query telemetry to capture metadata access patterns.
3. Design and implement an adaptive indexing engine that:
   - Tracks frequently filtered fields
   - Dynamically creates or adjusts secondary indexes
   - Optimizes composite index creation based on workload statistics
4. Implement batch metadata retrieval APIs optimized for 
similarity-search-driven workflows.
5. Evaluate performance under synthetic scaling (10k–100k protein metadata 
records).

Expected Outcomes:

- Improved metadata retrieval latency
- Reduced scan overhead for filter-heavy queries
- Better support for scientific discovery workflows
- A foundation for scalable ATLAS-style dataset integration

This issue will serve as the tracking issue for the GSoC 2026 proposal.


> [GSoC] Adaptive Metadata Indexing and ATLAS Integration in Airavata Data 
> Catalog
> --------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-3970
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-3970
>             Project: Airavata
>          Issue Type: Improvement
>            Reporter: Jayanth Vennamreddy
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
