Hi,

Quick update on the dataset indexing work. I pushed a second commit that adds the mdCATH connector and restructures the project to be generic rather than MD-specific, as we discussed in previous meetings.

What's new:

* mdCATH adapter that reads from HDF5 (or pre-extracted JSON), parses the CATH domain hierarchy, and aggregates per-temperature/replica stats into the same POSIX tree format as ATLAS
* Refactored the lib layer so it doesn't know about specific datasets; adapters control which fields go where through callbacks instead of hardcoded parameters like `sequence_field`
* Added a `.meta/` directory convention (`source.json` for provenance, `stats.json` for index metadata, `schema.json` for field definitions)
* Base crawler class with static and incremental strategies, ready for when we add datasets that update over time
* CI pipeline with pytest, ruff, and schema validation
* Go FUSE mount stub under `mount/`. It isn't functional yet, but it sets up the module and documents how the `.search/` virtual directory will work

The goal is that any dataset, not just MD ones, can go through the same pipeline and come out as a standard POSIX tree. That way, when we wire up the FUSE mount and CyberShuttle, everything works the same.

68 tests passing, linter clean. Repo: https://github.com/jayvenn21/gsoc-dataset-indexing

Let me know if the direction looks good or if you'd want anything changed before I move on to GPCRmd/MemProtMD or the FUSE side.
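For reference, the `.meta/` convention described above would put something like this alongside each indexed tree (file contents here are illustrative placeholders, not the repo's actual schema):

```
atlas/
├── 1abc_A/
│   └── ...
└── .meta/
    ├── source.json   # provenance: upstream URL, version, retrieval date
    ├── stats.json    # index metadata: entry counts, build timestamp
    └── schema.json   # field definitions for entries under this tree
```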
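To make the callback-based adapter idea concrete, here's a minimal Python sketch of the pattern; all names (`Adapter`, `extract_fields`, `index_record`) are illustrative, not the actual repo API:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
import json

# Hypothetical sketch: the generic layer only knows about raw records and a
# field-extraction callback supplied by each dataset adapter, instead of
# hardcoded parameters like sequence_field.
Record = dict

@dataclass
class Adapter:
    name: str
    # Callback that decides which fields go where in the output entry.
    extract_fields: Callable[[Record], dict]

def index_record(adapter: Adapter, record: Record, root: Path) -> Path:
    """Write one record into the POSIX tree under root/<dataset>/<id>/."""
    fields = adapter.extract_fields(record)
    entry_dir = root / adapter.name / fields["id"]
    entry_dir.mkdir(parents=True, exist_ok=True)
    (entry_dir / "entry.json").write_text(json.dumps(fields, indent=2))
    return entry_dir

# Example adapter for an imaginary dataset: the generic code above never
# sees dataset-specific key names like "domain_id" or "seq".
toy = Adapter(
    name="toy",
    extract_fields=lambda r: {"id": r["domain_id"], "sequence": r["seq"]},
)
```

The point of the pattern is that adding a new dataset means writing one callback, not threading new parameters through the generic layer.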
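And a toy sketch of how the static vs. incremental crawl strategies could split, again with all class and method names hypothetical rather than the repo's actual ones:

```python
from abc import ABC, abstractmethod
from typing import Iterable

class BaseCrawler(ABC):
    """Hypothetical base: subclasses decide which entries need (re)indexing."""

    @abstractmethod
    def discover(self) -> list[str]:
        """Return entry IDs to index on this run."""

class StaticCrawler(BaseCrawler):
    """Fixed dataset: always index the full listing."""

    def __init__(self, entries: Iterable[str]):
        self.entries = list(entries)

    def discover(self) -> list[str]:
        return list(self.entries)

class IncrementalCrawler(BaseCrawler):
    """Updating dataset: only re-index entries changed since a checkpoint."""

    def __init__(self, listing: dict[str, int], checkpoint: int):
        self.listing = listing        # entry id -> last-modified stamp
        self.checkpoint = checkpoint  # stamp recorded by the previous run

    def discover(self) -> list[str]:
        return [eid for eid, ts in self.listing.items() if ts > self.checkpoint]
```

With this shape, a dataset that never changes pays nothing extra, while an updating one only re-crawls deltas.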
Thanks, Jayanth
