RKuttruff opened a new pull request, #294: URL: https://github.com/apache/incubator-sdap-nexus/pull/294
# SDAP-472

Major overhaul of the `data-access` component of SDAP to support multiple data store backends simultaneously, with one new backend implemented to support gridded Zarr data stored either locally or in S3.

Datasets are defined in the `nexusdatasets` Solr collection. SDAP will poll that collection (currently hourly, on startup, and on execution of a dataset management query) and attempt to add/open any new datasets, and will drop any datasets that are no longer present. Datasets can still be defined in the manner they currently are; `nexusproto` is the default backend and requires no additional data.

## Adding Datasets

There are two ways to add new Zarr datasets: the hardcoded approach through the Collection Manager, or the dynamic approach through the dataset management endpoints.

### Collection Manager

A Zarr collection can be specified in the collection config YAML file as follows:

```yaml
collections:
  - id: dataset_name
    path: file:///path/to/zarr/root/
    projection: Grid
    priority: <number>
    dimensionNames:
      latitude: <latitude name>
      longitude: <longitude name>
      time: <time name>
      variable: <data var>
    storeType: zarr
  - id: dataset_s3
    path: s3://bucket/key/
    projection: GridMulti
    priority: <number>
    dimensionNames:
      latitude: <latitude name>
      longitude: <longitude name>
      time: <time name>
      variables:
        - <data var>
        - <data var>
        - <data var>
    storeType: zarr
    config:
      aws:
        accessKeyID: <AWS access key ID>
        secretAccessKey: <AWS secret access key>
        public: false
```

These datasets are strictly hardcoded and can (currently) only be removed by manually deleting the associated document from Solr; they cannot be deleted or altered through the dataset management endpoints. There is an [accompanying ingester PR](https://github.com/apache/incubator-sdap-ingester/pull/86) to facilitate this.

### Dataset Management Endpoints

Included is a set of endpoints to add, update, and remove Zarr datasets on the fly.

#### Add Dataset

- Path: `/datasets/add`
- Type: `POST`
- Params:
  - `name`: Name of the dataset to add
  - `path`: Path of the root of the Zarr group to add

Body (content types: `application/json`, `application/yaml`):

```yaml
variable: <var>
coords:
  latitude: <lat name>
  longitude: <lon name>
  time: <time name>
aws: # required if in S3
  public: false
  accessKeyID: <AWS access key ID>
  secretAccessKey: <AWS secret access key>
  region: <AWS region>
```

#### Update Dataset

- Path: `/datasets/update`
- Type: `POST`
- Params:
  - `name`: Name of the dataset to update

Body (content types: `application/json`, `application/yaml`): same format as `/datasets/add`.

#### Delete Dataset

- Path: `/datasets/remove`
- Type: `GET`
- Params:
  - `name`: Name of the dataset to delete
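To make the endpoint usage concrete, here is a minimal sketch using Python's `requests` library. The host/port, dataset name, variable, and coordinate names are all hypothetical placeholders; the request shapes follow the endpoint descriptions above.

```python
import requests

# Hypothetical SDAP deployment; adjust host/port to your instance.
BASE_URL = "http://localhost:8083"

# Add a Zarr dataset; the body mirrors the YAML/JSON schema above.
# All values here are placeholders.
body = {
    "variable": "analysed_sst",
    "coords": {"latitude": "lat", "longitude": "lon", "time": "time"},
}
resp = requests.post(
    f"{BASE_URL}/datasets/add",
    params={"name": "my_zarr_dataset", "path": "file:///data/my_zarr_root/"},
    json=body,  # sent as application/json
)
print(resp.status_code, resp.text)

# Remove the same dataset (a GET request, per the endpoint spec).
resp = requests.get(
    f"{BASE_URL}/datasets/remove",
    params={"name": "my_zarr_dataset"},
)
print(resp.status_code, resp.text)
```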
## Testing

This PR will require extensive testing to ensure that (a) the added Zarr backend fully, or at least mostly, supports all existing (previously functioning) SDAP algorithms, and (b) nothing is broken when using the `nexusproto` backend (i.e., all existing functionality is preserved). Ideally, this would require little to no adaptation of the individual algorithm implementations; however, it appears that a number of them will require small changes that _should_ not have any impact on `nexusproto` functionality.

### `nexusproto` Testing

Currently, the interface between the algorithms and the data backends routes data requests to the `nexusproto` backend by default (i.e., if no target dataset was given or could be determined). This may not end up being desirable and may be removed depending on discussion for this PR.

With that in mind, the net result of this defaulting is that this PR _should_ not break any existing functionality. As a quick check, I ran a test suite for endpoints used by the CDMS project and found no endpoints failing or returning inconsistent data. Further tests will be conducted when verifying that queries return the same results when run against the same dataset in `nexusproto` and Zarr.

### `zarr` Testing

The following table lists the algorithms/endpoints that have been tested with Zarr support. The 'Working' column indicates that the endpoint successfully returns data; the 'Validated' column indicates that the returned data is identical to the same query on the same dataset ingested to `nexusproto`; the 'Alterations' column lists the alterations needed to get the algorithm working (detailed below).

| Endpoint                        | Working | Validated | Alterations |
|---------------------------------|:-------:|:---------:|-------------|
| `/datainbounds`                 |    X    |     X     |             |
| `/cdmssubset`                   |    X    |     X     | e           |
| `/timeSeriesSpark`              |    X    |           | c           |
| `/latitudeTimeHofMoellerSpark`  |    X    |           | b,c         |
| `/longitudeTimeHofMoellerSpark` |    X    |           | b,c         |
| `/timeAvgMapSpark`              |    X    |           | a           |
| `/match_spark`                  |    X    |     X     | c           |
| `/corrMapSpark`                 |    X    |           | b,d         |
| `/dailydifferenceaverage_spark` |    X    |           | c           |
| `/maxMinMapSpark`               |    X    |           |             |
| `/climMapSpark`                 |    X    |           | b           |
| `/varianceSpark`                |    X    |           | b           |

a. Dependent on @kevinmarlis's #259 -- now merged; no longer a concern
b. Dependent on changes similar to (a), outlined in #272
c. Modifications to some NTS calls (specified source dataset, &c.)
d. Bug fix unrelated to backend changes
e. Dependent on #268

<hr>

The hope for this implementation was that it would integrate seamlessly with the existing algorithms; however, it appears some algorithms will need to add kwargs to certain `NexusTileService` functions. In particular, it is now imperative that the target dataset name be given via the `dataset` or `ds` kwarg; if neither is specified in the function definition (as in `find_tile_by_id`), either kwarg can be used (see the sketch at the end of this description).

<hr>

Originally #265, which was closed automatically on merge & delete, so I had to reopen.
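As a reference for the kwarg requirement described above, here is a minimal sketch. `find_tile_by_id` is the function named in this PR; the import path, tile ID, and dataset name are assumptions/placeholders, not prescriptions.

```python
from nexustiles.nexustiles import NexusTileService

tile_service = NexusTileService()
tile_id = "00000000-0000-0000-0000-000000000000"  # hypothetical tile ID

# `find_tile_by_id` does not name the kwarg in its definition, so either
# spelling works; both route the lookup to the backend owning the dataset.
tiles = tile_service.find_tile_by_id(tile_id, ds="my_zarr_dataset")
tiles = tile_service.find_tile_by_id(tile_id, dataset="my_zarr_dataset")

# Omitting the dataset name falls back to the default `nexusproto` backend.
tiles = tile_service.find_tile_by_id(tile_id)
```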