This is an automated email from the ASF dual-hosted git repository.

rkk pushed a commit to branch develop
in repository https://gitbox.apache.org/repos/asf/sdap-nexus.git


The following commit(s) were added to refs/heads/develop by this push:
     new d7fea77  SDAP-518 - Collection Config Docs (#311)
d7fea77 is described below

commit d7fea77ba3f69a2440c5ad66fb07e3ec9f464c0b
Author: Riley Kuttruff <[email protected]>
AuthorDate: Thu Jun 6 16:07:51 2024 -0700

    SDAP-518 - Collection Config Docs (#311)
    
    * Initial work on CC docs
    
    * Add collections to index toctree
    
    * YAML highlighting
    
    * NetCDF section
    
    * remove incubation msg from intro.rst
    
    * Added recs for gridded tile size
    
    ---------
    
    Co-authored-by: rileykk <[email protected]>
---
 docs/collections.rst | 164 +++++++++++++++++++++++++++++++++++++++++++++++++++
 docs/index.rst       |   7 +--
 docs/intro.rst       |   4 --
 3 files changed, 166 insertions(+), 9 deletions(-)

diff --git a/docs/collections.rst b/docs/collections.rst
new file mode 100644
index 0000000..becd38f
--- /dev/null
+++ b/docs/collections.rst
@@ -0,0 +1,164 @@
+.. _collections:
+
+***********************
+Collection Config Guide
+***********************
+
+Introduction
+============
+
+The Collection Config is a configuration file that defines collections to be 
ingested and maintained in SDAP. Currently,
+it supports defining collections of NetCDF data, which will be processed into 
the custom NEXUS protobuf tile format, or gridded
+Zarr data, which can be used by SDAP directly with no need for processing. SDAP 
Ingester supports source data stored
+in AWS S3 or on the local filesystem (though currently not both at the same 
time).
+
+This guide will explain how to set up both protobuf and Zarr collections.
+
+.. _collections-basics:
+
+Basic Structure
+===============
+
+The Collection Config is a YAML file containing a single list named 
``collections``:
+
+.. code-block:: yaml
+
+  collections: []
+
+Each item in this list defines one collection and has the following basic 
structure:
+
+.. code-block:: yaml
+
+  - id: <single variable collection name>
+    path: <root collection location. Local path or S3 URI>
+    priority: <queue priority>
+    projection: <Grid | Swath>
+    dimensionNames:
+      latitude: <name of the latitude coordinate in the data>
+      longitude: <name of the longitude coordinate in the data>
+      time: <name of the time coordinate in the data>
+      variable: <variable name>
+  - id: <multi variable collection name>
+    path: <root collection location. Local path or S3 URI>
+    priority: <queue priority>
+    projection: <GridMulti | SwathMulti>
+    dimensionNames:
+      latitude: <name of the latitude coordinate in the data>
+      longitude: <name of the longitude coordinate in the data>
+      time: <name of the time coordinate in the data>
+      variables:
+      - <variable name 1>
+      - <variable name 2>
+      - <variable name 3>
+
+There are slight variations and additions to this structure depending on the 
type of collection, which will be covered below.
+
+.. _collections-nc:
+
+NetCDF - Protobuf Collections
+=============================
+
+For NetCDF data, you'll also need to tell the Ingester how big to make the 
tiles. This is set with the ``slices``
+object, which is a dictionary mapping dimension names to slice lengths. 
Omitted dimensions are assumed to be 1. It is important
+to choose tile sizes that are not so large that they cause unnecessary data 
transfer, but also not so small that they cause
+an explosion in the number of generated tiles, which leads to excessive 
metadata storage overhead and possible performance
+degradation. For gridded data, we recommend tile sizes between 30 x 30 and 
100 x 100. We also strongly recommend that swath tiles be
+sized no larger than 15 x 15, as the current method for handling swath data is 
very memory inefficient, and its memory use grows rapidly with tile size.
+
+.. note:: The source dataset dimension names are used in slice definitions, 
not the coordinate names as in the ``dimensionNames`` object. In gridded 
datasets, these names are often the same, but this is not the case for swath 
data.
+
+Example:
+
+.. code-block:: yaml
+
+  collections:
+  - id: MUR25-JPL-L4-GLOB-v04.2
+    path: s3://mur-sst/zarr-v1/
+    priority: 1
+    projection: Grid
+    dimensionNames:
+      latitude: lat
+      longitude: lon
+      time: time
+      variable: analysed_sst
+    slices:
+      lat: 100
+      lon: 100
+      time: 1
+  - id: ASCATB-L2-Coastal
+    path: s3://example-bucket/swath-path/
+    priority: 1
+    projection: SwathMulti
+    dimensionNames:
+      latitude: lat
+      longitude: lon
+      time: time
+      variables:
+      - wind_speed
+      - wind_dir
+    slices:
+      NUMROWS: 15
+      NUMCELLS: 15
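+
+To make the tile size trade-off described above concrete, here is a rough sketch of the tile counts that two slice choices would produce. It assumes a hypothetical 720 x 1440 (lat x lon) global grid, and that each dimension is simply cut into consecutive slices with a partial slice at the edge; the actual Ingester behavior may differ:
+
+.. code-block:: yaml
+
+  # Hypothetical 720 x 1440 grid, one time step per tile.
+  slices:
+    lat: 100   # ceil(720 / 100)  = 8 slices
+    lon: 100   # ceil(1440 / 100) = 15 slices
+    time: 1    # 8 * 15 = 120 tiles per time step
+
+  # The same grid with the smallest recommended tiles:
+  # slices:
+  #   lat: 30    # 720 / 30  = 24 slices
+  #   lon: 30    # 1440 / 30 = 48 slices
+  #   time: 1    # 24 * 48 = 1152 tiles per time step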
+
+
+.. _collections-zarr:
+
+Zarr Collections
+================
+
+To specify a collection as a Zarr collection, simply add ``storeType: zarr`` 
to the collection object. If the data is local,
+this is all you need to do.
+
+.. code-block:: yaml
+
+  id: <collection name>
+  path: <root collection location. Local path>
+  priority: <queue priority>
+  projection: <Grid | GridMulti>
+  storeType: zarr
+  dimensionNames:
+    latitude: <name of the latitude coordinate in the data>
+    longitude: <name of the longitude coordinate in the data>
+    time: <name of the time coordinate in the data>
+    variable: <variable name>
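+
+As a purely illustrative sketch, a filled-in local Zarr collection might look like the following. The collection id, filesystem path, and variable name are hypothetical placeholders, not values from a real deployment:
+
+.. code-block:: yaml
+
+  collections:
+  - id: example-local-zarr          # hypothetical collection name
+    path: /data/zarr/example.zarr/  # hypothetical local path
+    priority: 1
+    projection: Grid
+    storeType: zarr                 # marks this as a Zarr collection
+    dimensionNames:
+      latitude: lat
+      longitude: lon
+      time: time
+      variable: sst                 # hypothetical variable name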
+
+For data in S3, you need to provide information on how to access the data. 
This is currently done with the ``config.aws`` object.
+
+You will need to either provide credentials to access the bucket or specify 
that it is public.
+
+Example:
+
+.. code-block:: yaml
+
+  collections:
+  - id: MUR_SST
+    path: s3://mur-sst/zarr-v1/
+    priority: 1
+    projection: Grid
+    storeType: zarr
+    dimensionNames:
+      latitude: lat
+      longitude: lon
+      time: time
+      variable: analysed_sst
+    config:
+      aws:
+        public: true
+  - id: private_data
+    path: s3://example-bucket/zarr/path/
+    priority: 1
+    projection: GridMulti
+    storeType: zarr
+    dimensionNames:
+      latitude: lat
+      longitude: lon
+      time: time
+      variables:
+      - var1
+      - var2
+      - var3
+    config:
+      aws:
+        accessKeyID: <secret>
+        secretAccessKey: <secret>
+        public: false
diff --git a/docs/index.rst b/docs/index.rst
index 649d0e7..d942a6f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,15 +1,12 @@
-Welcome to incubator-sdap-nexus's documentation!
+Welcome to the Apache SDAP project documentation!
 ================================================
 
-.. warning::
-
-  Apache incubator-sdap-nexus is an effort undergoing incubation at The Apache 
Software Foundation (ASF), sponsored by the name of Apache TLP sponsor. 
Incubation is required of all newly accepted projects until a further review 
indicates that the infrastructure, communications, and decision making process 
have stabilized in a manner consistent with other successful ASF projects. 
While incubation status is not necessarily a reflection of the completeness or 
stability of the code, it does  [...]
-
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
 
    intro
+   collections
    quickstart
    build
    dockerimages
diff --git a/docs/intro.rst b/docs/intro.rst
index 5ebe701..508e75c 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -1,9 +1,5 @@
 .. _intro:
 
-.. warning::
-
-  Apache incubator-sdap-nexus is an effort undergoing incubation at The Apache 
Software Foundation (ASF), sponsored by the name of Apache TLP sponsor. 
Incubation is required of all newly accepted projects until a further review 
indicates that the infrastructure, communications, and decision making process 
have stabilized in a manner consistent with other successful ASF projects. 
While incubation status is not necessarily a reflection of the completeness or 
stability of the code, it does  [...]
-
 *******************
 About NEXUS
 *******************
