This is an automated email from the ASF dual-hosted git repository.
rkk pushed a commit to branch develop
in repository https://gitbox.apache.org/repos/asf/sdap-nexus.git
The following commit(s) were added to refs/heads/develop by this push:
new d7fea77 SDAP-518 - Collection Config Docs (#311)
d7fea77 is described below
commit d7fea77ba3f69a2440c5ad66fb07e3ec9f464c0b
Author: Riley Kuttruff <[email protected]>
AuthorDate: Thu Jun 6 16:07:51 2024 -0700
SDAP-518 - Collection Config Docs (#311)
* Initial work on CC docs
* Add collections to index toctree
* YAML highlighting
* NetCDF section
* remove incubation msg from intro.rst
* Added recs for gridded tile size
---------
Co-authored-by: rileykk <[email protected]>
---
docs/collections.rst | 164 +++++++++++++++++++++++++++++++++++++++++++++++++++
docs/index.rst | 7 +--
docs/intro.rst | 4 --
3 files changed, 166 insertions(+), 9 deletions(-)
diff --git a/docs/collections.rst b/docs/collections.rst
new file mode 100644
index 0000000..becd38f
--- /dev/null
+++ b/docs/collections.rst
@@ -0,0 +1,164 @@
+.. _collections:
+
+***********************
+Collection Config Guide
+***********************
+
+Introduction
+============
+
+The Collection Config is a configuration file that defines collections to be ingested and maintained in SDAP. Currently,
+it supports defining collections of NetCDF data, which are processed into the custom NEXUS protobuf tile format, and of gridded
+Zarr data, which SDAP can use directly with no need for processing. The SDAP Ingester supports source data stored
+in AWS S3 or on the local filesystem, but currently not both at the same time.
+
+This guide will explain how to set up both protobuf and Zarr collections.
+
+.. _collections-basics:
+
+Basic Structure
+===============
+
+The Collection Config is a YAML file containing a single list named ``collections``:
+
+.. code-block:: yaml
+
+    collections: []
+
+Each item in this list defines one collection and has the following basic structure:
+
+.. code-block:: yaml
+
+    - id: <single variable collection name>
+      path: <root collection location. Local path or S3 URI>
+      priority: <queue priority>
+      projection: <Grid | Swath>
+      dimensionNames:
+        latitude: <name of the latitude coordinate in the data>
+        longitude: <name of the longitude coordinate in the data>
+        time: <name of the time coordinate in the data>
+        variable: <variable name>
+    - id: <multi variable collection name>
+      path: <root collection location. Local path or S3 URI>
+      priority: <queue priority>
+      projection: <GridMulti | SwathMulti>
+      dimensionNames:
+        latitude: <name of the latitude coordinate in the data>
+        longitude: <name of the longitude coordinate in the data>
+        time: <name of the time coordinate in the data>
+        variables:
+          - <variable name 1>
+          - <variable name 2>
+          - <variable name 3>
+
+There are slight variations and additions to this structure depending on the type of collection, which are covered below.
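+
+As a rough illustration of the single-variable form, a filled-in entry might look like the sketch below. The collection
+name, local path, and variable name are hypothetical placeholders, and only the common keys listed above are shown, so
+any type-specific additions are left out.
+
+.. code-block:: yaml
+
+    collections:
+      - id: example-sst-collection          # hypothetical collection name
+        path: /data/example-sst/            # hypothetical local path
+        priority: 1
+        projection: Grid
+        dimensionNames:
+          latitude: lat
+          longitude: lon
+          time: time
+          variable: sea_surface_temperature # hypothetical variable name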
+
+.. _collections-nc:
+
+NetCDF - Protobuf Collections
+=============================
+
+For NetCDF data, you'll also need to tell the Ingester how big to make the tiles. This is set with the ``slices``
+object, which is a dictionary mapping dimension names to slice lengths. Omitted dimensions are assumed to be 1. It is important
+to choose tile sizes that are not so large that they cause unnecessary data transfer, but also not so small that they cause
+an explosion in the number of generated tiles, which leads to excessive metadata storage overhead and possible performance
+degradation. For gridded data, we recommend tile sizes between 30 x 30 and 100 x 100. We also strongly recommend that swath tiles be
+no larger than 15 x 15, as the current method for handling swath data is very memory-inefficient and scales rapidly with tile size.
+
+.. note:: The source dataset dimension names are used in slice definitions, not the coordinate names as in the ``dimensionNames`` object. In gridded datasets, these names are often the same, but this is not the case for swath data.
+
+Example:
+
+.. code-block:: yaml
+
+    collections:
+      - id: MUR25-JPL-L4-GLOB-v04.2
+        path: s3://mur-sst/zarr-v1/
+        priority: 1
+        projection: Grid
+        dimensionNames:
+          latitude: lat
+          longitude: lon
+          time: time
+          variable: analysed_sst
+        slices:
+          lat: 100
+          lon: 100
+          time: 1
+      - id: ASCATB-L2-Coastal
+        path: s3://example-bucket/swath-path/
+        priority: 1
+        projection: SwathMulti
+        dimensionNames:
+          latitude: lat
+          longitude: lon
+          time: time
+          variables:
+            - wind_speed
+            - wind_dir
+        slices:
+          NUMROWS: 15
+          NUMCELLS: 15
+
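+Because omitted dimensions are assumed to be 1, the ``slices`` object of the gridded collection in the example above could
+presumably also be written without its ``time`` entry. The fragment below is a sketch of that reading, not a tested configuration:
+
+.. code-block:: yaml
+
+    slices:
+      lat: 100
+      lon: 100
+      # time is omitted here; per the text above it is assumed to be 1
+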
+
+.. _collections-zarr:
+
+Zarr Collections
+================
+
+To specify a collection as a Zarr collection, simply add ``storeType: zarr`` to the collection object. If the data is local,
+this is all you need to do.
+
+.. code-block:: yaml
+
+    id: <collection name>
+    path: <root collection location. Local path>
+    priority: <queue priority>
+    projection: <Grid | GridMulti>
+    storeType: zarr
+    dimensionNames:
+      latitude: <name of the latitude coordinate in the data>
+      longitude: <name of the longitude coordinate in the data>
+      time: <name of the time coordinate in the data>
+      variable: <variable name>
+
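+For instance, a local Zarr collection might be declared as in the sketch below. The collection name, path, and
+variable name are hypothetical placeholders chosen for illustration:
+
+.. code-block:: yaml
+
+    collections:
+      - id: example-local-zarr              # hypothetical collection name
+        path: /data/zarr/example.zarr/      # hypothetical local path
+        priority: 1
+        projection: Grid
+        storeType: zarr
+        dimensionNames:
+          latitude: lat
+          longitude: lon
+          time: time
+          variable: example_variable        # hypothetical variable name
+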
+For data in S3, you need to provide information on how to access the data. This is currently done with the ``config.aws`` object.
+
+You will need to either provide credentials to access the bucket or specify that it is public.
+
+Example:
+
+.. code-block:: yaml
+
+    collections:
+      - id: MUR_SST
+        path: s3://mur-sst/zarr-v1/
+        priority: 1
+        projection: Grid
+        storeType: zarr
+        dimensionNames:
+          latitude: lat
+          longitude: lon
+          time: time
+          variable: analysed_sst
+        config:
+          aws:
+            public: true
+      - id: private_data
+        path: s3://example-bucket/zarr/path/
+        priority: 1
+        projection: GridMulti
+        storeType: zarr
+        dimensionNames:
+          latitude: lat
+          longitude: lon
+          time: time
+          variables:
+            - var1
+            - var2
+            - var3
+        config:
+          aws:
+            accessKeyID: <secret>
+            secretAccessKey: <secret>
+            public: false
diff --git a/docs/index.rst b/docs/index.rst
index 649d0e7..d942a6f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,15 +1,12 @@
-Welcome to incubator-sdap-nexus's documentation!
+Welcome to the Apache SDAP project documentation!
================================================
-.. warning::
-
- Apache incubator-sdap-nexus is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the name of Apache TLP sponsor. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does [...]
-
.. toctree::
:maxdepth: 2
:caption: Contents:
intro
+ collections
quickstart
build
dockerimages
diff --git a/docs/intro.rst b/docs/intro.rst
index 5ebe701..508e75c 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -1,9 +1,5 @@
.. _intro:
-.. warning::
-
- Apache incubator-sdap-nexus is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the name of Apache TLP sponsor. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does [...]
-
*******************
About NEXUS
*******************