adutra commented on code in PR #4613:
URL: https://github.com/apache/polaris/pull/4613#discussion_r3389738729


##########
site/content/in-dev/unreleased/directories.md:
##########
@@ -0,0 +1,227 @@
+---
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+title: Directories
+type: docs
+weight: 450
+---
+
+## Overview
+
+Directories make objects (including unstructured data like images, videos, 
documents, and other objects) discoverable alongside
+structured Iceberg tables within a Polaris catalog. 
+A directory points to a base location/prefix on an object store and automatically tracks the objects it contains by maintaining
+an Iceberg table with object-level metadata such as URI, size, content type, and checksum.
+
+This means query engines and tools that already know how to read Iceberg tables can discover and
+access unstructured data with little or no extra work beyond accessing the object itself.
+
+## Concepts
+
+A directory has two main parts:
+
+1. **Directory configuration** — stored by the Polaris server. It describes 
_where_ the data lives,
+   how to authenticate, which objects to include, and how often to re-scan. 
The configuration "lives" in a namespace.
+2. **Directory table** — an Iceberg table serving as the inventory of all 
objects contained in the directory, one row per object discovered during a scan.
+   The directory table takes the same name as the directory configuration.
+
+The Polaris server itself does not perform scans. Instead, external services (e.g. a directory table scanning service) read the directory configuration through the REST API,
+walk the object store, and write the results into the directory table.
+
+## Configuration
+
+A directory is described by the following fields:
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `name` | `string` | Yes | The name of the directory. It is used in REST 
endpoint paths and becomes the name of the corresponding Iceberg table. |
+| `base_location` | `string` | Yes | The base location to use as a root for 
scanning objects in the object store (for example `s3://my-bucket/images/`, 
`gs://my-bucket/docs/`, or `file:///data/local/`). |
+| `filter` | `object` | No | Include and exclude patterns that control which 
objects are added to the directory table during a scan. See [Filter](#filter). |
+| `scan-schedule` | `object` | No | Object representing a scan schedule 
(trigger, cron, ...). |
+
+### Storage access
+
+Directories use Polaris `StorageAccessConfig` to access the object store.
+
+### Filter
+
+The `filter` object controls which objects are included in the directory table 
during a scan.
+
+If no filter is set, all objects found under the directory `base_location` are 
included.
+
+`include` and `exclude` are lists of regular expressions matched against the 
object's full URI. An
+object is included if it matches at least one `include` pattern (or `include` 
is omitted) and does
+not match any `exclude` pattern.
+
+**Example** — include only JPEG and PNG images, but exclude thumbnails:
+
+```json
+{
+  "filter": {
+    "include": [".*\\.jpg$", ".*\\.png$"],
+    "exclude": [".*thumbs/.*"]
+  }
+}
+```
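
The include/exclude rule described above could be sketched roughly as follows. This is only an illustration, not the scanner's implementation: the function name `is_included` is hypothetical, and it assumes unanchored matching (`re.search`) with Python's regex dialect, neither of which the text specifies.

```python
import re
from typing import Optional, Sequence


def is_included(
    uri: str,
    include: Optional[Sequence[str]] = None,
    exclude: Optional[Sequence[str]] = None,
) -> bool:
    """Return True if `uri` passes the filter (hypothetical helper)."""
    # An omitted `include` list (None) accepts every object.
    included = include is None or any(re.search(p, uri) for p in include)
    # Any matching `exclude` pattern rejects the object.
    excluded = exclude is not None and any(re.search(p, uri) for p in exclude)
    return included and not excluded


# Patterns from the JSON example above, after JSON string unescaping.
include = [r".*\.jpg$", r".*\.png$"]
exclude = [r".*thumbs/.*"]

kept = is_included("s3://warehouse/product-images/a.jpg", include, exclude)
skipped = is_included("s3://warehouse/product-images/thumbs/a.jpg", include, exclude)
```

Under these assumptions `kept` is true and `skipped` is false, because the second URI matches the `thumbs/` exclude pattern even though it also matches an include pattern.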
+
+### Example: creating a directory
+
+```json
+POST /v1/{prefix}/namespaces/{namespace}/directories
+
+{
+  "name": "product-images",
+  "base_location": "s3://warehouse/product-images/",
+  "filter": {
+    "include": [".*\\.jpg$", ".*\\.png$"]
+  },
+  "scan-schedule": {
+    "cron": "0 * * * *"
+  }
+}
+```
+
+## REST endpoints
+
+Directories are managed through the following REST endpoints. All paths are 
relative to the catalog
+base URL.
+
+### List directories
+
+**GET** `/v1/{prefix}/namespaces/{namespace}/directories`
+
+Returns the list of directory identifiers in the given namespace.
+
+### Create a directory
+
+**POST** `/v1/{prefix}/namespaces/{namespace}/directories`
+
+Creates a new directory and its corresponding Iceberg table. The request body 
must contain the
+directory configuration (see [Configuration](#configuration)).
+
+### Get directory details
+
+**GET** `/v1/{prefix}/namespaces/{namespace}/directories/{directory}`
+
+Returns the full configuration of the specified directory, including filter 
and scan schedule.
+
+### Drop a directory
+
+**DELETE** `/v1/{prefix}/namespaces/{namespace}/directories/{directory}`
+
+Removes the directory and its associated Iceberg table from the namespace.
+
+## Directory table
+
+When a directory is created, Polaris creates an Iceberg table in the same 
namespace using the directory
+`name` as the table name. The table uses the following schema:
+
+| Field Id | Field Name | Type | Required | Description |
+|----------|------------|------|----------|-------------|
+| 1 | `file_uri` | `string` | Yes | The fully qualified URI of the object (for 
example `s3://my-bucket/images/photo.jpg`). |
+| 2 | `content_type` | `string` | No | The MIME content type (RFC 2045) of the 
object (for example `image/jpeg`, `application/pdf`). |
+| 3 | `size` | `long` | No | The size of the object in bytes. |
+| 4 | `checksum_algorithm` | `string` | No | The name of the checksum 
algorithm (for example `MD5`, `SHA-256`, `CRC32`). |
+| 5 | `checksum` | `string` | No | The object checksum value computed with the 
algorithm specified in `checksum_algorithm`. |
+| 6 | `last_modified` | `timestamptz` | No | The last modification timestamp 
of the object as reported by the object store. |
+| 7 | `metadata` | `map<string, string>` | No | Additional labels and tags 
from the object store (for example S3 user-defined metadata, storage class, or 
content encoding). |
+
+Clients (query engines, analytics tools, etc.) interact with the directory 
table as a regular Iceberg
+table. For instance, you can query it with Spark SQL:
+
+```sql
+SELECT file_uri, size, last_modified
+FROM my_catalog.my_namespace.product_images
+WHERE content_type = 'image/jpeg'
+  AND size > 1048576
+ORDER BY last_modified DESC;
+```
+
+### Credential vending
+
+The directory table might contain additional fields for credential vending.

Review Comment:
   I would like to understand how we envision credential vending. 
   
   Right now, if the spec were live, clients could request vended credentials 
*for the inventory table*, but would still need their own storage credentials 
to access the actual data.
   
   I think it would add real value if Polaris could also be the governing 
entity that grants access to the remote files, via some form of credential 
vending.
   
   It's too soon to sort out the details, but my gut feeling is that we'll need 
a form of "fast authz checks", similar to what was proposed in #3995 for remote 
signing. In that case, the inventory table would likely have to store some 
encrypted payload that is later used to generate the storage credentials.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
