This is an automated email from the ASF dual-hosted git repository.
potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git
The following commit(s) were added to refs/heads/main by this push:
new 69bac3fba8 Improve docs on objectstorage (#35294)
69bac3fba8 is described below
commit 69bac3fba897f9e7b0af642c97f9af0987a875de
Author: Bolke de Bruin <[email protected]>
AuthorDate: Tue Oct 31 22:29:52 2023 +0100
Improve docs on objectstorage (#35294)
* Improve docs on objectstorage
Add more explanations and limitations, and add an example
of attaching a filesystem.
---
airflow/providers/amazon/aws/fs/s3.py | 2 +-
.../apache-airflow/core-concepts/objectstorage.rst | 74 +++++++++++++++++++++-
2 files changed, 74 insertions(+), 2 deletions(-)
diff --git a/airflow/providers/amazon/aws/fs/s3.py b/airflow/providers/amazon/aws/fs/s3.py
index c2eefcc379..d64adc401f 100644
--- a/airflow/providers/amazon/aws/fs/s3.py
+++ b/airflow/providers/amazon/aws/fs/s3.py
@@ -52,7 +52,7 @@ def get_fs(conn_id: str | None) -> AbstractFileSystem:
raise ImportError(
"Airflow FS S3 protocol requires the s3fs library, but it is not
installed as it requires"
"aiobotocore. Please install the s3 protocol support library by
running: "
- "pip install apache-airflow[s3]"
+ "pip install apache-airflow-providers-amazon[s3fs]"
)
aws: AwsGenericHook = AwsGenericHook(aws_conn_id=conn_id, client_type="s3")
diff --git a/docs/apache-airflow/core-concepts/objectstorage.rst b/docs/apache-airflow/core-concepts/objectstorage.rst
index 493df2675a..194db0477e 100644
--- a/docs/apache-airflow/core-concepts/objectstorage.rst
+++ b/docs/apache-airflow/core-concepts/objectstorage.rst
@@ -23,6 +23,12 @@ Object Storage
.. versionadded:: 2.8.0
+All major cloud providers offer persistent data storage in object stores. These are not classic
+"POSIX" file systems. In order to store hundreds of petabytes of data without any single point
+of failure, object stores replace the classic file system directory tree with a simpler model
+of object-name => data. To enable remote access, operations on objects are usually offered as
+(slow) HTTP REST operations.
+
Airflow provides a generic abstraction on top of object stores, like s3, gcs, and azure blob storage.
This abstraction allows you to use a variety of object storage systems in your DAGs without having to
change your code to deal with every different object storage system. In addition, it allows you to use
@@ -38,6 +44,21 @@ scheme.
it depends on ``aiobotocore``, which is not installed by default as it can create dependency
challenges with ``botocore``.
+Cloud Object Stores are not real file systems
+---------------------------------------------
+Object stores are not real file systems, although they can appear to be. They do not support all the
+operations that a real file system does. Key differences are:
+
+* No guaranteed atomic rename operation. This means that if you move a file from one location to another,
+  it will be copied and then deleted. Because the operation is not atomic, a failure partway through can
+  leave the data in an inconsistent state.
+* Directories are emulated and might make working with them slow. For example, listing a directory might
+  require listing all the objects in the bucket and filtering them by prefix (see the example below).
+* Seeking within a file may require significant call overhead, hurting performance, or might not be
+  supported at all.
+
+Airflow relies on `fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`_ to provide a consistent
+experience across different object storage systems. It implements local file caching to speed up access.
+However, you should be aware of the limitations of object storage when designing your DAGs.
+
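+As a rough illustration of the emulated-directory point above (the bucket name is a placeholder and the
+``s3fs`` backend is assumed to be installed and configured), a "directory" listing through ``fsspec``
+translates into a prefix query against the object store:
+
+.. code-block:: python
+
+    import fsspec
+
+    # There is no real directory "raw/"; the store returns every object key
+    # that starts with that prefix.
+    fs = fsspec.filesystem("s3")
+    print(fs.ls("my-bucket/raw/"))
+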
.. _concepts:basic-use:
@@ -110,6 +131,38 @@ Leveraging XCOM, you can pass paths between tasks:
read >> write
+Configuration
+-------------
+
+In its basic use, the object storage abstraction does not require much configuration and relies upon the
+standard Airflow connection mechanism. This means that you can use the ``conn_id`` argument to specify
+the connection to use. Any settings defined on the connection are pushed down to the underlying implementation.
+For example, if you are using s3, you can specify the ``aws_access_key_id`` and ``aws_secret_access_key``,
+but also add extra arguments like ``endpoint_url`` to specify a custom endpoint.
+
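+For example (the connection id and bucket below are placeholders), a path simply references the
+connection by id:
+
+.. code-block:: python
+
+    from airflow.io.store.path import ObjectStoragePath
+
+    # "aws_default" stands in for any S3-type connection; credentials and
+    # extras such as ``endpoint_url`` are taken from that connection.
+    path = ObjectStoragePath("s3://my-bucket/my-data.csv", conn_id="aws_default")
+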
+Alternative backends
+^^^^^^^^^^^^^^^^^^^^
+
+It is possible to configure an alternative backend for a scheme or protocol. This is done by attaching
+a ``backend`` to the scheme. For example, to enable the Databricks backend for the ``dbfs`` scheme, you
+would do the following:
+
+.. code-block:: python
+
+    from airflow.io.store.path import ObjectStoragePath
+    from airflow.io.store import attach
+
+    from fsspec.implementations.dbfs import DBFSFileSystem
+
+    attach(protocol="dbfs", fs=DBFSFileSystem(instance="myinstance", token="mytoken"))
+    base = ObjectStoragePath("dbfs://my-location/")
+
+
+.. note::
+    To reuse the registration across tasks, make sure to attach the backend at the top level of your DAG.
+    Otherwise, the backend will not be available across multiple tasks.
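+
+For illustration, a minimal sketch of a DAG file (instance, token and location are placeholders) that
+attaches the backend at module level so that every task can resolve the ``dbfs`` scheme:
+
+.. code-block:: python
+
+    import pendulum
+
+    from airflow.decorators import dag, task
+    from airflow.io.store import attach
+    from airflow.io.store.path import ObjectStoragePath
+    from fsspec.implementations.dbfs import DBFSFileSystem
+
+    # Attached at import time of the DAG file, so the registration exists in
+    # every process that parses or runs the DAG.
+    attach(protocol="dbfs", fs=DBFSFileSystem(instance="myinstance", token="mytoken"))
+
+    base = ObjectStoragePath("dbfs://my-location/")
+
+
+    @dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
+    def dbfs_example():
+        @task
+        def list_files() -> list[str]:
+            # The attached backend is resolved for the "dbfs" protocol here.
+            return [str(f) for f in base.iterdir() if f.is_file()]
+
+        list_files()
+
+
+    dbfs_example()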
+
+
.. _concepts:api:
Path-like API
@@ -222,6 +275,25 @@ Copying and Moving
This documents the expected behavior of the ``copy`` and ``move`` operations, particularly for cross object store (e.g.
file -> s3) behavior. Each method copies or moves files or directories from a ``source`` to a ``target`` location.
The intended behavior is the same as specified by
-`fsspec <https://filesystem-spec.readthedocs.io/en/latest/copying.html>`_. For cross object store directory copying,
+``fsspec``. For cross object store directory copying,
Airflow needs to walk the directory tree and copy each file individually. This is done by streaming each file from the
source to the target.
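+
+As a minimal sketch (the local path, bucket and connection id are placeholders), a cross store copy
+streams the data from the source to the target rather than performing a server-side copy:
+
+.. code-block:: python
+
+    from airflow.io.store.path import ObjectStoragePath
+
+    source = ObjectStoragePath("file:///tmp/example.parquet")
+    target = ObjectStoragePath("s3://my-bucket/example.parquet", conn_id="aws_default")
+
+    # source and target live on different stores, so the data is read from the
+    # local file and streamed to s3.
+    source.copy(target)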
+
+
+External Integrations
+---------------------
+
+Many other projects, like DuckDB and Apache Iceberg, can make use of the object storage abstraction. Often this is
+done by passing the underlying ``fsspec`` implementation. For this purpose ``ObjectStoragePath`` exposes
+the ``fs`` property. For example, the following works with ``duckdb`` so that the connection details from Airflow
+are used to connect to s3 and a parquet file, referenced by an ``ObjectStoragePath``, is read:
+
+.. code-block:: python
+
+    import duckdb
+    from airflow.io.store.path import ObjectStoragePath
+
+    path = ObjectStoragePath("s3://my-bucket/my-table.parquet", conn_id="aws_default")
+    conn = duckdb.connect(database=":memory:")
+    conn.register_filesystem(path.fs)
+    conn.execute(f"CREATE OR REPLACE TABLE my_table AS SELECT * FROM read_parquet('{path}')")