This is an automated email from the ASF dual-hosted git repository.
philo pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-gluten.git
The following commit(s) were added to refs/heads/main by this push:
new bdc4998de [VL][DOC] Add ABFS doc (#5479)
bdc4998de is described below
commit bdc4998de7990812131a0f622f9fa206b169d203
Author: Ankita Victor <[email protected]>
AuthorDate: Thu Apr 25 07:04:54 2024 +0530
[VL][DOC] Add ABFS doc (#5479)
---
docs/get-started/Velox.md | 13 +++++++++++++
docs/get-started/VeloxABFS.md | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)
diff --git a/docs/get-started/Velox.md b/docs/get-started/Velox.md
index 7e7a0bef6..c97cfe6fd 100644
--- a/docs/get-started/Velox.md
+++ b/docs/get-started/Velox.md
@@ -191,6 +191,19 @@ Here are two steps to enable kerberos.
The ticket cache file can be found by `klist`.
+## Azure Blob File System (ABFS) support
+
+Velox supports ABFS with the open source [Azure SDK for C++](https://github.com/Azure/azure-sdk-for-cpp) and Gluten uses the Velox ABFS connector to connect with ABFS.
+The build option for ABFS (enable_abfs) must be set to enable this feature as listed below.
+
+```
+cd /path/to/gluten
+./dev/buildbundle-veloxbe.sh --enable_abfs=ON
+```
+
+Please refer to the [Velox ABFS](VeloxABFS.md) page for more detailed configuration.
+
+
## AWS S3 support
Velox supports S3 with the open source [AWS C++
SDK](https://github.com/aws/aws-sdk-cpp) and Gluten uses Velox S3 connector to
connect with S3.
diff --git a/docs/get-started/VeloxABFS.md b/docs/get-started/VeloxABFS.md
new file mode 100644
index 000000000..9bb9c8332
--- /dev/null
+++ b/docs/get-started/VeloxABFS.md
@@ -0,0 +1,35 @@
+---
+layout: page
+title: Using ABFS with Gluten
+nav_order: 6
+parent: Getting-Started
+---
+ABFS is an important data store for big data users. This doc discusses config details and use cases of Gluten with ABFS. To use an ABFS account as your data source, add the ABFS config listed below to your spark-defaults.conf. If you would like to authenticate with ABFS using additional auth mechanisms, please reach out using the 'Issues' tab.
+
+# Working with ABFS
+
+## Configuring ABFS Access Token
+
+To configure access to your storage account, set the property below, replacing <storage-account> with the name of your account. This property aligns with the corresponding Spark configuration. To access multiple ABFS accounts, set the property once for each storage account name.
+
+```sh
+spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net XXXXXXXXX
+```
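As an illustration of the multi-account case described above, the following is a minimal sketch that renders one spark-defaults.conf line per storage account. The helper name `abfs_account_key_conf`, the account names, and the key values are hypothetical placeholders; only the property name pattern comes from this doc.

```python
def abfs_account_key_conf(account_keys):
    """Render one spark-defaults.conf line per ABFS storage account.

    account_keys maps a storage account name to its key (placeholders here).
    """
    lines = []
    for account, key in sorted(account_keys.items()):
        # Property name pattern taken from the doc; <storage-account> is substituted.
        prop = f"spark.hadoop.fs.azure.account.key.{account}.dfs.core.windows.net"
        lines.append(f"{prop} {key}")
    return lines


if __name__ == "__main__":
    # Hypothetical account names and keys, for illustration only.
    for line in abfs_account_key_conf({"account1": "XXXXXXXXX", "account2": "YYYYYYYYY"}):
        print(line)
```

The printed lines could then be appended to spark-defaults.conf, one per account.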
+
+### Other authentication methods are not yet supported.
+
+# Local Caching support
+
+Velox supports a local cache when reading data from HDFS/S3/ABFS. With this feature, Velox can asynchronously cache data on local disk when reading from remote storage, and future read requests for previously cached blocks will be served from the local cache files. To enable the local caching feature, the following configurations are required:
+
+```
+spark.gluten.sql.columnar.backend.velox.cacheEnabled // enable or disable velox cache, default false.
+spark.gluten.sql.columnar.backend.velox.memCacheSize // the total size of in-mem cache, default is 128MB.
+spark.gluten.sql.columnar.backend.velox.ssdCachePath // the folder to store the cache files, default is "/tmp".
+spark.gluten.sql.columnar.backend.velox.ssdCacheSize // the total size of the SSD cache, default is 128MB. Velox will do in-mem cache only if this value is 0.
+spark.gluten.sql.columnar.backend.velox.ssdCacheShards // the shards of the SSD cache, default is 1.
+spark.gluten.sql.columnar.backend.velox.ssdCacheIOThreads // the IO threads for cache promoting, default is 1. Velox will try to do "read-ahead" if this value is bigger than 1
+spark.gluten.sql.columnar.backend.velox.ssdODirect // enable or disable O_DIRECT on cache write, default false.
+```
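To show how the options listed above might be combined, here is a minimal sketch that turns a small dictionary into spark-defaults.conf lines and rejects option names that are not in the list. The helper name `velox_cache_conf` and the path `/mnt/ssd/gluten-cache` are hypothetical. The size options are left at their defaults because the expected value format is not stated in this doc.

```python
PREFIX = "spark.gluten.sql.columnar.backend.velox."

# Option names copied from the list in this doc.
KNOWN_OPTIONS = (
    "cacheEnabled",
    "memCacheSize",
    "ssdCachePath",
    "ssdCacheSize",
    "ssdCacheShards",
    "ssdCacheIOThreads",
    "ssdODirect",
)


def velox_cache_conf(options):
    """Render spark-defaults.conf lines for the given cache options."""
    unknown = sorted(set(options) - set(KNOWN_OPTIONS))
    if unknown:
        raise ValueError(f"unknown cache option(s): {', '.join(unknown)}")
    # Emit lines in the same order as the documented list.
    return [f"{PREFIX}{name} {options[name]}" for name in KNOWN_OPTIONS if name in options]


if __name__ == "__main__":
    # Hypothetical SSD mount point, for illustration only.
    for line in velox_cache_conf({"cacheEnabled": "true", "ssdCachePath": "/mnt/ssd/gluten-cache"}):
        print(line)
```

Validating names before writing the file is one way to catch a misspelled option early, since Spark itself accepts arbitrary property names.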
+
+It's recommended to mount SSDs to the cache path to get the best local caching performance. Cache files will be written to the folder set by "spark.gluten.sql.columnar.backend.velox.ssdCachePath", with a UUID-based suffix, e.g. "/tmp/cache.13e8ab65-3af4-46ac-8d28-ff99b2a9ec9b0". Gluten cannot reuse older caches for now, and the old cache files are left behind after the Spark context shuts down.