(incubator-gluten) branch main updated: [VL][DOC] Add ABFS doc (#5479)

philo Wed, 24 Apr 2024 18:35:13 -0700

This is an automated email from the ASF dual-hosted git repository.

philo pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-gluten.git



The following commit(s) were added to refs/heads/main by this push:
     new bdc4998de [VL][DOC] Add ABFS doc (#5479)
bdc4998de is described below

commit bdc4998de7990812131a0f622f9fa206b169d203
Author: Ankita Victor <[email protected]>
AuthorDate: Thu Apr 25 07:04:54 2024 +0530

    [VL][DOC] Add ABFS doc (#5479)
---
 docs/get-started/Velox.md     | 13 +++++++++++++
 docs/get-started/VeloxABFS.md | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/docs/get-started/Velox.md b/docs/get-started/Velox.md
index 7e7a0bef6..c97cfe6fd 100644
--- a/docs/get-started/Velox.md
+++ b/docs/get-started/Velox.md
@@ -191,6 +191,19 @@ Here are two steps to enable kerberos.
 
 The ticket cache file can be found by `klist`.
 
+## Azure Blob File System (ABFS) support
+
+Velox supports ABFS with the open source [Azure SDK for C++](https://github.com/Azure/azure-sdk-for-cpp) and Gluten uses the Velox ABFS connector to connect with ABFS.
+The build option for ABFS (`enable_abfs`) must be set to enable this feature, as shown below.
+
+```
+cd /path/to/gluten
+./dev/buildbundle-veloxbe.sh --enable_abfs=ON
+```
+
+Please refer to the [Velox ABFS](VeloxABFS.md) page for more detailed configuration.
+
+
 ## AWS S3 support
 
 Velox supports S3 with the open source [AWS C++ SDK](https://github.com/aws/aws-sdk-cpp) and Gluten uses Velox S3 connector to connect with S3.
diff --git a/docs/get-started/VeloxABFS.md b/docs/get-started/VeloxABFS.md
new file mode 100644
index 000000000..9bb9c8332
--- /dev/null
+++ b/docs/get-started/VeloxABFS.md
@@ -0,0 +1,35 @@
+---
+layout: page
+title: Using ABFS with Gluten
+nav_order: 6
+parent: Getting-Started
+---
+ABFS is an important data store for big data users. This doc discusses configuration details and use cases of Gluten with ABFS. To use an ABFS account as your data source, set the ABFS configuration listed below in your spark-defaults.conf. If you would like to authenticate with ABFS using additional auth mechanisms, please reach out using the 'Issues' tab.
+
+# Working with ABFS
+
+## Configuring ABFS Access Token
+
+To configure access to your storage account, set the property below, replacing <storage-account> with the name of your account. This property aligns with the Spark configuration of the same name. By setting it multiple times with different storage account names, you can access multiple ABFS accounts.
+
+```sh
+spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net  XXXXXXXXX
+```
+
+### Other authentication methods are not yet supported.
+
+# Local Caching support
+
+Velox supports a local cache when reading data from HDFS/S3/ABFS. With this feature, Velox can asynchronously cache data on local disk while reading from remote storage, and future read requests for previously cached blocks are served from the local cache files. To enable the local caching feature, the following configurations are required:
+
+```
+spark.gluten.sql.columnar.backend.velox.cacheEnabled      // enable or disable velox cache, default false.
+spark.gluten.sql.columnar.backend.velox.memCacheSize      // the total size of in-mem cache, default is 128MB.
+spark.gluten.sql.columnar.backend.velox.ssdCachePath      // the folder to store the cache files, default is "/tmp".
+spark.gluten.sql.columnar.backend.velox.ssdCacheSize      // the total size of the SSD cache, default is 128MB. Velox will do in-mem cache only if this value is 0.
+spark.gluten.sql.columnar.backend.velox.ssdCacheShards    // the shards of the SSD cache, default is 1.
+spark.gluten.sql.columnar.backend.velox.ssdCacheIOThreads // the IO threads for cache promoting, default is 1. Velox will try to do "read-ahead" if this value is bigger than 1.
+spark.gluten.sql.columnar.backend.velox.ssdODirect        // enable or disable O_DIRECT on cache write, default false.
+```
+
+It's recommended to mount SSDs to the cache path to get the best performance from local caching. Cache files are written to the folder set by "spark.gluten.sql.columnar.backend.velox.ssdCachePath", with a UUID-based suffix, e.g. "/tmp/cache.13e8ab65-3af4-46ac-8d28-ff99b2a9ec9b0". Gluten cannot reuse older caches for now, and the old cache files are left behind after the Spark context shuts down.
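
As a reading aid for the two configuration blocks in the new VeloxABFS.md (this is not part of the commit), the following Python sketch renders the corresponding spark-defaults.conf lines. The helper names `abfs_account_key_conf` and `velox_cache_conf` are hypothetical, as are the sample account names and values; only the property names come from the doc above. The doc does not say which value format the size settings accept, so the sketch passes sizes through unchanged rather than guessing.

```python
from typing import Dict, List

# Property prefix taken from the cache settings listed in the doc above.
VELOX_PREFIX = "spark.gluten.sql.columnar.backend.velox"


def abfs_account_key_conf(account_keys: Dict[str, str]) -> List[str]:
    """Render one spark-defaults.conf line per ABFS storage account.

    Hypothetical helper: the property is repeated once per account,
    which is how the doc says multiple accounts are configured.
    """
    lines = []
    for account, key in sorted(account_keys.items()):
        prop = f"spark.hadoop.fs.azure.account.key.{account}.dfs.core.windows.net"
        lines.append(f"{prop} {key}")
    return lines


def velox_cache_conf(ssd_cache_path: str,
                     ssd_cache_size: str,
                     mem_cache_size: str) -> List[str]:
    """Render the lines that turn on the Velox local cache.

    Hypothetical helper: sizes are opaque strings because the accepted
    value format is an open question in the doc.
    """
    return [
        f"{VELOX_PREFIX}.cacheEnabled true",
        f"{VELOX_PREFIX}.memCacheSize {mem_cache_size}",
        f"{VELOX_PREFIX}.ssdCachePath {ssd_cache_path}",
        f"{VELOX_PREFIX}.ssdCacheSize {ssd_cache_size}",
    ]


if __name__ == "__main__":
    # Placeholder account names, keys, and sizes, for illustration only.
    conf = abfs_account_key_conf({"accountb": "KEY_B", "accounta": "KEY_A"})
    conf += velox_cache_conf("/tmp", "<ssd-size>", "<mem-size>")
    print("\n".join(conf))
```

The remaining cache settings (shards, IO threads, O_DIRECT) are left at their defaults here; they could be appended in the same way.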

