This is an automated email from the ASF dual-hosted git repository.
kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 3cb4481386 GH-29847: [C++] Build with Azure SDK for C++ (#36835)
3cb4481386 is described below
commit 3cb4481386876c8004b2ab0b1bae58b658ef6ef0
Author: Thomas Newton <[email protected]>
AuthorDate: Wed Aug 30 09:30:02 2023 +0100
GH-29847: [C++] Build with Azure SDK for C++ (#36835)
### Rationale for this change
We want to use the Azure SDK for C++ to read from and write to Azure Blob Storage.
This is a prerequisite for building an `AzureFileSystem`.
### What changes are included in this PR?
Builds the relevant parts of the Azure SDK as a CMake external project.
Adds a couple of simple tests that just assert that the Azure SDK is working,
and a couple of lines in `AzureFileSystem` that initialise the blob storage
client to ensure the build is working correctly in all environments.
I started with the build setup from
https://github.com/apache/arrow/pull/12914 but made a few changes:
1. Although it's atypical for this project, we chose to switch from CMake's
`ExternalProject` to `FetchContent`. `FetchContent` is recommended by the Azure
docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content.
It also solves a few problems, including automatically linking system curl and
ssl instead of bootstrapping vcpkg and installing curl and ssl from there.
2. Only build one version of the Azure SDK for C++ because it contains all
the components. Previously we were unnecessarily building 5 different versions
of the whole thing on top of each other. This created race conditions for which
version each component came from.
3. We are using `azure-core_1.10.2`, which is a very recent version. There
are two important reasons for this: [an important managed identity
fix](https://github.com/Azure/azure-sdk-for-cpp/issues/4723) and [fixed support
for curl versions < 7.71.0](https://github.com/Azure/azure-sdk-for-cpp/issues/4792).
There will be follow-up PRs to enable Azure in the manylinux builds. We
need to update `vcpkg` first so we can get a version of the Azure SDK that
contains [an important managed identity
fix](https://github.com/Azure/azure-sdk-for-cpp/issues/4723).
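For readers unfamiliar with `FetchContent`, the approach in point 1 can be sketched as a standalone project. This is only a minimal illustration, not the code added by this PR: the project name, the `my_app` target and `main.cc` are hypothetical, while the URL, checksum, variables and `Azure::` targets are the ones that appear in this change. The variables are set the same way this commit sets them; whether each one takes effect depends on the SDK's own CMake configuration.

```cmake
# FetchContent_MakeAvailable requires CMake 3.14 or newer.
cmake_minimum_required(VERSION 3.14)
# Hypothetical project name, for illustration only.
project(azure_fetchcontent_sketch LANGUAGES C CXX)

include(FetchContent)

# Ask the SDK build not to bootstrap vcpkg, so that system curl and ssl are used.
set(ENV{AZURE_SDK_DISABLE_AUTO_VCPKG} TRUE)

# Skip the optional parts of the SDK build and do not treat warnings as errors.
set(BUILD_PERFORMANCE_TESTS FALSE)
set(BUILD_SAMPLES FALSE)
set(BUILD_TESTING FALSE)
set(WARNINGS_AS_ERRORS FALSE)

# A single archive contains every SDK component, despite the azure-core tag name.
FetchContent_Declare(
  azure_sdk
  URL https://github.com/Azure/azure-sdk-for-cpp/archive/azure-core_1.10.2.tar.gz
  URL_HASH SHA256=36557dae87de4cdd257d9b441d9a7f043290eae6666fb1065e0fa486ae3e58a0)
FetchContent_MakeAvailable(azure_sdk)

# Hypothetical consumer target linking two of the SDK components.
add_executable(my_app main.cc)
target_link_libraries(my_app PRIVATE Azure::azure-storage-blobs Azure::azure-identity)
```

Because `FetchContent` adds the SDK to the same configure step, the SDK's targets are available directly and inherit the parent project's global compiler flags, which is why this PR strips `-Werror` before making the content available.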
### Are these changes tested?
Yes. There is a simple test that just runs the Azure client against
Azurite. Additionally, just initialising the client in `AzureFileSystem` goes a
long way towards ensuring the build is working.
### Are there any user-facing changes?
No
* Closes: #29847
Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: shefali singh <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
---
ci/docker/ubuntu-20.04-cpp.dockerfile | 2 +
ci/docker/ubuntu-22.04-cpp.dockerfile | 2 +
ci/scripts/cpp_build.sh | 1 +
cpp/CMakeLists.txt | 5 +++
cpp/cmake_modules/FindAzure.cmake | 45 +++++++++++++++++++
cpp/cmake_modules/ThirdpartyToolchain.cmake | 68 +++++++++++++++++++++++++++++
cpp/src/arrow/filesystem/azurefs.cc | 9 ++++
cpp/src/arrow/filesystem/azurefs_test.cc | 45 ++++++++++++++++---
cpp/thirdparty/versions.txt | 3 ++
9 files changed, 174 insertions(+), 6 deletions(-)
diff --git a/ci/docker/ubuntu-20.04-cpp.dockerfile b/ci/docker/ubuntu-20.04-cpp.dockerfile
index 125f1f48d4..08dda6cf50 100644
--- a/ci/docker/ubuntu-20.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-20.04-cpp.dockerfile
@@ -99,6 +99,7 @@ RUN apt-get update -y -q && \
libssl-dev \
libthrift-dev \
libutf8proc-dev \
+ libxml2-dev \
libzstd-dev \
make \
ninja-build \
@@ -172,6 +173,7 @@ ENV absl_SOURCE=BUNDLED \
ARROW_WITH_ZSTD=ON \
ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-${llvm}/bin/llvm-symbolizer \
AWSSDK_SOURCE=BUNDLED \
+ Azure_SOURCE=BUNDLED \
google_cloud_cpp_storage_SOURCE=BUNDLED \
gRPC_SOURCE=BUNDLED \
GTest_SOURCE=BUNDLED \
diff --git a/ci/docker/ubuntu-22.04-cpp.dockerfile b/ci/docker/ubuntu-22.04-cpp.dockerfile
index 0840b3fa5c..dedeedd979 100644
--- a/ci/docker/ubuntu-22.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-22.04-cpp.dockerfile
@@ -98,6 +98,7 @@ RUN apt-get update -y -q && \
libssl-dev \
libthrift-dev \
libutf8proc-dev \
+ libxml2-dev \
libzstd-dev \
make \
ninja-build \
@@ -196,6 +197,7 @@ ENV absl_SOURCE=BUNDLED \
ARROW_WITH_ZSTD=ON \
ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-${llvm}/bin/llvm-symbolizer \
AWSSDK_SOURCE=BUNDLED \
+ Azure_SOURCE=BUNDLED \
google_cloud_cpp_storage_SOURCE=BUNDLED \
GTest_SOURCE=BUNDLED \
ORC_SOURCE=BUNDLED \
diff --git a/ci/scripts/cpp_build.sh b/ci/scripts/cpp_build.sh
index 5a89fafc60..d73f4ad230 100755
--- a/ci/scripts/cpp_build.sh
+++ b/ci/scripts/cpp_build.sh
@@ -152,6 +152,7 @@ cmake \
-DARROW_WITH_ZLIB=${ARROW_WITH_ZLIB:-OFF} \
-DARROW_WITH_ZSTD=${ARROW_WITH_ZSTD:-OFF} \
-DAWSSDK_SOURCE=${AWSSDK_SOURCE:-} \
+ -DAzure_SOURCE=${Azure_SOURCE:-} \
-Dbenchmark_SOURCE=${benchmark_SOURCE:-} \
-DBOOST_SOURCE=${BOOST_SOURCE:-} \
-DBrotli_SOURCE=${Brotli_SOURCE:-} \
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index fcff62c447..f8e7b1eb27 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -818,6 +818,11 @@ if(ARROW_WITH_OPENTELEMETRY)
list(APPEND ARROW_STATIC_INSTALL_INTERFACE_LIBS CURL::libcurl)
endif()
+if(ARROW_WITH_AZURE_SDK)
+ list(APPEND ARROW_SHARED_LINK_LIBS ${AZURE_SDK_LINK_LIBRARIES})
+ list(APPEND ARROW_STATIC_LINK_LIBS ${AZURE_SDK_LINK_LIBRARIES})
+endif()
+
if(ARROW_WITH_UTF8PROC)
list(APPEND ARROW_SHARED_LINK_LIBS utf8proc::utf8proc)
list(APPEND ARROW_STATIC_LINK_LIBS utf8proc::utf8proc)
diff --git a/cpp/cmake_modules/FindAzure.cmake b/cpp/cmake_modules/FindAzure.cmake
new file mode 100644
index 0000000000..fdf354b724
--- /dev/null
+++ b/cpp/cmake_modules/FindAzure.cmake
@@ -0,0 +1,45 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+if(Azure_FOUND)
+ return()
+endif()
+
+set(find_package_args)
+list(APPEND find_package_args CONFIG)
+if(Azure_FIND_QUIETLY)
+ list(APPEND find_package_args QUIET)
+endif()
+
+if(Azure_FIND_REQUIRED)
+ list(APPEND find_package_args REQUIRED)
+endif()
+
+find_package(azure-core-cpp ${find_package_args})
+find_package(azure-identity-cpp ${find_package_args})
+find_package(azure-storage-blobs-cpp ${find_package_args})
+find_package(azure-storage-common-cpp ${find_package_args})
+find_package(azure-storage-files-datalake-cpp ${find_package_args})
+
+find_package_handle_standard_args(
+ Azure
+ REQUIRED_VARS azure-core-cpp_FOUND
+ azure-identity-cpp_FOUND
+ azure-storage-blobs-cpp_FOUND
+ azure-storage-common-cpp_FOUND
+ azure-storage-files-datalake-cpp_FOUND
+ VERSION_VAR azure-core-cpp_VERSION)
diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake
index 3101c1dc73..5c2e679e10 100644
--- a/cpp/cmake_modules/ThirdpartyToolchain.cmake
+++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -49,6 +49,7 @@ set(ARROW_RE2_LINKAGE
set(ARROW_THIRDPARTY_DEPENDENCIES
absl
AWSSDK
+ Azure
benchmark
Boost
Brotli
@@ -162,6 +163,8 @@ macro(build_dependency DEPENDENCY_NAME)
build_absl()
elseif("${DEPENDENCY_NAME}" STREQUAL "AWSSDK")
build_awssdk()
+ elseif("${DEPENDENCY_NAME}" STREQUAL "Azure")
+ build_azure_sdk()
elseif("${DEPENDENCY_NAME}" STREQUAL "benchmark")
build_benchmark()
elseif("${DEPENDENCY_NAME}" STREQUAL "Boost")
@@ -389,6 +392,10 @@ if(ARROW_GCS)
set(ARROW_WITH_ZLIB ON)
endif()
+if(ARROW_AZURE)
+ set(ARROW_WITH_AZURE_SDK ON)
+endif()
+
if(ARROW_JSON)
set(ARROW_WITH_RAPIDJSON ON)
endif()
@@ -569,6 +576,14 @@ else()
"${THIRDPARTY_MIRROR_URL}/aws-sdk-cpp-${ARROW_AWSSDK_BUILD_VERSION}.tar.gz")
endif()
+if(DEFINED ENV{ARROW_AZURE_SDK_URL})
+ set(ARROW_AZURE_SDK_URL "$ENV{ARROW_AZURE_SDK_URL}")
+else()
+ set_urls(ARROW_AZURE_SDK_URL
+ "https://github.com/Azure/azure-sdk-for-cpp/archive/${ARROW_AZURE_SDK_BUILD_VERSION}.tar.gz"
+ )
+endif()
+
if(DEFINED ENV{ARROW_BOOST_URL})
set(BOOST_SOURCE_URL "$ENV{ARROW_BOOST_URL}")
else()
@@ -981,6 +996,8 @@ else()
set(MAKE_BUILD_ARGS "-j${NPROC}")
endif()
+include(FetchContent)
+
# ----------------------------------------------------------------------
# Find pthreads
@@ -1388,6 +1405,7 @@ endif()
set(ARROW_OPENSSL_REQUIRED_VERSION "1.0.2")
set(ARROW_USE_OPENSSL OFF)
if(PARQUET_REQUIRE_ENCRYPTION
+ OR ARROW_AZURE
OR ARROW_FLIGHT
OR ARROW_GANDIVA
OR ARROW_GCS
@@ -5095,6 +5113,56 @@ if(ARROW_S3)
endif()
endif()
+# ----------------------------------------------------------------------
+# Azure SDK for C++
+
+function(build_azure_sdk)
+ message(STATUS "Building Azure SDK for C++ from source")
+ fetchcontent_declare(azure_sdk
+ URL ${ARROW_AZURE_SDK_URL}
+ URL_HASH "SHA256=${ARROW_AZURE_SDK_BUILD_SHA256_CHECKSUM}")
+ set(BUILD_PERFORMANCE_TESTS FALSE)
+ set(BUILD_SAMPLES FALSE)
+ set(BUILD_TESTING FALSE)
+ set(BUILD_WINDOWS_UWP TRUE)
+ set(CMAKE_EXPORT_NO_PACKAGE_REGISTRY TRUE)
+ set(DISABLE_AZURE_CORE_OPENTELEMETRY TRUE)
+ set(ENV{AZURE_SDK_DISABLE_AUTO_VCPKG} TRUE)
+ set(WARNINGS_AS_ERRORS FALSE)
+ # TODO: Configure flags in a better way. FetchContent builds inherit
+ # global flags but we want to disable -Werror for Azure SDK for C++ builds.
+ if(MSVC)
+ string(REPLACE "/WX" "" CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG}")
+ string(REPLACE "/WX" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")
+ else()
+ string(REPLACE "-Werror" "" CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG}")
+ string(REPLACE "-Werror" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")
+ endif()
+ fetchcontent_makeavailable(azure_sdk)
+ set(AZURE_SDK_VENDORED
+ TRUE
+ PARENT_SCOPE)
+ list(APPEND
+ ARROW_BUNDLED_STATIC_LIBS
+ Azure::azure-core
+ Azure::azure-identity
+ Azure::azure-storage-blobs
+ Azure::azure-storage-common
+ Azure::azure-storage-files-datalake)
+ set(ARROW_BUNDLED_STATIC_LIBS
+ ${ARROW_BUNDLED_STATIC_LIBS}
+ PARENT_SCOPE)
+endfunction()
+
+if(ARROW_WITH_AZURE_SDK)
+ resolve_dependency(Azure REQUIRED_VERSION 1.10.2)
+ set(AZURE_SDK_LINK_LIBRARIES
+ Azure::azure-storage-files-datalake
+ Azure::azure-storage-common
+ Azure::azure-storage-blobs
+ Azure::azure-identity
+ Azure::azure-core)
+endif()
# ----------------------------------------------------------------------
# ucx - communication framework for modern, high-bandwidth and low-latency networks
diff --git a/cpp/src/arrow/filesystem/azurefs.cc b/cpp/src/arrow/filesystem/azurefs.cc
index 0158c0cec7..fcbae332d2 100644
--- a/cpp/src/arrow/filesystem/azurefs.cc
+++ b/cpp/src/arrow/filesystem/azurefs.cc
@@ -17,6 +17,9 @@
#include "arrow/filesystem/azurefs.h"
+#include <azure/identity/default_azure_credential.hpp>
+#include <azure/storage/blobs.hpp>
+
#include "arrow/result.h"
#include "arrow/util/checked_cast.h"
@@ -47,6 +50,12 @@ class AzureFileSystem::Impl {
: io_context_(io_context), options_(std::move(options)) {}
Status Init() {
+ // TODO: GH-18014 Delete this once we have a proper implementation. This just
+ // initializes a pointless Azure blob service client with a fake endpoint to ensure
+ // the build will fail if the Azure SDK build is broken.
+ auto default_credential = std::make_shared<Azure::Identity::DefaultAzureCredential>();
+ auto service_client = Azure::Storage::Blobs::BlobServiceClient(
+ "http://fake-blob-storage-endpoint", default_credential);
if (options_.backend == AzureBackend::Azurite) {
// gen1Client_->GetAccountInfo().Value.IsHierarchicalNamespaceEnabled
// throws error in azurite
diff --git a/cpp/src/arrow/filesystem/azurefs_test.cc b/cpp/src/arrow/filesystem/azurefs_test.cc
index e940c5bd1b..9bf7cb8e75 100644
--- a/cpp/src/arrow/filesystem/azurefs_test.cc
+++ b/cpp/src/arrow/filesystem/azurefs_test.cc
@@ -45,6 +45,12 @@
#include "arrow/testing/gtest_util.h"
#include "arrow/testing/util.h"
+#include <azure/identity/client_secret_credential.hpp>
+#include <azure/identity/default_azure_credential.hpp>
+#include <azure/identity/managed_identity_credential.hpp>
+#include <azure/storage/blobs.hpp>
+#include <azure/storage/common/storage_credential.hpp>
+
namespace arrow {
using internal::TemporaryDir;
namespace fs {
@@ -105,15 +111,42 @@ AzuriteEnv* GetAzuriteEnv() {
return ::arrow::internal::checked_cast<AzuriteEnv*>(azurite_env);
}
-// Placeholder tests for file structure
+// Placeholder tests
// TODO: GH-18014 Remove once a proper test is added
-TEST(AzureFileSystem, InitialiseAzurite) {
+TEST(AzureFileSystem, UploadThenDownload) {
+ const std::string container_name = "sample-container";
+ const std::string blob_name = "sample-blob.txt";
+ const std::string blob_content = "Hello Azure!";
+
const std::string& account_name = GetAzuriteEnv()->account_name();
const std::string& account_key = GetAzuriteEnv()->account_key();
- EXPECT_EQ(account_name, "devstoreaccount1");
- EXPECT_EQ(account_key,
- "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/"
- "K1SZFPTOtr/KBHBeksoGMGw==");
+
+ auto credential = std::make_shared<Azure::Storage::StorageSharedKeyCredential>(
+ account_name, account_key);
+
+ auto service_client = Azure::Storage::Blobs::BlobServiceClient(
+ std::string("http://127.0.0.1:10000/") + account_name, credential);
+ auto container_client = service_client.GetBlobContainerClient(container_name);
+ container_client.CreateIfNotExists();
+ auto blob_client = container_client.GetBlockBlobClient(blob_name);
+
+ std::vector<uint8_t> buffer(blob_content.begin(), blob_content.end());
+ blob_client.UploadFrom(buffer.data(), buffer.size());
+
+ std::vector<uint8_t> downloaded_content(blob_content.size());
+ blob_client.DownloadTo(downloaded_content.data(), downloaded_content.size());
+
+ EXPECT_EQ(std::string(downloaded_content.begin(), downloaded_content.end()),
+ blob_content);
+}
+
+TEST(AzureFileSystem, InitializeCredentials) {
+ auto default_credential = std::make_shared<Azure::Identity::DefaultAzureCredential>();
+ auto managed_identity_credential =
+ std::make_shared<Azure::Identity::ManagedIdentityCredential>();
+ auto service_principal_credential =
+ std::make_shared<Azure::Identity::ClientSecretCredential>("tenant_id", "client_id",
+ "client_secret");
}
TEST(AzureFileSystem, OptionsCompare) {
diff --git a/cpp/thirdparty/versions.txt b/cpp/thirdparty/versions.txt
index 8edaa422b3..52d302592b 100644
--- a/cpp/thirdparty/versions.txt
+++ b/cpp/thirdparty/versions.txt
@@ -53,6 +53,9 @@ ARROW_AWS_LC_BUILD_VERSION=v1.3.0
ARROW_AWS_LC_BUILD_SHA256_CHECKSUM=ae96a3567161552744fc0cae8b4d68ed88b1ec0f3d3c98700070115356da5a37
ARROW_AWSSDK_BUILD_VERSION=1.10.55
ARROW_AWSSDK_BUILD_SHA256_CHECKSUM=2d552fb1a84bef4a9b65e34aa7031851ed2aef5319e02cc6e4cb735c48aa30de
+# Despite the confusing version name this is still the whole Azure SDK for C++ including core, keyvault, storage-common, etc.
+ARROW_AZURE_SDK_BUILD_VERSION=azure-core_1.10.2
+ARROW_AZURE_SDK_BUILD_SHA256_CHECKSUM=36557dae87de4cdd257d9b441d9a7f043290eae6666fb1065e0fa486ae3e58a0
ARROW_BOOST_BUILD_VERSION=1.81.0
ARROW_BOOST_BUILD_SHA256_CHECKSUM=9e0ffae35528c35f90468997bc8d99500bf179cbae355415a89a600c38e13574
ARROW_BROTLI_BUILD_VERSION=v1.0.9