squalud commented on issue #9365: URL: https://github.com/apache/incubator-gluten/issues/9365#issuecomment-2816420307
> It may because we don't enable s3 arrow compile option, https://github.com/apache/incubator-gluten/blob/main/dev/build_arrow.sh#L40 , Can you try -DARROW_S3=ON I find in the arrow build print message
>
> ```
> Project component options:
> --
> --   ARROW_ACERO=OFF [default=OFF]
> --       Build the Arrow Acero Engine Module
> --   ARROW_AZURE=OFF [default=OFF]
> --       Build Arrow with Azure support (requires the Azure SDK for C++)
> --   ARROW_BUILD_UTILITIES=OFF [default=OFF]
> --       Build Arrow commandline utilities
> --   ARROW_COMPUTE=OFF [default=OFF]
> --       Build all Arrow Compute kernels
> --   ARROW_CSV=OFF [default=OFF]
> --       Build the Arrow CSV Parser Module
> --   ARROW_CUDA=OFF [default=OFF]
> --       Build the Arrow CUDA extensions (requires CUDA toolkit)
> --   ARROW_DATASET=OFF [default=OFF]
> --       Build the Arrow Dataset Modules
> --   ARROW_FILESYSTEM=ON [default=OFF]
> --       Build the Arrow Filesystem Layer
> --   ARROW_FLIGHT=OFF [default=OFF]
> --       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
> --   ARROW_FLIGHT_SQL=OFF [default=OFF]
> --       Build the Arrow Flight SQL extension
> --   ARROW_GANDIVA=OFF [default=OFF]
> --       Build the Gandiva libraries
> --   ARROW_GCS=OFF [default=OFF]
> --       Build Arrow with GCS support (requires the GCloud SDK for C++)
> --   ARROW_HDFS=OFF [default=OFF]
> --       Build the Arrow HDFS bridge
> --   ARROW_IPC=ON [default=ON]
> --       Build the Arrow IPC extensions
> --   ARROW_JEMALLOC=OFF [default=ON]
> --       Build the Arrow jemalloc-based allocator
> --   ARROW_JSON=ON [default=OFF]
> --       Build Arrow with JSON support (requires RapidJSON)
> --   ARROW_MIMALLOC=OFF [default=OFF]
> --       Build the Arrow mimalloc-based allocator
> --   ARROW_PARQUET=ON [default=OFF]
> --       Build the Parquet libraries
> --   ARROW_ORC=OFF [default=OFF]
> --       Build the Arrow ORC adapter
> --   ARROW_PYTHON=OFF [default=OFF]
> --       Build some components needed by PyArrow.
> --       (This is a deprecated option. Use CMake presets instead.)
> --   ARROW_S3=OFF [default=OFF]
> --       Build Arrow with S3 support (requires the AWS SDK for C++)
> ```

Yes, I've noticed that. So before building, I changed `build_arrow.sh` and `modify_arrow.patch`, as shown by `git diff`:

```
diff --git a/dev/build_arrow.sh b/dev/build_arrow.sh
index e7496350f..c2d3050b8 100755
--- a/dev/build_arrow.sh
+++ b/dev/build_arrow.sh
@@ -36,6 +36,7 @@ function build_arrow_cpp() {
  pushd $ARROW_PREFIX/cpp
  cmake_install \
+       -DARROW_S3=ON \
        -DARROW_PARQUET=ON \
        -DARROW_FILESYSTEM=ON \
        -DARROW_PROTOBUF_USE_SHARED=OFF \
diff --git a/ep/build-velox/src/modify_arrow.patch b/ep/build-velox/src/modify_arrow.patch
index 7d4d8e557..3c3f34a38 100644
--- a/ep/build-velox/src/modify_arrow.patch
+++ b/ep/build-velox/src/modify_arrow.patch
@@ -104,8 +104,7 @@ index a8328576b..57f282c6c 100644
  -DARROW_JSON=${ARROW_DATASET}
  -DARROW_ORC=${ARROW_ORC}
  -DARROW_PARQUET=${ARROW_PARQUET}
-- -DARROW_S3=ON
-+ -DARROW_S3=OFF
+ -DARROW_S3=ON
 + -DARROW_HDFS=ON
  -DARROW_SUBSTRAIT=${ARROW_DATASET}
  -DARROW_USE_CCACHE=ON
```

And in my build log, I can find these messages:

```
......
+ pushd /workspace/incubator-gluten/dev/../ep/_ep/arrow_ep/cpp
/workspace/incubator-gluten/ep/_ep/arrow_ep/cpp /workspace/incubator-gluten/dev
+ cmake_install -DARROW_S3=ON -DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON -DARROW_PROTOBUF_USE_SHARED=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JEMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE -DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON
......
+ COMPILER_FLAGS='-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 '
+ cmake -Wno-dev -B_build -GNinja -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_CXX_STANDARD=17 '' '' '-DCMAKE_CXX_FLAGS=-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 ' -DBUILD_TESTING=OFF -DARROW_S3=ON -DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON -DARROW_PROTOBUF_USE_SHARED=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JEMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE -DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON
......
-- ---------------------------------------------------------------------
-- Arrow version: 15.0.0
--
-- Build configuration summary:
--   Generator: Ninja
--   Build type: RELEASE
--   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
--   Install prefix: /usr/local
--
-- Compile and link options:
--
--   ARROW_CXXFLAGS="" [default=""]
--       Compiler flags to append when compiling Arrow
--   ARROW_BUILD_STATIC=ON [default=ON
......
--   ARROW_ACERO=OFF [default=OFF]
--       Build the Arrow Acero Engine Module
--   ARROW_AZURE=OFF [default=OFF]
--       Build Arrow with Azure support (requires the Azure SDK for C++)
--   ARROW_BUILD_UTILITIES=OFF [default=OFF]
--       Build Arrow commandline utilities
--   ARROW_COMPUTE=OFF [default=OFF]
--       Build all Arrow Compute kernels
--   ARROW_CSV=OFF [default=OFF]
--       Build the Arrow CSV Parser Module
--   ARROW_CUDA=OFF [default=OFF]
--       Build the Arrow CUDA extensions (requires CUDA toolkit)
--   ARROW_DATASET=OFF [default=OFF]
--       Build the Arrow Dataset Modules
--   ARROW_FILESYSTEM=ON [default=OFF]
--       Build the Arrow Filesystem Layer
--   ARROW_FLIGHT=OFF [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_FLIGHT_SQL=OFF [default=OFF]
--       Build the Arrow Flight SQL extension
--   ARROW_GANDIVA=OFF [default=OFF]
--       Build the Gandiva libraries
--   ARROW_GCS=OFF [default=OFF]
--       Build Arrow with GCS support (requires the GCloud SDK for C++)
--   ARROW_HDFS=OFF [default=OFF]
--       Build the Arrow HDFS bridge
--   ARROW_IPC=ON [default=ON]
--       Build the Arrow IPC extensions
--   ARROW_JEMALLOC=OFF [default=ON]
--       Build the Arrow jemalloc-based allocator
--   ARROW_JSON=ON [default=OFF]
--       Build Arrow with JSON support (requires RapidJSON)
--   ARROW_MIMALLOC=OFF [default=OFF]
--       Build the Arrow mimalloc-based allocator
--   ARROW_PARQUET=ON [default=OFF]
--       Build the Parquet libraries
--   ARROW_ORC=OFF [default=OFF]
--       Build the Arrow ORC adapter
--   ARROW_PYTHON=OFF [default=OFF]
--       Build some components needed by PyArrow.
--       (This is a deprecated option. Use CMake presets instead.)
--   ARROW_S3=ON [default=OFF]
--       Build Arrow with S3 support (requires the AWS SDK for C++)
--   ARROW_SKYHOOK=OFF [default=OFF]
--       Build the Skyhook libraries
--   ARROW_SUBSTRAIT=OFF [default=OFF]
--       Build the Arrow Substrait Consumer Module
--   ARROW_TENSORFLOW=OFF [default=OFF]
--       Build Arrow with TensorFlow support enabled
--   ARROW_TESTING=ON [default=OFF]
--       Build the Arrow testing libraries
......
-- ---------------------------------------------------------------------
-- Arrow version: 15.0.0
--
-- Build configuration summary:
--   Generator: Unix Makefiles
--   Build type: RELEASE
--   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
--   Install prefix: /workspace/incubator-gluten/ep/_ep/arrow_ep/java-dist
--
-- Compile and link options:
--
--   ARROW_CXXFLAGS="" [default=""]
--       Compiler flags to append when compiling Arrow
--   ARROW_BUILD_STATIC=ON [default=ON]
--       Build static libraries
......
-- Project component options:
--
--   ARROW_ACERO=ON [default=OFF]
--       Build the Arrow Acero Engine Module
--   ARROW_AZURE=OFF [default=OFF]
--       Build Arrow with Azure support (requires the Azure SDK for C++)
--   ARROW_BUILD_UTILITIES=OFF [default=OFF]
--       Build Arrow commandline utilities
--   ARROW_COMPUTE=ON [default=OFF]
--       Build all Arrow Compute kernels
--   ARROW_CSV=ON [default=OFF]
--       Build the Arrow CSV Parser Module
--   ARROW_CUDA=OFF [default=OFF]
--       Build the Arrow CUDA extensions (requires CUDA toolkit)
--   ARROW_DATASET=ON [default=OFF]
--       Build the Arrow Dataset Modules
--   ARROW_FILESYSTEM=ON [default=OFF]
--       Build the Arrow Filesystem Layer
--   ARROW_FLIGHT=OFF [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_FLIGHT_SQL=OFF [default=OFF]
--       Build the Arrow Flight SQL extension
--   ARROW_GANDIVA=OFF [default=OFF]
--       Build the Gandiva libraries
--   ARROW_GCS=OFF [default=OFF]
--       Build Arrow with GCS support (requires the GCloud SDK for C++)
--   ARROW_HDFS=ON [default=OFF]
--       Build the Arrow HDFS bridge
--   ARROW_IPC=ON [default=ON]
--       Build the Arrow IPC extensions
--   ARROW_JEMALLOC=ON [default=ON]
--       Build the Arrow jemalloc-based allocator
--   ARROW_JSON=ON [default=OFF]
--       Build Arrow with JSON support (requires RapidJSON)
--   ARROW_MIMALLOC=OFF [default=OFF]
--       Build the Arrow mimalloc-based allocator
--   ARROW_PARQUET=ON [default=OFF]
--       Build the Parquet libraries
--   ARROW_ORC=OFF [default=OFF]
--       Build the Arrow ORC adapter
--   ARROW_PYTHON=OFF [default=OFF]
--       Build some components needed by PyArrow.
--       (This is a deprecated option. Use CMake presets instead.)
--   ARROW_S3=ON [default=OFF]
--       Build Arrow with S3 support (requires the AWS SDK for C++)
--   ARROW_SKYHOOK=OFF [default=OFF]
--       Build the Skyhook libraries
--   ARROW_SUBSTRAIT=ON [default=OFF]
--       Build the Arrow Substrait Consumer Module
--   ARROW_TENSORFLOW=OFF [default=OFF]
--       Build Arrow with TensorFlow support enabled
--   ARROW_TESTING=OFF [default=OFF]
--       Build the Arrow testing libraries
......
```

All the `ARROW_S3` options in the build messages are switched to `ON`.

Additional notes:

1) I run the Spark code in spark-connect-server on K8s; the image is bitnami/spark.

2) After logging in to the Spark executor, the `nm` command shows:

```
$ nm -D /tmp/jnilib-17305424060615380389.tmp |grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
```

3) After extracting the Gluten jar, I got these libs:

```
$ find ./ -name *.so
./linux/amd64/libvelox.so
./linux/amd64/libgluten.so
./x86_64/libarrow_cdata_jni.so
./x86_64/libarrow_dataset_jni.so
$ nm -D x86_64/libarrow_dataset_jni.so |grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
```

It looks like the aws-cpp-sdk-s3 library is not statically linked in? Or do I need to install the related `aws` libs in my Dockerfile manually? Or is there anything else I missed?
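For anyone who wants to repeat the `nm` check across every extracted library, here is a minimal Python sketch. It assumes GNU binutils `nm` is on the `PATH`. The helper names `undefined_symbols` and `scan_library` and the sample symbol `some_defined_symbol` are made up for illustration and are not part of Gluten or Arrow. An undefined (`U`) entry only means the symbol has to be resolved by some other library at load time, so a hit here is a hint rather than proof of a linking problem.

```python
import subprocess
from typing import List


def undefined_symbols(nm_output: str, needle: str) -> List[str]:
    """Return undefined ('U') symbols in `nm -D` output whose name contains `needle`."""
    found = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined entries have no address column, so they split into
        # exactly two fields: the type letter "U" and the symbol name.
        if len(parts) == 2 and parts[0] == "U" and needle in parts[1]:
            found.append(parts[1])
    return found


def scan_library(path: str, needle: str) -> List[str]:
    """Run `nm -D` on one shared library and filter its output."""
    result = subprocess.run(
        ["nm", "-D", path], check=True, capture_output=True, text=True
    )
    return undefined_symbols(result.stdout, needle)


if __name__ == "__main__":
    # Hypothetical sample mimicking `nm -D` output: one undefined AWS symbol
    # (taken from the comment above) and one made-up defined symbol.
    sample = (
        "                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE\n"
        "0000000000001000 T some_defined_symbol\n"
    )
    print(undefined_symbols(sample, "Aws"))
```

Calling `scan_library("x86_64/libarrow_dataset_jni.so", "Aws")` on each `.so` found in the jar would list which libraries still expect AWS SDK symbols from elsewhere.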
