This is an automated email from the ASF dual-hosted git repository.
kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 1e9f224520 GH-35260: [C++][Python][R] Allow users to adjust S3 log level by environment variable (#38267)
1e9f224520 is described below
commit 1e9f224520ad1ffe836c9bab3c438c3d110c00ea
Author: Bryce Mecum <[email protected]>
AuthorDate: Mon Oct 16 23:31:48 2023 -0800
GH-35260: [C++][Python][R] Allow users to adjust S3 log level by environment variable (#38267)
### Rationale for this change
It's useful when troubleshooting issues with Arrow's S3 filesystem
implementation to raise the log level. Currently, this can only be done from
C++ and Python, but not from R. In addition, the log level can only be set
during S3 initialization, not directly, so the user has to add explicit S3
initialization code to turn on logging and must make sure that code runs
before S3 is initialized elsewhere.
While discussing exposing control of the log level to R, we realized that
allowing the log level to be controlled by an environment variable would be
more intuitive and useful, and would be a good addition for C++, Python, and R alike.
### What changes are included in this PR?
- A new environment variable `ARROW_S3_LOG_LEVEL`, with documentation, for
controlling the S3 log level
- Updated documentation for C++, Python, and R
- A new `InitializeS3()` as a quality-of-life addition for C++ users. Feel
free to ask me to remove this.
No changes are needed directly for Python and R because those
implementations use the internal implicit initializer `EnsureS3Initialized`
rather than the explicit form, `InitializeS3`, and it's the behavior of
`EnsureS3Initialized` that changes here.
### Are these changes tested?
Yes. I added a unit test for the new `S3GlobalOptions::Defaults()` and
tested from Python and R manually. I didn't add a test that the underlying
`AwsInstance` gets set up correctly because that looked like it would
require a refactor and didn't seem worth it.
### Are there any user-facing changes?
Yes. A new way to turn on logging for S3, with matching docs in C++, Python,
and R.
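For example, a user could enable debug logging from Python by setting the variable before any S3 code runs. A minimal sketch (the `pyarrow.fs` usage is commented out so the snippet stays self-contained; it assumes a locally installed pyarrow when uncommented):

```python
import os

# The variable must be set before S3 is initialized, i.e. before any
# code touches pyarrow's S3 support, because EnsureS3Initialized reads
# it only once.
os.environ["ARROW_S3_LOG_LEVEL"] = "DEBUG"

# Typical follow-up (requires pyarrow; shown for illustration):
# from pyarrow import fs
# s3 = fs.S3FileSystem(region="us-east-1")
```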
* Closes: #35260
Lead-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
---
cpp/src/arrow/filesystem/s3fs.cc | 32 +++++++++++++++++++++++++++++++-
cpp/src/arrow/filesystem/s3fs.h | 8 +++++++-
cpp/src/arrow/filesystem/s3fs_test.cc | 26 ++++++++++++++++++++++++++
docs/source/cpp/api/filesystem.rst | 2 ++
docs/source/cpp/env_vars.rst | 22 ++++++++++++++++++++++
docs/source/python/filesystems.rst | 8 ++++++++
r/R/filesystem.R | 22 ++++++++++++++++++++++
r/man/FileSystem.Rd | 8 ++++++++
r/man/s3_bucket.Rd | 13 +++++++++++++
9 files changed, 139 insertions(+), 2 deletions(-)
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 08fbcde6fd..26a1530660 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2987,7 +2987,7 @@ Status InitializeS3(const S3GlobalOptions& options) {
}
Status EnsureS3Initialized() {
- return EnsureAwsInstanceInitialized({S3LogLevel::Fatal}).status();
+ return EnsureAwsInstanceInitialized(S3GlobalOptions::Defaults()).status();
}
Status FinalizeS3() {
@@ -3001,6 +3001,36 @@ bool IsS3Initialized() { return GetAwsInstance()->IsInitialized(); }
bool IsS3Finalized() { return GetAwsInstance()->IsFinalized(); }
+S3GlobalOptions S3GlobalOptions::Defaults() {
+ auto log_level = S3LogLevel::Fatal;
+
+ auto result = arrow::internal::GetEnvVar("ARROW_S3_LOG_LEVEL");
+
+ if (result.ok()) {
+ // Extract, trim, and downcase the value of the environment variable
+ auto value =
+ arrow::internal::AsciiToLower(arrow::internal::TrimString(result.ValueUnsafe()));
+
+ if (value == "fatal") {
+ log_level = S3LogLevel::Fatal;
+ } else if (value == "error") {
+ log_level = S3LogLevel::Error;
+ } else if (value == "warn") {
+ log_level = S3LogLevel::Warn;
+ } else if (value == "info") {
+ log_level = S3LogLevel::Info;
+ } else if (value == "debug") {
+ log_level = S3LogLevel::Debug;
+ } else if (value == "trace") {
+ log_level = S3LogLevel::Trace;
+ } else if (value == "off") {
+ log_level = S3LogLevel::Off;
+ }
+ }
+
+ return S3GlobalOptions{log_level};
+}
+
// -----------------------------------------------------------------------
// Top-level utility functions
diff --git a/cpp/src/arrow/filesystem/s3fs.h b/cpp/src/arrow/filesystem/s3fs.h
index cc870c5abe..9900a9a1c0 100644
--- a/cpp/src/arrow/filesystem/s3fs.h
+++ b/cpp/src/arrow/filesystem/s3fs.h
@@ -332,9 +332,15 @@ struct ARROW_EXPORT S3GlobalOptions {
///
/// For more details see Aws::Crt::Io::EventLoopGroup
int num_event_loop_threads = 1;
+
+ /// \brief Initialize with default options
+ ///
+ /// For log_level, this method first tries to extract a suitable value from the
+ /// environment variable ARROW_S3_LOG_LEVEL.
+ static S3GlobalOptions Defaults();
};
-/// \brief Initialize the S3 APIs.
+/// \brief Initialize the S3 APIs with the specified set of options.
///
/// It is required to call this function at least once before using S3FileSystem.
///
diff --git a/cpp/src/arrow/filesystem/s3fs_test.cc b/cpp/src/arrow/filesystem/s3fs_test.cc
index f88ee7eef9..b789845bd1 100644
--- a/cpp/src/arrow/filesystem/s3fs_test.cc
+++ b/cpp/src/arrow/filesystem/s3fs_test.cc
@@ -1380,5 +1380,31 @@ class TestS3FSGeneric : public S3TestMixin, public GenericFileSystemTest {
GENERIC_FS_TEST_FUNCTIONS(TestS3FSGeneric);
+////////////////////////////////////////////////////////////////////////////
+// S3GlobalOptions::Defaults tests
+
+TEST(S3GlobalOptions, DefaultsLogLevel) {
+ // Verify we get the default value of Fatal
+ ASSERT_EQ(S3LogLevel::Fatal, arrow::fs::S3GlobalOptions::Defaults().log_level);
+
+ // Verify we get the value specified by env var and not the default
+ {
+ EnvVarGuard log_level_guard("ARROW_S3_LOG_LEVEL", "ERROR");
+ ASSERT_EQ(S3LogLevel::Error, arrow::fs::S3GlobalOptions::Defaults().log_level);
+ }
+
+ // Verify we trim and case-insensitively compare the environment variable's value
+ {
+ EnvVarGuard log_level_guard("ARROW_S3_LOG_LEVEL", " eRrOr ");
+ ASSERT_EQ(S3LogLevel::Error, arrow::fs::S3GlobalOptions::Defaults().log_level);
+ }
+
+ // Verify we get the default value of Fatal if our env var is invalid
+ {
+ EnvVarGuard log_level_guard("ARROW_S3_LOG_LEVEL", "invalid");
+ ASSERT_EQ(S3LogLevel::Fatal, arrow::fs::S3GlobalOptions::Defaults().log_level);
+ }
+}
+
} // namespace fs
} // namespace arrow
diff --git a/docs/source/cpp/api/filesystem.rst b/docs/source/cpp/api/filesystem.rst
index 71e102b0f6..8132af42e2 100644
--- a/docs/source/cpp/api/filesystem.rst
+++ b/docs/source/cpp/api/filesystem.rst
@@ -66,6 +66,8 @@ S3 filesystem
.. doxygenclass:: arrow::fs::S3FileSystem
:members:
+.. doxygenfunction:: arrow::fs::InitializeS3(const S3GlobalOptions& options)
+
Hadoop filesystem
-----------------
diff --git a/docs/source/cpp/env_vars.rst b/docs/source/cpp/env_vars.rst
index b4d93c7ead..f116effeb2 100644
--- a/docs/source/cpp/env_vars.rst
+++ b/docs/source/cpp/env_vars.rst
@@ -85,6 +85,28 @@ that changing their value later will have an effect.
``libhdfs.dylib`` on macOS, ``libhdfs.so`` on other platforms).
Alternatively, one can set :envvar:`HADOOP_HOME`.
+.. envvar:: ARROW_S3_LOG_LEVEL
+
+ Controls the verbosity of logging produced by S3 calls. Defaults to ``FATAL``,
+ which only produces output in the case of fatal errors. ``DEBUG`` is recommended
+ when you're trying to troubleshoot issues.
+
+ Possible values include:
+
+ - ``FATAL`` (the default)
+ - ``ERROR``
+ - ``WARN``
+ - ``INFO``
+ - ``DEBUG``
+ - ``TRACE``
+ - ``OFF``
+
+ .. seealso::
+
+ `Logging - AWS SDK For C++
+ <https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/logging.html>`__
+
+
.. envvar:: ARROW_TRACING_BACKEND
The backend where to export `OpenTelemetry <https://opentelemetry.io/>`_-based
diff --git a/docs/source/python/filesystems.rst b/docs/source/python/filesystems.rst
index 3fc10dc771..5309250351 100644
--- a/docs/source/python/filesystems.rst
+++ b/docs/source/python/filesystems.rst
@@ -207,6 +207,14 @@ Here are a couple examples in code::
:func:`pyarrow.fs.resolve_s3_region` for resolving region from a bucket
name.
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+When using :class:`S3FileSystem`, output is only produced for fatal errors or
+when printing return values. For troubleshooting, the log level can be set using
+the environment variable ``ARROW_S3_LOG_LEVEL``. The log level must be set prior
+to running any code that interacts with S3. Possible values include ``FATAL`` (the
+default), ``ERROR``, ``WARN``, ``INFO``, ``DEBUG`` (recommended), ``TRACE``, and ``OFF``.
.. _filesystem-gcs:
diff --git a/r/R/filesystem.R b/r/R/filesystem.R
index eed9e95162..e0f370ad60 100644
--- a/r/R/filesystem.R
+++ b/r/R/filesystem.R
@@ -239,6 +239,14 @@ FileSelector$create <- function(base_dir, allow_not_found = FALSE, recursive = F
#' and no resource tags. To have more control over how buckets are created,
#' use a different API to create them.
#'
+#' On S3FileSystem, output is only produced for fatal errors or when printing
+#' return values. For troubleshooting, the log level can be set using the
+#' environment variable `ARROW_S3_LOG_LEVEL` (e.g.,
+#' `Sys.setenv("ARROW_S3_LOG_LEVEL"="DEBUG")`). The log level must be set prior
+#' to running any code that interacts with S3. Possible values include 'FATAL'
+#' (the default), 'ERROR', 'WARN', 'INFO', 'DEBUG' (recommended), 'TRACE', and
+#' 'OFF'.
+#'
#' @usage NULL
#' @format NULL
#' @docType class
@@ -462,11 +470,25 @@ default_s3_options <- list(
#'
#' @param bucket string S3 bucket name or path
#' @param ... Additional connection options, passed to `S3FileSystem$create()`
+#'
+#' @details By default, \code{\link{s3_bucket}} and other
+#' \code{\link{S3FileSystem}} functions only produce output for fatal errors
+#' or when printing their return values. When troubleshooting problems, it may
+#' be useful to increase the log level. See the Notes section in
+#' \code{\link{S3FileSystem}} for more information or see Examples below.
+#'
#' @return A `SubTreeFileSystem` containing an `S3FileSystem` and the bucket's
#' relative path. Note that this function's success does not guarantee that you
#' are authorized to access the bucket's contents.
#' @examplesIf FALSE
#' bucket <- s3_bucket("voltrondata-labs-datasets")
+#'
+#' @examplesIf FALSE
+#' # Turn on debug logging. The following line of code should be run in a fresh
+#' # R session prior to any calls to `s3_bucket()` (or other S3 functions)
+#' Sys.setenv(ARROW_S3_LOG_LEVEL = "DEBUG")
+#' bucket <- s3_bucket("voltrondata-labs-datasets")
+#'
#' @export
s3_bucket <- function(bucket, ...) {
assert_that(is.string(bucket))
diff --git a/r/man/FileSystem.Rd b/r/man/FileSystem.Rd
index 6ebbecd992..b71d95f423 100644
--- a/r/man/FileSystem.Rd
+++ b/r/man/FileSystem.Rd
@@ -149,5 +149,13 @@ it does not pass any non-default settings. In AWS S3, the bucket and all
objects will be not publicly visible, and will have no bucket policies
and no resource tags. To have more control over how buckets are created,
use a different API to create them.
+
+On S3FileSystem, output is only produced for fatal errors or when printing
+return values. For troubleshooting, the log level can be set using the
+environment variable \code{ARROW_S3_LOG_LEVEL} (e.g.,
+\code{Sys.setenv("ARROW_S3_LOG_LEVEL"="DEBUG")}). The log level must be set prior
+to running any code that interacts with S3. Possible values include 'FATAL'
+(the default), 'ERROR', 'WARN', 'INFO', 'DEBUG' (recommended), 'TRACE', and
+'OFF'.
}
diff --git a/r/man/s3_bucket.Rd b/r/man/s3_bucket.Rd
index 2ab7d4962e..1b30a5cde1 100644
--- a/r/man/s3_bucket.Rd
+++ b/r/man/s3_bucket.Rd
@@ -21,8 +21,21 @@ are authorized to access the bucket's contents.
that automatically detects the bucket's AWS region and holding onto the its
relative path.
}
+\details{
+By default, \code{\link{s3_bucket}} and other
+\code{\link{S3FileSystem}} functions only produce output for fatal errors
+or when printing their return values. When troubleshooting problems, it may
+be useful to increase the log level. See the Notes section in
+\code{\link{S3FileSystem}} for more information or see Examples below.
+}
\examples{
\dontshow{if (FALSE) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
bucket <- s3_bucket("voltrondata-labs-datasets")
\dontshow{\}) # examplesIf}
+\dontshow{if (FALSE) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
+# Turn on debug logging. The following line of code should be run in a fresh
+# R session prior to any calls to `s3_bucket()` (or other S3 functions)
+Sys.setenv(ARROW_S3_LOG_LEVEL = "DEBUG")
+bucket <- s3_bucket("voltrondata-labs-datasets")
+\dontshow{\}) # examplesIf}
}