yaooqinn opened a new pull request, #46802:
URL: https://github.com/apache/spark/pull/46802
### What changes were proposed in this pull request?
In this PR, we improve the documentation and usage guide for the History Server
by:
- Identifying and printing **unrecognized options** specified by users
- Obtaining and printing all History Server-related configurations dynamically,
instead of relying on an incomplete, outdated hardcoded list
- Ensuring every configuration is documented in the usage guide
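The two behaviors above can be sketched roughly as follows. This is a minimal, hypothetical Scala sketch: `knownOptions`, `ConfEntry`, and `renderUsage` are illustrative names rather than Spark's actual `HistoryServerArguments` internals, and the registry holds only two abridged entries for demonstration.

```scala
object HistoryHelpSketch {
  // 1. Unrecognized options: any flag not in the known set is collected
  //    and reported, instead of being silently dropped.
  private val knownOptions = Set("--properties-file", "--help", "-h")

  def unrecognized(args: Seq[String]): Seq[String] =
    args.filter(a => a.startsWith("-") && !knownOptions.contains(a))

  // 2. Dynamic listing: entries are enumerated from a single registry, so
  //    the help text cannot drift out of sync with the configurations that
  //    actually exist. (Keys shown are real Spark configs; docs abridged.)
  final case class ConfEntry(key: String, default: String, doc: String)

  val entries: Seq[ConfEntry] = Seq(
    ConfEntry("spark.history.ui.port", "18080",
      "Web UI port to bind Spark History Server"),
    ConfEntry("spark.history.fs.safemodeCheck.interval", "5s",
      "Interval between checks for the HDFS safemode")
  )

  def renderUsage(): String =
    entries.sortBy(_.key).map { e =>
      s"  ${e.key}\n      ${e.doc}\n      (Default: ${e.default})"
    }.mkString("\n")

  def main(args: Array[String]): Unit = {
    val bad = unrecognized(args.toSeq)
    if (bad.nonEmpty) println(s"Unrecognized options: ${bad.mkString(" ")}")
    println("History Server options:")
    println(renderUsage())
  }
}
```

Because the usage text is generated from the registry itself, a newly added configuration shows up in `--help` automatically, with no hardcoded list to forget to update.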
### Why are the changes needed?
- Revise the help guide for the History Server to make it more user-friendly.
Configurations missing from the help guide are not always reachable in our
official documentation either; for example,
`spark.history.fs.safemodeCheck.interval` has been missing from the docs since
it was added in 1.6.
- Misuse, such as passing unrecognized options, should be reported to users
instead of being silently ignored.
### Does this PR introduce _any_ user-facing change?
No. The output format stays as-is; only the number of items printed increases.
### How was this patch tested?
#### Without this PR
```
Usage: ./sbin/start-history-server.sh [options]
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for TERM
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for HUP
24/05/30 15:37:23 INFO SignalUtils: Registering signal handler for INT

Options:
  --properties-file FILE      Path to a custom Spark properties file.
                              Default is conf/spark-defaults.conf.

Configuration options can be set by setting the corresponding JVM system
property.
History Server options are always available; additional options depend on
the provider.

History Server options:

  spark.history.ui.port
      Port where server will listen for connections (default 18080)
  spark.history.acls.enable
      Whether to enable view acls for all applications (default false)
  spark.history.provider
      Name of history provider class (defaults to file system-based provider)
  spark.history.retainedApplications
      Max number of application UIs to keep loaded in memory (default 50)

FsHistoryProvider options:

  spark.history.fs.logDirectory
      Directory where app logs are stored (default: file:/tmp/spark-events)
  spark.history.fs.update.interval
      How often to reload log data from storage (in seconds, default: 10)
```
#### With this PR: error for unrecognized options
```
Unrecognized options: --conf spark.history.ui.port=10000
Usage: HistoryServer [options]

Options:
  --properties-file FILE      Path to a custom Spark properties file.
                              Default is conf/spark-defaults.conf.
```
#### With this PR: help output
```
sbin/start-history-server.sh --help
Usage: ./sbin/start-history-server.sh [options]
{"ts":"2024-05-30T07:15:29.740Z","level":"INFO","msg":"Registering signal handler for TERM","context":{"signal":"TERM"},"logger":"SignalUtils"}
{"ts":"2024-05-30T07:15:29.741Z","level":"INFO","msg":"Registering signal handler for HUP","context":{"signal":"HUP"},"logger":"SignalUtils"}
{"ts":"2024-05-30T07:15:29.741Z","level":"INFO","msg":"Registering signal handler for INT","context":{"signal":"INT"},"logger":"SignalUtils"}

Options:
  --properties-file FILE      Path to a custom Spark properties file.
                              Default is conf/spark-defaults.conf.

Configuration options can be set by setting the corresponding JVM system
property.
History Server options are always available; additional options depend on
the provider.

History Server options:

  spark.history.custom.executor.log.url
      Specifies custom spark executor log url for supporting external log
      service instead of using cluster managers' application log urls in
      the history server. Spark will support some path variables via
      patterns which can vary on cluster manager. Please check the
      documentation for your cluster manager to see which patterns are
      supported, if any. This configuration has no effect on a live
      application, it only affects the history server.
      (Default: <undefined>)
  spark.history.custom.executor.log.url.applyIncompleteApplication
      Whether to apply custom executor log url, as specified by
      spark.history.custom.executor.log.url, to incomplete application as
      well. Even if this is true, this still only affects the behavior of
      the history server, not running spark applications.
      (Default: true)
  spark.history.kerberos.enabled
      Indicates whether the history server should use kerberos to login.
      This is required if the history server is accessing HDFS files on a
      secure Hadoop cluster.
      (Default: false)
  spark.history.kerberos.keytab
      When spark.history.kerberos.enabled=true, specifies location of the
      kerberos keytab file for the History Server.
      (Default: <undefined>)
  spark.history.kerberos.principal
      When spark.history.kerberos.enabled=true, specifies kerberos
      principal name for the History Server.
      (Default: <undefined>)
  spark.history.provider
      Name of the class implementing the application history backend.
      (Default: org.apache.spark.deploy.history.FsHistoryProvider)
  spark.history.retainedApplications
      The number of applications to retain UI data for in the cache. If
      this cap is exceeded, then the oldest applications will be removed
      from the cache. If an application is not in the cache, it will have
      to be loaded from disk if it is accessed from the UI.
      (Default: 50)
  spark.history.store.hybridStore.diskBackend
      Specifies a disk-based store used in hybrid store; ROCKSDB or
      LEVELDB (deprecated).
      (Default: ROCKSDB)
  spark.history.store.hybridStore.enabled
      Whether to use HybridStore as the store when parsing event logs.
      HybridStore will first write data to an in-memory store and having a
      background thread that dumps data to a disk store after the writing
      to in-memory store is completed.
      (Default: false)
  spark.history.store.hybridStore.maxMemoryUsage
      Maximum memory space that can be used to create HybridStore. The
      HybridStore co-uses the heap memory, so the heap memory should be
      increased through the memory option for SHS if the HybridStore is
      enabled.
      (Default: 2g)
  spark.history.store.maxDiskUsage
      Maximum disk usage for the local directory where the cache
      application history information are stored.
      (Default: 10g)
  spark.history.store.path
      Local directory where to cache application history information. By
      default this is not set, meaning all history information will be
      kept in memory.
      (Default: <undefined>)
  spark.history.store.serializer
      Serializer for writing/reading in-memory UI objects to/from
      disk-based KV Store; JSON or PROTOBUF. JSON serializer is the only
      choice before Spark 3.4.0, thus it is the default value. PROTOBUF
      serializer is fast and compact, and it is the default serializer for
      disk-based KV store of live UI.
      (Default: JSON)
  spark.history.ui.acls.enable
      Specifies whether ACLs should be checked to authorize users viewing
      the applications in the history server. If enabled, access control
      checks are performed regardless of what the individual applications
      had set for spark.ui.acls.enable. The application owner will always
      have authorization to view their own application and any users
      specified via spark.ui.view.acls and groups specified via
      spark.ui.view.acls.groups when the application was run will also
      have authorization to view that application. If disabled, no access
      control checks are made for any application UIs available through
      the history server.
      (Default: false)
  spark.history.ui.admin.acls
      Comma separated list of users that have view access to all the
      Spark applications in history server.
      (Default: )
  spark.history.ui.admin.acls.groups
      Comma separated list of groups that have view access to all the
      Spark applications in history server.
      (Default: )
  spark.history.ui.port
      Web UI port to bind Spark History Server
      (Default: 18080)

FsHistoryProvider options:

  spark.history.fs.cleaner.enabled
      Whether the History Server should periodically clean up event logs
      from storage
      (Default: false)
  spark.history.fs.cleaner.interval
      When spark.history.fs.cleaner.enabled=true, specifies how often the
      filesystem job history cleaner checks for files to delete.
      (Default: 1d)
  spark.history.fs.cleaner.maxAge
      When spark.history.fs.cleaner.enabled=true, history files older
      than this will be deleted when the filesystem history cleaner runs.
      (Default: 7d)
```
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]