[
https://issues.apache.org/jira/browse/SPARK-52336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shruti Singhania updated SPARK-52336:
-------------------------------------
Summary: Set a default and prepend Spark identifier to GCS user-agent
(was: Set a default for fs.gs.application.name.suffix when GCS is used and not
user-defined)
> Set a default and prepend Spark identifier to GCS user-agent
> ------------------------------------------------------------
>
> Key: SPARK-52336
> URL: https://issues.apache.org/jira/browse/SPARK-52336
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 3.5.6, 4.0.1
> Reporter: Shruti Singhania
> Priority: Minor
> Labels: Google, configuration, gcs, hadoop-conf
>
> *1. Current Behavior:*
> Apache Spark does not currently set a default value for the GCS Hadoop
> connector configuration {{fs.gs.application.name.suffix}}. Users who want
> to leverage this GCS connector feature for better traceability of Spark
> applications in GCS logs and metrics must set it explicitly, either in the
> Hadoop configuration files ({{core-site.xml}}), via {{spark-submit --conf}},
> or programmatically in their Spark application.
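> The manual setup described above can be sketched as follows; this is
> illustrative only, and the suffix value, application file, and bucket names
> are placeholders. Spark forwards any property with the
> {{spark.hadoop.}} prefix into the Hadoop {{Configuration}}:
> {code}
> # Hypothetical example: a user tagging their job's GCS traffic today.
> # " my-etl-job" and the paths below are arbitrary placeholders.
> spark-submit \
>   --conf "spark.hadoop.fs.gs.application.name.suffix= my-etl-job" \
>   my_app.py gs://my-bucket/input
> {code}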
> *2. Problem / Motivation:*
> The {{fs.gs.application.name.suffix}} property is very useful for identifying
> which application is performing GCS operations, especially in environments
> where multiple Spark applications (or other Hadoop applications) interact
> with GCS concurrently.
> Without a default set by Spark when GCS is used:
> * Many users might be unaware of this beneficial GCS connector feature.
> * GCS logs and metrics are harder to correlate with specific Spark
> applications, increasing debugging time and operational overhead.
> * It introduces an extra configuration step for users who would benefit from
> this tagging.
> Setting a sensible default when GCS is detected would improve the experience
> for Spark users on GCS, providing better traceability with no extra
> configuration effort for the common case.
> *3. Proposed Change:*
> We propose that Spark should automatically set a default value for
> {{fs.gs.application.name.suffix}} if:
> # The application is interacting with Google Cloud Storage (i.e., paths with
> the {{gs://}} scheme are used).
> # The user has *not* already provided a value for
> {{fs.gs.application.name.suffix}} in their Hadoop configuration or Spark
> configuration. User-defined values should always take precedence.
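> The precedence rule above can be sketched as follows; this is illustrative
> only, with a plain dict standing in for the Hadoop {{Configuration}}, a
> hypothetical helper name, and a placeholder default value:
> {code}
> # Hypothetical sketch of the proposed precedence rule: only the key name
> # is real (it belongs to the GCS connector); the default value format and
> # the helper name are placeholders for discussion.
> SUFFIX_KEY = "fs.gs.application.name.suffix"
>
> def apply_default_suffix(hadoop_conf, app_id):
>     # Mirrors Hadoop Configuration.setIfUnset: a user-provided value
>     # always takes precedence over the Spark-supplied default.
>     hadoop_conf.setdefault(SUFFIX_KEY, f" spark-{app_id}")
>     return hadoop_conf
> {code}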
> *4. Implementation Considerations (Open for Discussion):*
> * *Detection of GCS Usage:* Spark would need to detect when a {{FileSystem}}
> for the {{gs://}} scheme is being initialized or used. This might be done in
> {{HadoopFSUtils}} or a similar place where the Hadoop {{Configuration}}
> object is prepared for file system interactions.
> * *Precedence:* The logic must ensure that this default is only applied if
> {{fs.gs.application.name.suffix}} (and potentially {{fs.gs.application.name}}
> if the suffix is intended to be appended to it by the connector) is not
> already present in the Hadoop {{Configuration}} being used.
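> The detection step above could look like the following sketch; the helper
> name is hypothetical, and the actual check in Spark would operate on the
> Hadoop {{Path}}/URI rather than a raw string:
> {code}
> # Hypothetical sketch of GCS detection by URI scheme; `is_gcs_path` is a
> # placeholder name, not an existing Spark or Hadoop API.
> from urllib.parse import urlparse
>
> def is_gcs_path(path: str) -> bool:
>     # A path targets Google Cloud Storage when it uses the gs:// scheme.
>     return urlparse(path).scheme == "gs"
> {code}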
> *5. Benefits:*
> * *Improved Traceability:* Easier to identify Spark application interactions
> in GCS request logs and metrics provided by the GCS connector.
> * *Enhanced Debugging:* Simplifies pinpointing GCS operations related to
> specific Spark jobs.
> * *Better User Experience:* Provides a useful GCS integration feature by
> default, reducing boilerplate configuration for users.
> * *Consistency:* Encourages a good practice for applications interacting
> with GCS.
>
> *Impact:* This change is expected to be low-impact and beneficial. It adds a
> configuration property that the GCS connector already understands.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)