[ https://issues.apache.org/jira/browse/SPARK-52336 ]
ASF GitHub Bot updated SPARK-52336:
-----------------------------------
    Labels: Google configuration gcs hadoop-conf pull-request-available  (was: Google configuration gcs hadoop-conf)

> Set a default and prepend Spark identifier to GCS user-agent
> ------------------------------------------------------------
>
>                 Key: SPARK-52336
>                 URL: https://issues.apache.org/jira/browse/SPARK-52336
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.5.6, 4.0.1
>            Reporter: Shruti Singhania
>            Priority: Minor
>              Labels: Google, configuration, gcs, hadoop-conf, pull-request-available
>
> *1. Current Behavior:*
> Apache Spark does not currently set a default value for the GCS Hadoop connector configuration {{fs.gs.application.name.suffix}}. Users who want to leverage this GCS connector feature for better traceability of Spark applications in GCS logs and metrics must set it explicitly, either in Hadoop configuration files ({{core-site.xml}}), via {{spark-submit --conf}}, or programmatically in their Spark application.
> *2. Problem / Motivation:*
> The {{fs.gs.application.name.suffix}} property is very useful for identifying which application is performing GCS operations, especially in environments where multiple Spark applications (or other Hadoop applications) interact with GCS concurrently.
> Without a default set by Spark when GCS is used:
> * Many users may be unaware of this beneficial GCS connector feature.
> * GCS logs and metrics are harder to correlate with specific Spark applications, increasing debugging time and operational overhead.
> * It introduces an extra configuration step for users who would benefit from this tagging.
> Setting a sensible default when GCS is detected would improve the experience for Spark users on GCS, providing better traceability with no extra configuration effort in the common case.
> *3. Proposed Change:*
> We propose that Spark automatically set a default value for {{fs.gs.application.name.suffix}} if:
> # The application is interacting with Google Cloud Storage (i.e., paths with the {{gs://}} scheme are used).
> # The user has *not* already provided a value for {{fs.gs.application.name.suffix}} in their Hadoop configuration or Spark configuration. User-defined values should always take precedence.
> *4. Implementation Details (Open for Discussion):*
> The implementation modifies {{SparkHadoopUtil}} to automatically prepend a Spark-specific identifier to the Google Cloud Storage (GCS) connector's user agent. The user agent is configured via the {{fs.gs.application.name.suffix}} Hadoop property.
> If a user has already configured a suffix, the Spark identifier is prepended to the existing user-provided value. Otherwise, the Spark identifier is set as the default.
> The Spark identifier has the format: {{apache_spark/SPARK_VERSION (GPN:apache_spark)}}
> *5. Benefits:*
> * *Improved Traceability:* Easier to identify Spark application interactions in GCS request logs and metrics provided by the GCS connector.
> * *Enhanced Debugging:* Simplifies pinpointing GCS operations related to specific Spark jobs.
> * *Better User Experience:* Provides a useful GCS integration feature by default, reducing boilerplate configuration for users.
> * *Consistency:* Encourages a good practice for applications interacting with GCS.
>
> *Impact:* This change is expected to be low-impact and beneficial. It adds a configuration property that the GCS connector already understands.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
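For illustration, the prepend-or-default behavior described in the implementation-details section can be sketched as plain logic. This is a hypothetical sketch, not the actual {{SparkHadoopUtil}} code (which lives in Spark's Scala codebase); the function and variable names here are made up, and a plain dict stands in for the Hadoop {{Configuration}} object. Only the property key and the identifier format come from the issue text.

```python
# Sketch of the proposed prepend-or-default logic for the GCS user-agent suffix.
# Hypothetical helper names; a dict stands in for Hadoop's Configuration.

GCS_SUFFIX_KEY = "fs.gs.application.name.suffix"  # real GCS connector property


def spark_identifier(spark_version: str) -> str:
    # Identifier format from the proposal: apache_spark/SPARK_VERSION (GPN:apache_spark)
    return f"apache_spark/{spark_version} (GPN:apache_spark)"


def apply_gcs_user_agent(hadoop_conf: dict, spark_version: str) -> dict:
    """Prepend the Spark identifier to a user-provided suffix,
    or install the identifier as the default when no suffix is set."""
    ident = spark_identifier(spark_version)
    existing = hadoop_conf.get(GCS_SUFFIX_KEY)
    if existing:
        # A user-provided value is preserved; the Spark identifier is prepended.
        hadoop_conf[GCS_SUFFIX_KEY] = f"{ident} {existing}"
    else:
        # No user value: the Spark identifier becomes the default suffix.
        hadoop_conf[GCS_SUFFIX_KEY] = ident
    return hadoop_conf
```

This mirrors the precedence rule in the proposal: a user-defined suffix is never discarded, only extended with the Spark identifier.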
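The "set it explicitly" path from the current-behavior section could look like the following PySpark configuration fragment. It is a sketch only: it assumes PySpark and the GCS connector jar are available on the classpath, and the suffix value {{my-team-etl}} is invented for the example. The {{spark.hadoop.}} prefix is Spark's standard mechanism for forwarding a key into the underlying Hadoop configuration.

```python
# Hypothetical PySpark snippet: setting the GCS suffix explicitly today.
# Assumes pyspark and the GCS connector are installed; "my-team-etl" is made up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-example")
    # The "spark.hadoop." prefix forwards this key into the Hadoop Configuration.
    .config("spark.hadoop.fs.gs.application.name.suffix", "my-team-etl")
    .getOrCreate()
)
```

Under the proposed change, a user who sets this would see the Spark identifier prepended to their value rather than replaced.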