[ https://issues.apache.org/jira/browse/SPARK-52336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013916#comment-18013916 ]
Shruti Singhania commented on SPARK-52336:
------------------------------------------

Hi team,

I have opened a pull request to address this issue. You can find the changes here: [https://github.com/apache/spark/pull/52027]

The patch modifies {{SparkHadoopUtil}} to automatically prepend a Spark-specific identifier to the Google Cloud Storage (GCS) connector's user agent, which is configured via the {{fs.gs.application.name.suffix}} Hadoop property. This ensures that requests from Spark applications can be identified, for better telemetry and debugging.

To address the implementation considerations mentioned in the ticket description, the logic is as follows:
* If a user has already configured a value for {{fs.gs.application.name.suffix}}, the Spark identifier is prepended to the existing user-provided value.
* If the property is not set or is empty, the Spark identifier is set as the default value.

I've also added new unit tests in {{SparkHadoopUtilSuite}} to verify this behavior under different scenarios.

> Set a default and prepend Spark identifier to GCS user-agent
> ------------------------------------------------------------
>
>                 Key: SPARK-52336
>                 URL: https://issues.apache.org/jira/browse/SPARK-52336
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.5.6, 4.0.1
>            Reporter: Shruti Singhania
>            Priority: Minor
>              Labels: Google, configuration, gcs, hadoop-conf, pull-request-available
>
> *1. Current Behavior:*
> Apache Spark does not currently set a default value for the GCS Hadoop connector configuration {{fs.gs.application.name.suffix}}. Users who want to leverage this GCS connector feature for better traceability of Spark applications in GCS logs and metrics must set it explicitly: in Hadoop configuration files ({{core-site.xml}}), via {{spark-submit --conf}}, or programmatically in their Spark application.
>
> *2. Problem / Motivation:*
> The {{fs.gs.application.name.suffix}} property is very useful for identifying which application is performing GCS operations, especially in environments where multiple Spark applications (or other Hadoop applications) interact with GCS concurrently.
> Without a default set by Spark when GCS is used:
> * Many users may be unaware of this beneficial GCS connector feature.
> * GCS logs and metrics are harder to correlate with specific Spark applications, increasing debugging time and operational overhead.
> * It introduces an extra configuration step for users who would benefit from this tagging.
> Setting a sensible default when GCS is detected would improve the experience for Spark users on GCS, providing better traceability with no extra configuration effort for the common case.
>
> *3. Proposed Change:*
> We propose that Spark automatically set a default value for {{fs.gs.application.name.suffix}} if:
> # The application is interacting with Google Cloud Storage (i.e., paths with the {{gs://}} scheme are used).
> # The user has *not* already provided a value for {{fs.gs.application.name.suffix}} in their Hadoop configuration or Spark configuration. User-defined values should always take precedence.
>
> *4. Implementation Details (Open for Discussion):*
> The implementation modifies {{SparkHadoopUtil}} to automatically prepend a Spark-specific identifier to the GCS connector's user agent, which is configured via the {{fs.gs.application.name.suffix}} Hadoop property.
> If a user has already configured a suffix, the Spark identifier is prepended to the existing user-provided value. Otherwise, the Spark identifier is set as the default.
> The Spark identifier has the format: {{apache_spark/SPARK_VERSION (GPN:apache_spark)}}
>
> *5. Benefits:*
> * *Improved Traceability:* Easier to identify Spark application interactions in GCS request logs and metrics provided by the GCS connector.
> * *Enhanced Debugging:* Simplifies pinpointing GCS operations related to specific Spark jobs.
> * *Better User Experience:* Provides a useful GCS integration feature by default, reducing boilerplate configuration for users.
> * *Consistency:* Encourages a good practice for applications interacting with GCS.
>
> *Impact:* This change is expected to be low-impact and beneficial. It adds a configuration property that the GCS connector already understands.
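Until such a default exists, the property can be supplied explicitly via one of the mechanisms the ticket lists. A minimal illustration of the {{spark-submit --conf}} route (the application name {{my_app}} and job file {{my_job.py}} are placeholders; the {{spark.hadoop.}} prefix is Spark's standard way of forwarding a setting into the Hadoop configuration):

```shell
# Illustrative only: tag this job's GCS requests with a custom suffix.
# The spark.hadoop. prefix passes fs.gs.application.name.suffix through
# to the Hadoop configuration seen by the GCS connector.
spark-submit \
  --conf spark.hadoop.fs.gs.application.name.suffix="my_app" \
  my_job.py
```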
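The prepend-or-default behavior described in the comment and in section 4 can be sketched as follows. This is a hedged sketch, not the actual patch: it uses a plain {{Map}} in place of a Hadoop {{Configuration}}, the class and method names are hypothetical, and the single-space separator between the Spark identifier and a user-provided suffix is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class GcsSuffixSketch {
    static final String GCS_SUFFIX_KEY = "fs.gs.application.name.suffix";

    // Builds the identifier in the format the ticket describes,
    // e.g. "apache_spark/4.0.1 (GPN:apache_spark)". The version
    // argument stands in for org.apache.spark.SPARK_VERSION.
    static String sparkIdentifier(String sparkVersion) {
        return "apache_spark/" + sparkVersion + " (GPN:apache_spark)";
    }

    // Mimics the described behavior against a plain Map standing in for
    // a Hadoop Configuration: prepend the identifier to a user-provided
    // suffix, or use the identifier alone when the property is unset/empty.
    static void applyGcsSuffix(Map<String, String> conf, String sparkVersion) {
        String id = sparkIdentifier(sparkVersion);
        String existing = conf.get(GCS_SUFFIX_KEY);
        if (existing == null || existing.trim().isEmpty()) {
            conf.put(GCS_SUFFIX_KEY, id);
        } else {
            // Separator is an assumption; the real patch may differ.
            conf.put(GCS_SUFFIX_KEY, id + " " + existing);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        applyGcsSuffix(conf, "4.0.1");
        System.out.println(conf.get(GCS_SUFFIX_KEY));
        // prints: apache_spark/4.0.1 (GPN:apache_spark)

        conf.put(GCS_SUFFIX_KEY, "my_app");
        applyGcsSuffix(conf, "4.0.1");
        System.out.println(conf.get(GCS_SUFFIX_KEY));
        // prints: apache_spark/4.0.1 (GPN:apache_spark) my_app
    }
}
```

Either way, a user-defined value survives (it is kept as the tail of the suffix), which matches the "user-defined values take precedence" consideration in section 3.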