[ 
https://issues.apache.org/jira/browse/SPARK-52336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52336:
-----------------------------------
    Labels: Google configuration gcs hadoop-conf pull-request-available  (was: 
Google configuration gcs hadoop-conf)

> Set a default and prepend Spark identifier to GCS user-agent
> ------------------------------------------------------------
>
>                 Key: SPARK-52336
>                 URL: https://issues.apache.org/jira/browse/SPARK-52336
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.5.6, 4.0.1
>            Reporter: Shruti Singhania
>            Priority: Minor
>              Labels: Google, configuration, gcs, hadoop-conf, 
> pull-request-available
>
> *1. Current Behavior:*
> Apache Spark does not currently set a default value for the GCS Hadoop 
> connector configuration {{fs.gs.application.name.suffix}}. Users who want 
> to leverage this GCS connector feature for better traceability of Spark 
> applications in GCS logs and metrics must set it explicitly, either in the 
> Hadoop configuration files ({{core-site.xml}}), via {{spark-submit --conf}}, 
> or programmatically in their Spark application.
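For illustration, the property can be set today in {{core-site.xml}} like this (the suffix value below is a placeholder, not a recommended value):

```xml
<!-- core-site.xml: the value "my-etl-pipeline" is a hypothetical example;
     only the property name comes from the GCS connector's configuration. -->
<property>
  <name>fs.gs.application.name.suffix</name>
  <value>my-etl-pipeline</value>
</property>
```

Equivalently, it can be passed at submit time via Spark's Hadoop-property passthrough, e.g. {{--conf spark.hadoop.fs.gs.application.name.suffix=my-etl-pipeline}}.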
> *2. Problem / Motivation:*
> The {{fs.gs.application.name.suffix}} property is very useful for identifying 
> which application is performing GCS operations, especially in environments 
> where multiple Spark applications (or other Hadoop applications) interact 
> with GCS concurrently.
> Without a default set by Spark when GCS is used:
>  * Many users might be unaware of this beneficial GCS connector feature.
>  * GCS logs and metrics are harder to correlate with specific Spark 
> applications, increasing debugging time and operational overhead.
>  * It introduces an extra configuration step for users who would benefit from 
> this tagging.
> Setting a sensible default when GCS is detected would improve the experience 
> for Spark users on GCS, providing better traceability with no extra 
> configuration effort for the common case.
> *3. Proposed Change:*
> We propose that Spark should automatically set a default value for 
> {{fs.gs.application.name.suffix}} if:
>  # The application is interacting with Google Cloud Storage (i.e., paths with 
> the {{gs://}} scheme are used).
>  # The user has *not* already provided a value for 
> {{fs.gs.application.name.suffix}} in their Hadoop configuration or Spark 
> configuration. User-defined values should always take precedence.
> *4. Implementation Details (Open for Discussion):*
> The implementation modifies {{SparkHadoopUtil}} to automatically prepend a 
> Spark-specific identifier to the Google Cloud Storage (GCS) connector's user 
> agent. The user agent is configured via the {{fs.gs.application.name.suffix}} 
> Hadoop property.
> If a user has already configured a suffix, the Spark identifier is prepended 
> to the existing user-provided value. Otherwise, the Spark identifier is set 
> as the default.
> The Spark identifier is in the format: {{apache_spark/SPARK_VERSION 
> (GPN:apache_spark)}}
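A minimal sketch of the proposed precedence rule, written as a standalone helper. The function names and the space separator between identifier and user suffix are assumptions for illustration; the actual change would live in {{SparkHadoopUtil}}:

```python
from typing import Optional

def spark_identifier(spark_version: str) -> str:
    # Identifier format from the proposal: apache_spark/SPARK_VERSION (GPN:apache_spark)
    return f"apache_spark/{spark_version} (GPN:apache_spark)"

def resolve_suffix(user_value: Optional[str], spark_version: str) -> str:
    # User-defined values are preserved: the Spark identifier is prepended
    # to an existing user-provided suffix (the separator is an assumption
    # here), and is used alone as the default when no suffix was configured.
    ident = spark_identifier(spark_version)
    if user_value:
        return f"{ident} {user_value}"
    return ident
```

The same two-branch logic applies whether the user's value came from Hadoop configuration files or Spark configuration, since both surface through the effective Hadoop {{Configuration}}.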
> *5. Benefits:*
>  * *Improved Traceability:* Easier to identify Spark application interactions 
> in GCS request logs and metrics provided by the GCS connector.
>  * *Enhanced Debugging:* Simplifies pinpointing GCS operations related to 
> specific Spark jobs.
>  * *Better User Experience:* Provides a useful GCS integration feature by 
> default, reducing boilerplate configuration for users.
>  * *Consistency:* Encourages a good practice for applications interacting 
> with GCS.
>  
> *Impact:* This change is expected to be low-impact and beneficial. It adds a 
> configuration property that the GCS connector already understands.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
