[jira] [Updated] (SPARK-43152) Parametrisable output metadata path (_spark_metadata)

Wojciech Indyk (Jira) Sat, 15 Apr 2023 11:06:10 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-43152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wojciech Indyk updated SPARK-43152:
-----------------------------------
    Description: 
Currently the path of the output checkpoint metadata is hardcoded: the
metadata is saved in the _spark_metadata folder inside the output path. This
constrains the structure of paths and could easily be relaxed by making the
output metadata path configurable. It would help with issues like [changing
output directory of spark streaming
job|https://kb.databricks.com/en_US/streaming/file-sink-streaming],
[two jobs writing to the same output
path|https://issues.apache.org/jira/browse/SPARK-30542] or [partition
discovery|https://stackoverflow.com/questions/61904732/is-it-possible-to-change-location-of-spark-metadata-folder-in-spark-structured/61905158].
It would also help to separate metadata from data in the path structure.

The main target of the change is the getMetadataLogPath method in
FileStreamSink. It has access to sqlConf, so it can override the default
_spark_metadata path if one is defined in the config. Introducing a
parametrised metadata path requires reconsidering the meaning of the
hasMetadata method in FileStreamSink.

  was:
Currently the path of the output checkpoint metadata is hardcoded: the
metadata is saved in the _spark_metadata folder inside the output path. This
constrains the structure of paths and could easily be relaxed by making the
output metadata path configurable. It would help with issues like [changing
output directory of spark streaming
job|https://kb.databricks.com/en_US/streaming/file-sink-streaming],
two jobs writing to the same output path or [partition
discovery|https://stackoverflow.com/questions/61904732/is-it-possible-to-change-location-of-spark-metadata-folder-in-spark-structured/61905158#61905158].
It would also help to separate metadata from data in the path structure.

The main target of the change is the getMetadataLogPath method in
FileStreamSink. It has access to sqlConf, so it can override the default
_spark_metadata path if one is defined in the config. Introducing a
parametrised metadata path requires reconsidering the meaning of the
hasMetadata method in FileStreamSink.


> Parametrisable output metadata path (_spark_metadata)
> -----------------------------------------------------
>
>                 Key: SPARK-43152
>                 URL: https://issues.apache.org/jira/browse/SPARK-43152
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Wojciech Indyk
>            Priority: Major
>
> Currently the path of the output checkpoint metadata is hardcoded: the
> metadata is saved in the _spark_metadata folder inside the output path. This
> constrains the structure of paths and could easily be relaxed by making the
> output metadata path configurable. It would help with issues like [changing
> output directory of spark streaming
> job|https://kb.databricks.com/en_US/streaming/file-sink-streaming], [two jobs
> writing to the same output
> path|https://issues.apache.org/jira/browse/SPARK-30542] or [partition
> discovery|https://stackoverflow.com/questions/61904732/is-it-possible-to-change-location-of-spark-metadata-folder-in-spark-structured/61905158].
> It would also help to separate metadata from data in the path structure.
> The main target of the change is the getMetadataLogPath method in
> FileStreamSink. It has access to sqlConf, so it can override the default
> _spark_metadata path if one is defined in the config. Introducing a
> parametrised metadata path requires reconsidering the meaning of the
> hasMetadata method in FileStreamSink.
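
The lookup order proposed above could be sketched roughly as follows. This is
a plain-Python model of the idea only, not Spark's FileStreamSink code: the
config key "example.fileSink.metadataPath" is a hypothetical name invented for
the sketch (no such Spark option is known to exist), a dict stands in for
sqlConf, and the path_exists predicate stands in for a file system lookup.

```python
import posixpath

# Default folder name used by the file sink, as stated in the issue.
DEFAULT_METADATA_DIR = "_spark_metadata"

# Hypothetical config key, invented for this sketch; not an existing Spark option.
METADATA_PATH_KEY = "example.fileSink.metadataPath"


def get_metadata_log_path(output_path, conf):
    """Return the configured metadata path, else <output_path>/_spark_metadata."""
    override = conf.get(METADATA_PATH_KEY)
    if override:
        return override
    return posixpath.join(output_path, DEFAULT_METADATA_DIR)


def has_metadata(output_path, conf, path_exists):
    """Look for the metadata log at the resolved location.

    path_exists is a caller-supplied predicate standing in for a file system
    check, so the sketch runs without Spark or Hadoop.
    """
    return path_exists(get_metadata_log_path(output_path, conf))


# Without an override the default location under the output path is used.
print(get_metadata_log_path("/data/out", {}))
# With an override the metadata can live outside the data directory.
print(get_metadata_log_path("/data/out", {METADATA_PATH_KEY: "/meta/job_a"}))
```

The second function illustrates why the meaning of hasMetadata would need
revisiting: once the metadata may live elsewhere, a check limited to the
output path can no longer tell whether a directory was written by a streaming
sink.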



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
