HeartSaVioR commented on a change in pull request #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to disable metadata log
URL: https://github.com/apache/spark/pull/24128#discussion_r266378256
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1807,13 +1807,19 @@ Here are the details of all the sinks in Spark.
<td>Append</td>
<td>
<code>path</code>: path to the output directory, must be specified.
+ <br/>
+    <code>disableMetadata</code>: whether to disable metadata log files or not (default: false)
+    Metadata log is growing incrementally while running streaming query which affect query execution time as well as disk space.
+    Disabling metadata log greatly helps on remedying the impact, but it changes fault-tolerance guarantee of FileStreamSink to
Review comment:
http://mail-archives.apache.org/mod_mbox/spark-user/201706.mbox/%3cca+ahukmddrcr4zjvxoaax+9qjlfmhf7xy1qyul3vh2q3zia...@mail.gmail.com%3E
I found a relevant thread on the mailing list where @tdas answered a related question. Quoting it here:
```
> · The Query creates an additional dir “_spark_metadata
> <https://console.aws.amazon.com/s3/>” under the Destination dir and this
> causes the select statements against the Parquet table fail as it is
> expecting only the parquet files under the destination location. Is there a
> config to avoid the creation of this dir?
The _spark_metadata directory hold the metadata information (a
write-ahead-log) of which files in the directory are the correct complete
files generated by the streaming query. This is how we actually get exactly
once guarantees when writing to a file system. So this directory is very
important. In fact, if you want to run queries on this directory, Spark is
aware of this metadata and will read the correct files, and ignore
temporary/incomplete files that may have been generated by failed/duplicate
tasks. Querying that directory from Hive may lead to duplicate data as it
will read those temp/duplicate files as well.
```
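To make the explanation above concrete, here is a minimal sketch (mine, not part of this PR; paths and app name are made up) of how the metadata log is produced by a file sink and then consumed by Spark itself when reading the output back:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-sink-read").getOrCreate()

// Streaming query writing Parquet files; Spark maintains
// <path>/_spark_metadata as a write-ahead log of committed files.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/tmp/file-sink-output")
  .option("checkpointLocation", "/tmp/file-sink-checkpoint")
  .start()

// Batch read of the same path: Spark notices _spark_metadata and lists only
// the files recorded in the log, skipping temp/incomplete files left by
// failed or duplicated tasks. An external reader (e.g. Hive) that lists the
// directory directly may pick up those files and see duplicates.
val committedOnly = spark.read.parquet("/tmp/file-sink-output")
```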
My understanding of the non-metadata commit protocol is that each task writes its files to a staging directory and the driver then moves them from the staging directory to the output directory, which gives "kind-of" exactly-once semantics. But I also think that trick can break in some possible cases, so it is safer to say the guarantee is 'at-least-once'.
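For reference, a minimal sketch of how the option added in this PR would be used, assuming it keeps its current name `disableMetadata` (paths are illustrative). With the metadata log disabled, downstream readers fall back to plain directory listing, which is why the guarantee weakens to at-least-once:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-sink-no-metadata").getOrCreate()
val df = spark.readStream.format("rate").load()

val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/file-sink-output-no-metadata")
  .option("checkpointLocation", "/tmp/file-sink-checkpoint-no-metadata")
  .option("disableMetadata", "true")  // option proposed in this PR (default: false)
  .start()

// Without _spark_metadata, a batch read of the output path behaves like any
// other file listing: it cannot distinguish committed files from leftovers of
// failed/duplicate tasks, hence at-least-once semantics.
```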