HeartSaVioR commented on a change in pull request #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to disable metadata log
URL: https://github.com/apache/spark/pull/24128#discussion_r266378256
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1807,13 +1807,19 @@ Here are the details of all the sinks in Spark.
<td>Append</td>
<td>
<code>path</code>: path to the output directory, must be specified.
+ <br/>
+    <code>disableMetadata</code>: whether to disable metadata log files or not (default: false)
+    Metadata log is growing incrementally while running streaming query which affect query execution time as well as disk space.
+    Disabling metadata log greatly helps on remedying the impact, but it changes fault-tolerance guarantee of FileStreamSink to
Review comment:
http://mail-archives.apache.org/mod_mbox/spark-user/201706.mbox/%3cca+ahukmddrcr4zjvxoaax+9qjlfmhf7xy1qyul3vh2q3zia...@mail.gmail.com%3E
I found a relevant thread on the mailing list where @tdas answered a related question. Quoting it here:
```
> · The Query creates an additional dir “_spark_metadata
> <https://console.aws.amazon.com/s3/>” under the Destination dir and this
> causes the select statements against the Parquet table fail as it is
> expecting only the parquet files under the destination location. Is there a
> config to avoid the creation of this dir?
The _spark_metadata directory hold the metadata information (a
write-ahead-log) of which files in the directory are the correct complete
files generated by the streaming query. This is how we actually get exactly
once guarantees when writing to a file system. So this directory is very
important. In fact, if you want to run queries on this directory, Spark is
aware of this metadata and will read the correct files, and ignore
temporary/incomplete files that may have been generated by failed/duplicate
tasks. Querying that directory from Hive may lead to duplicate data as it
will read those temp/duplicate files as well.
```
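To make the explanation above concrete, here is a minimal sketch (mine, not part of this PR; paths and app name are made up) of how the metadata log is produced by a file sink and then consumed by Spark itself when reading the output back:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-sink-read").getOrCreate()

// Streaming query writing Parquet files; Spark maintains
// <path>/_spark_metadata as a write-ahead log of committed files.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/tmp/file-sink-output")
  .option("checkpointLocation", "/tmp/file-sink-checkpoint")
  .start()

// Batch read of the same path: Spark notices _spark_metadata and lists only
// the files recorded in the log, skipping temp/incomplete files left by
// failed or duplicated tasks. An external reader (e.g. Hive) that lists the
// directory directly may pick up those files and see duplicates.
val committedOnly = spark.read.parquet("/tmp/file-sink-output")
```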
My understanding of the non-metadata commit protocol is that each task writes its files to a staging directory and the driver then moves them from the staging directory to the output directory, which gives "kind-of" exactly-once semantics. But I also think that trick can break in some possible cases, so it is safer to say the guarantee is 'at-least-once'.
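For reference, a minimal sketch of how the option added in this PR would be used, assuming it keeps its current name `disableMetadata` (paths are illustrative). With the metadata log disabled, downstream readers fall back to plain directory listing, which is why the guarantee weakens to at-least-once:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-sink-no-metadata").getOrCreate()
val df = spark.readStream.format("rate").load()

val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/file-sink-output-no-metadata")
  .option("checkpointLocation", "/tmp/file-sink-checkpoint-no-metadata")
  .option("disableMetadata", "true")  // option proposed in this PR (default: false)
  .start()

// Without _spark_metadata, a batch read of the output path behaves like any
// other file listing: it cannot distinguish committed files from leftovers of
// failed/duplicate tasks, hence at-least-once semantics.
```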