[GitHub] HeartSaVioR commented on a change in pull request #23840: [SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata

GitBox Tue, 19 Feb 2019 19:03:36 -0800

HeartSaVioR commented on a change in pull request #23840: [SPARK-24295][SS] Add 
option to retain only last batch in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840#discussion_r258317342


 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -1812,6 +1817,12 @@ Here are the details of all the sinks in Spark.
         (<a href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
         href="api/R/write.stream.html">R</a>).
         E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>
 +        <br/>
 +        <code>retainOnlyLastBatchInMetadata</code>: whether to retain metadata information only for last succeed batch.
 +        <br/><br/>
 +        This option greatly reduces overhead on compacting metadata files which would be non-trivial when query processes lots of files in each batch.<br/>
 +        NOTE: As it only retains the last batch in metadata, the metadata is not readable from file source: you must set "ignoreFileStreamSinkMetadata" option
 
 Review comment:
   I feel this is not ideal, but given that the file stream sink itself also 
relies on the file log, the log cannot be made entirely optional. If we want to 
avoid leaving a file log in this case, we may need a separate piece of metadata 
(storing minimal information such as the last successful batch ID) and write 
that instead when the option is turned on.
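
To make the alternative concrete, here is a rough sketch of what such minimized metadata could look like. It is illustrative only and not part of this PR or of Spark: the `LastBatchMarker` class, the `lastBatchId` field, and the JSON layout are all hypothetical, and it assumes a local filesystem where a rename replaces the target atomically (HDFS or object stores would need their own commit protocol).

```python
import json
import os
import tempfile
from typing import Optional


class LastBatchMarker:
    """Hypothetical minimal sink metadata: stores only the last successful batch ID."""

    def __init__(self, path: str) -> None:
        self.path = path

    def read(self) -> Optional[int]:
        """Return the last committed batch ID, or None if nothing was committed yet."""
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                return int(json.load(f)["lastBatchId"])
        except FileNotFoundError:
            return None

    def commit(self, batch_id: int) -> bool:
        """Record batch_id as committed; return False if it was already committed."""
        last = self.read()
        if last is not None and batch_id <= last:
            # A replayed batch after a restart: skip it to keep the sink idempotent.
            return False
        directory = os.path.dirname(os.path.abspath(self.path))
        # Write to a temp file in the same directory, then rename over the marker,
        # so a reader never observes a half-written file (local filesystem assumption).
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"lastBatchId": batch_id}, f)
        os.replace(tmp_path, self.path)
        return True
```

With something like this, the sink would only need to check `commit(batch_id)` to decide whether a batch was already written. No per-file entries are kept, so there is nothing to compact, but the file source still could not use it to list the output files.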

