HeartSaVioR commented on a change in pull request #23840: [SPARK-24295][SS] Add
option to retain only last batch in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840#discussion_r258317342
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1812,6 +1817,12 @@ Here are the details of all the sinks in Spark.
(<a
href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a
href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a
href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
href="api/R/write.stream.html">R</a>).
E.g. for "parquet" format options see
<code>DataFrameWriter.parquet()</code>
+ <br/>
+ <code>retainOnlyLastBatchInMetadata</code>: whether to retain metadata
information only for last succeed batch.
+ <br/><br/>
+ This option greatly reduces overhead on compacting metadata files
which would be non-trivial when query processes lots of files in each
batch.<br/>
+ NOTE: As it only retains the last batch in metadata, the metadata is
not readable from file source: you must set "ignoreFileStreamSinkMetadata"
option
Review comment:
I feel this is not ideal, but given that the file stream sink itself also relies
on the file log, the log cannot be made entirely optional. If we want to avoid
leaving a file log in this case, we may need a separate kind of metadata (one
that stores minimal information, such as the last successful batch id) and
write that instead when the option is turned on.
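
To make the "minimized metadata" idea concrete, here is a rough sketch in plain Python rather than Spark code. Everything in it is an assumption for illustration: the `LastBatchMarker` class, the `lastBatchId` field, and the single-file layout are hypothetical and are not part of this PR or of Spark's file sink log. It only shows that keeping the last successful batch id is enough for the sink to skip batches it has already committed.

```python
import json
import os
import tempfile


class LastBatchMarker:
    """Hypothetical minimal sink metadata: stores only the last successful batch id."""

    def __init__(self, path):
        self.path = path

    def commit(self, batch_id):
        # Write to a temp file in the same directory, then rename it over the
        # marker, so a reader never observes a partially written file.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"lastBatchId": batch_id}, f)
        os.replace(tmp_path, self.path)

    def last_committed(self):
        # Returns None when no batch has been committed yet.
        try:
            with open(self.path) as f:
                return json.load(f)["lastBatchId"]
        except FileNotFoundError:
            return None

    def should_skip(self, batch_id):
        # A batch at or below the last committed id was already written.
        last = self.last_committed()
        return last is not None and batch_id <= last
```

With something like this, the sink would call `should_skip(batchId)` before writing and `commit(batchId)` after a successful write. The marker never grows, so there is nothing to compact, but it also carries no file list, which is why the file source could not read such metadata.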