Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/16294#discussion_r93718420
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -763,16 +881,78 @@ returned through `Dataset.writeStream()`. You will
have to specify one or more o
- *Checkpoint location:* For some output sinks where the end-to-end
fault-tolerance can be guaranteed, specify the location where the system will
write all the checkpoint information. This should be a directory in an
HDFS-compatible fault-tolerant file system. The semantics of checkpointing is
discussed in more detail in the next section.
#### Output Modes
-There are two types of output mode currently implemented.
+There are a few types of output modes.
+
+- **Append mode (default)** - This is the default mode, where only the
+new rows added to the Result Table since the last trigger will be
+outputted to the sink. This is supported for only those queries where
+rows added to the Result Table is never going to change. Hence, this mode
+guarantees that each row will be output only once (assuming
+fault-tolerant sink). For example, queries with only `select`,
+`where`, `map`, `flatMap`, `filter`, `join`, etc. will support Append mode.
-- **Append mode (default)** - This is the default mode, where only the new
rows added to the result table since the last trigger will be outputted to the
sink. This is only applicable to queries that *do not have any aggregations*
(e.g. queries with only `select`, `where`, `map`, `flatMap`, `filter`, `join`,
etc.).
+- **Complete mode** - The whole Result Table will be outputted to the sink
after every trigger.
+ This is supported for aggregation queries.
-- **Complete mode** - The whole result table will be outputted to the
sink.This is only applicable to queries that *have aggregations*.
+- **Update mode** - (*not available in Spark 2.1*) Only the rows in the
Result Table that were
+updated since the last trigger will be outputted to the sink.
+More information to be added in future releases.
+
+Different types of streaming queries support different output modes.
+Here is the compatibility matrix.
+
+<table class="table">
+ <tr>
+ <th>Query Type</th>
+ <th></th>
+ <th>Supported Output Modes</th>
+ <th>Notes</th>
+ </tr>
+ <tr>
+ <td colspan="2" valign="middle"><br/>Queries without aggregation</td>
+ <td>Append</td>
+ <td>
+ Complete mode note supported as it is infeasible to keep all data
in the Result Table.
+ </td>
+ </tr>
+ <tr>
+ <td rowspan="2">Queries with aggregation</td>
+ <td>Aggregation on event-time with watermark</td>
+ <td>Append, Complete</td>
+ <td>
+ Append mode uses watermark to drop old aggregation state. But the
output of a
+ windowed aggregation is delayed the late threshold specified in
`withWatermark()` as by
+ the modes semantics, rows can be added to the Result Table only
once after they are
+ finalized (i.e. after watermark is crossed). See
+ <a href="#handling-late-data">Late Data</a> section for more
details.
+ <br/><br/>
+ Complete mode does drop not old aggregation state since by
definition this mode
+ preserves all data in the Result Table.
+ </td>
+ </tr>
+ <tr>
+ <td>Other aggregations</td>
+ <td>Complete</td>
+ <td>
+ Append mode is not supported as aggregates can update thus
violating the semantics of
+ this mode.
+ <br/><br/>
+ Complete mode does drop not old aggregation state since by
definition this mode
+ preserves all data in the Result Table.
+ </td>
+ </tr>
+ <tr>
--- End diff --
same reason as the other place.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]