Github user ahmed-mahran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14234#discussion_r71073746
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -410,26 +398,21 @@ see how this model handles event-time based 
processing and late arriving data.
     ## Handling Event-time and Late Data
     Event-time is the time embedded in the data itself. For many applications, 
you may want to operate on this event-time. For example, if you want to get the 
number of events generated by IoT devices every minute, then you probably want 
to use the time when the data was generated (that is, event-time in the data), 
rather than the time Spark receives them. This event-time is very naturally 
expressed in this model -- each event from the devices is a row in the table, 
and event-time is a column value in the row. This allows window-based 
aggregations (e.g. number of events every minute) to be just a special type of 
grouping and aggregation on the event-time column -- each time window is a group 
and each row can belong to multiple windows/groups. Therefore, such 
event-time-window-based aggregation queries can be defined consistently on both 
a static dataset (e.g. from collected device events logs) as well as on a data 
stream, making the life of the user much easier.
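As a rough sketch of what such an event-time window aggregation looks like, assuming a streaming DataFrame `deviceEvents` with `eventTime` and `deviceId` columns (names are illustrative, not from the guide):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// deviceEvents: a streaming DataFrame with an event-time column `eventTime`
// (timestamp) and a `deviceId` column; both names are illustrative.
def windowedCounts(deviceEvents: DataFrame): DataFrame =
  deviceEvents
    .groupBy(
      window(col("eventTime"), "1 minute"), // each 1-minute event-time window is a group
      col("deviceId"))
    .count()
```

Because the window is just another grouping column, the identical code applies to a static DataFrame of collected device logs.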
     
    -Furthermore this model naturally handles data that has arrived later than 
expected based on its event-time. Since Spark is updating the Result Table, it 
has full control over updating/cleaning up the aggregates when there is late 
data. While not yet implemented in Spark 2.0, event-time watermarking will be 
used to manage this data. These are explained later in more detail in the 
[Window Operations](#window-operations-on-event-time) section.
    +Furthermore, this model naturally handles data that has arrived later than 
expected based on its event-time. Since Spark is updating the Result Table, it 
has full control over updating/cleaning up the aggregates when there is late 
data. While not yet implemented in Spark 2.0, event-time watermarking will be 
used to manage this data. These are explained later in more detail in the 
[Window Operations](#window-operations-on-event-time) section.
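For reference, the watermarking mentioned here arrived after Spark 2.0; from Spark 2.1 onwards it is declared on the event-time column roughly as sketched below, reusing the illustrative `deviceEvents` DataFrame from above:

```scala
import org.apache.spark.sql.functions.{col, window}

// Not available in Spark 2.0, as noted above; from Spark 2.1 onwards a watermark
// on the event-time column tells the engine how late data may arrive before the
// corresponding aggregate state can be dropped.
val lateTolerantCounts = deviceEvents
  .withWatermark("eventTime", "10 minutes")  // tolerate data up to 10 minutes late
  .groupBy(window(col("eventTime"), "1 minute"), col("deviceId"))
  .count()
```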
     
     ## Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of the key goals behind 
the design of Structured Streaming. To achieve that, we have designed the 
Structured Streaming sources, the sinks and the execution engine to reliably 
track the exact progress of the processing so that it can handle any kind of 
failure by restarting and/or reprocessing. Every streaming source is assumed to 
have offsets (similar to Kafka offsets, or Kinesis sequence numbers)
     to track the read position in the stream. The engine uses checkpointing 
and write ahead logs to record the offset range of the data being processed in 
each trigger. The streaming sinks are designed to be idempotent for handling 
reprocessing. Together, using replayable sources and idempotent sinks, 
Structured Streaming can ensure **end-to-end exactly-once semantics** under any 
failure.
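Concretely, the checkpointing described here is enabled by pointing the sink at an HDFS-compatible checkpoint directory; a minimal sketch, assuming an aggregated streaming DataFrame `counts` (for example, the windowed counts above) and an arbitrary checkpoint path:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

// counts: an aggregated streaming DataFrame; name and path are illustrative.
def startQuery(counts: DataFrame): StreamingQuery =
  counts.writeStream
    .outputMode(OutputMode.Complete())                               // emit the full result table each trigger
    .option("checkpointLocation", "/tmp/checkpoints/device-counts")  // offsets and state recorded here for recovery
    .format("console")
    .start()
```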
     
     # API using Datasets and DataFrames
    -Since Spark 2.0, DataFrames and Datasets can represent static, bounded 
data, as well as streaming, unbounded data. Similar to static 
Datasets/DataFrames, you can use the common entry point `SparkSession` (
    -[Scala](api/scala/index.html#org.apache.spark.sql.SparkSession)/
    -[Java](api/java/org/apache/spark/sql/SparkSession.html)/
    -[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession) docs) to 
create streaming DataFrames/Datasets from streaming sources, and apply the same 
operations on them as static DataFrames/Datasets. If you are not familiar with 
Datasets/DataFrames, you are strongly advised to familiarize yourself with them 
using the 
    +Since Spark 2.0, DataFrames and Datasets can represent static, bounded 
data, as well as streaming, unbounded data. Similar to static 
Datasets/DataFrames, you can use the common entry point `SparkSession` 
([Scala](api/scala/index.html#org.apache.spark.sql.SparkSession)/[Java](api/java/org/apache/spark/sql/SparkSession.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession)
 docs) to create streaming DataFrames/Datasets from streaming sources, and 
apply the same operations on them as static DataFrames/Datasets. If you are not 
familiar with Datasets/DataFrames, you are strongly advised to familiarize 
yourself with them using the 
     [DataFrame/Dataset Programming Guide](sql-programming-guide.html).
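A minimal sketch of obtaining that entry point (the application name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// The same SparkSession used for static DataFrames/Datasets is the entry point
// for streaming ones as well.
val spark = SparkSession.builder
  .appName("StructuredStreamingExample")  // illustrative application name
  .getOrCreate()

import spark.implicits._  // optional: enables $"col" syntax and Dataset encoders
```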
     
     ## Creating streaming DataFrames and streaming Datasets
     Streaming DataFrames can be created through the `DataStreamReader` interface 
    -([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)/
    -[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/
    -[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs) returned by `SparkSession.readStream()`. Similar to the read interface for creating a static DataFrame, you can specify the details of the source - data format, schema, options, etc. In Spark 2.0, there are a few built-in sources.
    +([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs) returned by `SparkSession.readStream()`. Similar to the read interface for creating a static DataFrame, you can specify the details of the source - data format, schema, options, etc. In Spark 2.0, there are a few built-in sources.
     
    -  - **File sources** - Reads files written in a directory as a stream of 
data. Supported file formats are text, csv, json, parquet. See the docs of the 
DataStreamReader interface for a more up-to-date list, and supported options 
for each file format. Note that the files must be atomically placed in the 
given directory, which in most file systems, can be achieved by file move 
operations.
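A minimal sketch of one such built-in source, reading JSON files as they are atomically moved into an input directory; the path and schema are illustrative, and `spark` is the SparkSession from the earlier sketch:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// A schema is provided up front for the streaming file source.
val eventSchema = StructType(Seq(
  StructField("deviceId", StringType),
  StructField("eventTime", TimestampType)))

// Read JSON files as a stream; files are picked up as they are atomically
// moved into the input directory.
val streamingEvents = spark.readStream
  .schema(eventSchema)
  .format("json")
  .load("/data/device-events")  // illustrative input directory
```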
    --- End diff --
    
    Singular is the convention used around here.

