Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/14803#discussion_r80120376
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -512,6 +512,10 @@ csvDF = spark \
These examples generate streaming DataFrames that are untyped, meaning
that the schema of the DataFrame is not checked at compile time, only checked
at runtime when the query is submitted. Some operations like `map`, `flatMap`,
etc. need the type to be known at compile time. To do those, you can convert
these untyped streaming DataFrames to typed streaming Datasets using the same
methods as static DataFrame. See the [SQL Programming
Guide](sql-programming-guide.html) for more details. Additionally, more details
on the supported streaming sources are discussed later in the document.
+### Schema inference and partitioning of streaming DataFrames/Datasets
+
+You can specify the schema for streaming DataFrames/Datasets when creating
them, as shown in the example above (i.e., `userSchema`). Alternatively, for
file-based streaming sources, you can configure Spark to infer the schema. By
default, the configuration for streaming schema inference,
`spark.sql.streaming.schemaInference`, is turned off. If the streaming
DataFrame/Dataset is partitioned, the partition columns will only be inferred
if the partition directories are present when the stream starts. When schema
inference is turned off, for all file-based streaming sources except the `text`
format, you have to include the partition columns in the user-provided schema.
--- End diff ---
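For reference, a minimal Scala sketch of the untyped-to-typed conversion the
quoted context mentions (the `Device` case class and the input path are
assumptions; `userSchema` mirrors the guide's CSV example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Hypothetical record type matching the guide's CSV example
case class Device(device: String, signal: Double)

val spark = SparkSession.builder.appName("TypedStreamSketch").getOrCreate()
import spark.implicits._

val userSchema = new StructType()
  .add("device", StringType)
  .add("signal", DoubleType)

// Untyped streaming DataFrame: the schema is only checked when the query starts
val csvDF = spark.readStream
  .option("sep", ";")
  .schema(userSchema)
  .csv("/path/to/directory")   // assumed input path

// Typed streaming Dataset: map/flatMap now type-check at compile time
val signals = csvDF.as[Device].map(_.signal)
```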
By default, Structured Streaming from file-based sources requires you to
specify the schema, rather than relying on Spark to infer it automatically.
This restriction ensures that a consistent schema will be used for the
streaming query, even in the case of failures. For ad-hoc use cases, you can
re-enable schema inference by setting `spark.sql.streaming.schemaInference` to
`true`, as sketched below.
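Continuing the sketch above (the JSON path is an assumption):

```scala
// Schema inference for file sources is off by default for streaming;
// opt back in for ad-hoc use:
spark.conf.set("spark.sql.streaming.schemaInference", "true")

// With inference enabled, no .schema(...) call is required; Spark infers
// the schema from the files present when the stream starts.
val jsonDF = spark.readStream
  .format("json")
  .load("/path/to/json/dir")   // assumed input path
```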
Partition discovery does occur for subdirectories that are named
`/key=value/`, and listing will automatically recurse into these directories.
If these columns appear in the user-provided schema, they will be filled in by
Spark based on the path of the file being read. The directories that make up
the partitioning scheme must be present when the query starts and must remain
static. For example, it is okay to add `/data/year=2016/` when
`/data/year=2015/` was present, but it is invalid to change the partitioning
column (e.g. by creating the directory `/data/date=2016-04-17/`).
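A sketch of that layout and of a user-provided schema that includes the
partition column (the paths and column names are assumptions):

```scala
import org.apache.spark.sql.types._

// Assumed layout when the query starts:
//   /data/year=2015/...
//   /data/year=2016/...
// Adding /data/year=2017/ later is fine; creating /data/date=2016-04-17/
// would change the partitioning column and is invalid.

// With inference off, include the partition column in the user-provided
// schema; Spark fills `year` from the directory names, not file contents.
val partSchema = new StructType()
  .add("value", StringType)
  .add("year", IntegerType)    // partition column

val partDF = spark.readStream
  .schema(partSchema)
  .json("/data")               // assumed base path
```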