Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/14803#discussion_r80120376
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -512,6 +512,10 @@ csvDF = spark \
These examples generate streaming DataFrames that are untyped, meaning
that the schema of the DataFrame is not checked at compile time, only checked
at runtime when the query is submitted. Some operations like `map`, `flatMap`,
etc. need the type to be known at compile time. To do those, you can convert
these untyped streaming DataFrames to typed streaming Datasets using the same
methods as static DataFrame. See the [SQL Programming
Guide](sql-programming-guide.html) for more details. Additionally, more details
on the supported streaming sources are discussed later in the document.
+### Schema inference and partitioning of streaming DataFrames/Datasets
+
+You can specify the schema for streaming DataFrames/Datasets when creating
them, as shown in the example above (i.e., `userSchema`). Alternatively, for
file-based streaming sources, you can configure Spark to infer the schema. By
default, the configuration for streaming schema inference,
`spark.sql.streaming.schemaInference`, is turned off. If the streaming
DataFrame/Dataset is partitioned, the partition columns will only be inferred
if the partition directories are present when the stream starts. When schema
inference is turned off, for all file-based streaming sources except the `text`
format, you have to include the partition columns in the user-provided schema.
--- End diff ---
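For reference, a minimal Scala sketch of the untyped-to-typed conversion the
quoted context mentions (the `Device` case class and the input path are
assumptions; `userSchema` mirrors the guide's CSV example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Hypothetical record type matching the guide's CSV example
case class Device(device: String, signal: Double)

val spark = SparkSession.builder.appName("TypedStreamSketch").getOrCreate()
import spark.implicits._

val userSchema = new StructType()
  .add("device", StringType)
  .add("signal", DoubleType)

// Untyped streaming DataFrame: the schema is only checked when the query starts
val csvDF = spark.readStream
  .option("sep", ";")
  .schema(userSchema)
  .csv("/path/to/directory")   // assumed input path

// Typed streaming Dataset: map/flatMap now type-check at compile time
val signals = csvDF.as[Device].map(_.signal)
```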
By default, Structured Streaming from file-based sources requires you to
specify the schema, rather than relying on Spark to infer it automatically.
This restriction ensures that a consistent schema will be used for the
streaming query, even in the case of failures. For ad-hoc use cases, you can
re-enable schema inference by setting `spark.sql.streaming.schemaInference` to
`true`, as sketched below.
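Continuing the sketch above (the JSON path is an assumption):

```scala
// Schema inference for file sources is off by default for streaming;
// opt back in for ad-hoc use:
spark.conf.set("spark.sql.streaming.schemaInference", "true")

// With inference enabled, no .schema(...) call is required; Spark infers
// the schema from the files present when the stream starts.
val jsonDF = spark.readStream
  .format("json")
  .load("/path/to/json/dir")   // assumed input path
```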
Partition discovery does occur for subdirectories that are named
`/key=value/`, and listing will automatically recurse into these directories.
If these columns appear in the user-provided schema, they will be filled in by
Spark based on the path of the file being read. The directories that make up
the partitioning scheme must be present when the query starts and must remain
static. For example, it is okay to add `/data/year=2016/` when
`/data/year=2015/` was present, but it is invalid to change the partitioning
column (e.g. by creating the directory `/data/date=2016-04-17/`).
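A sketch of that layout and of a user-provided schema that includes the
partition column (the paths and column names are assumptions):

```scala
import org.apache.spark.sql.types._

// Assumed layout when the query starts:
//   /data/year=2015/...
//   /data/year=2016/...
// Adding /data/year=2017/ later is fine; creating /data/date=2016-04-17/
// would change the partitioning column and is invalid.

// With inference off, include the partition column in the user-provided
// schema; Spark fills `year` from the directory names, not file contents.
val partSchema = new StructType()
  .add("value", StringType)
  .add("year", IntegerType)    // partition column

val partDF = spark.readStream
  .schema(partSchema)
  .json("/data")               // assumed base path
```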