Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22121#discussion_r211986779
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,377 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since the Spark 2.4 release, [Spark
SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides
built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or
`spark-shell` by default.
+
+As with any Spark application, `spark-submit` is used to launch your
application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using
`--packages`, such as,
+
+ ./bin/spark-submit --packages
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add
`org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its
dependencies directly,
+
+ ./bin/spark-shell --packages
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
...
+
+See [Application Submission Guide](submitting-applications.html) for more
details about submitting applications with external dependencies.
+
+## Load and Save Functions
+
+Since the `spark-avro` module is external, there is no `.avro` API in
+`DataFrameReader` or `DataFrameWriter`.
+
+To load/save data in Avro format, you need to specify the data source
option `format` as `avro` (or `org.apache.spark.sql.avro`).
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+
+val usersDF =
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+usersDF.select("name",
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+</div>
+<div data-lang="java" markdown="1">
+{% highlight java %}
+
+Dataset<Row> usersDF =
spark.read().format("avro").load("examples/src/main/resources/users.avro");
+usersDF.select("name",
"favorite_color").write().format("avro").save("namesAndFavColors.avro");
+
+{% endhighlight %}
+</div>
+<div data-lang="python" markdown="1">
+{% highlight python %}
+
+df =
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+df.select("name",
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+</div>
+<div data-lang="r" markdown="1">
+{% highlight r %}
+
+df <- read.df("examples/src/main/resources/users.avro", "avro")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro",
"avro")
+
+{% endhighlight %}
+</div>
+</div>
+
+## to_avro() and from_avro()
+Spark SQL provides the function `to_avro()` to encode a struct as binary data
in Avro format, and `from_avro()` to decode Avro binary data back into a struct.
+
+Using Avro records as columns is useful when reading from or writing to a
streaming source like Kafka. Each
+Kafka key-value record will be augmented with some metadata, such as the
ingestion timestamp into Kafka, the offset in Kafka, etc.
+* If the "value" field that contains your data is in Avro, you could use
`from_avro()` to extract your data, enrich it, clean it, and then push it
downstream to Kafka again or write it out to a file.
+* `to_avro()` can be used to turn structs into Avro records. This method
is particularly useful when you would like to re-encode multiple columns into a
single one when writing data out to Kafka.
+
+Both methods are presently only available in Scala and Java.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import java.nio.file.{Files, Paths}
+
+import org.apache.spark.sql.avro._
+
+// `from_avro` requires Avro schema in JSON string format.
+val jsonFormatSchema = new
String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
+
+val df = spark
+ .readStream
+ .format("kafka")
+ .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+ .option("subscribe", "topic1")
+ .load()
+
+// 1. Decode the Avro data into a struct;
+// 2. Filter by column `favorite_color`;
+// 3. Encode the column `name` in Avro format.
+val output = df
+ .select(from_avro('value, jsonFormatSchema) as 'user)
+ .where("user.favorite_color == \"red\"")
+ .select(to_avro($"user.name") as 'value)
+
+val ds = output
+ .writeStream
+ .format("kafka")
+ .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+ .option("topic", "topic2")
+ .start()
+
+{% endhighlight %}
+</div>
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import java.nio.file.Files;
+import java.nio.file.Paths;
+
+import static org.apache.spark.sql.functions.col;
+import org.apache.spark.sql.avro.*;
+
+// `from_avro` requires Avro schema in JSON string format.
+String jsonFormatSchema = new
String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")));
+
+Dataset<Row> df = spark
+ .readStream()
+ .format("kafka")
+ .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+ .option("subscribe", "topic1")
+ .load();
+
+// 1. Decode the Avro data into a struct;
+// 2. Filter by column `favorite_color`;
+// 3. Encode the column `name` in Avro format.
+Dataset<Row> output = df
+ .select(from_avro(col("value"), jsonFormatSchema).as("user"))
+ .where("user.favorite_color == \"red\"")
+ .select(to_avro(col("user.name")).as("value"));
+
+StreamingQuery ds = output
+ .writeStream()
+ .format("kafka")
+ .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+ .option("topic", "topic2")
+ .start();
+
+{% endhighlight %}
+</div>
+</div>
+
+## Data Source Option
+
+Data source options of Avro can be set using the `.option` method on
`DataFrameReader` or `DataFrameWriter`.
+<table class="table">
+ <tr><th><b>Property
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+ <tr>
+ <td><code>avroSchema</code></td>
+ <td>None</td>
+ <td>Optional Avro schema provided by a user in JSON format.</td>
--- End diff --
We should mention the behavior when the specified schema doesn't match the
real schema.
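For instance, the doc could walk through a case like the sketch below, where the schema passed through `avroSchema` deliberately lists fewer fields than the file may contain. The record name, namespace, and field list are assumptions for illustration and are not taken from this PR. The Spark call is left as a comment because it needs a running session with `spark-avro` on the classpath.

```python
import json

# Hypothetical user-supplied schema: it lists only `name` and
# `favorite_color`, so it may differ from the schema stored in the file.
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}

# The `avroSchema` option takes the schema as a JSON string.
avro_schema_json = json.dumps(user_schema)

# Assumed usage with a SparkSession named `spark` (not executed here):
# df = (spark.read.format("avro")
#       .option("avroSchema", avro_schema_json)
#       .load("examples/src/main/resources/users.avro"))
```

Whatever happens to fields that are missing from, or extra to, the supplied schema is the behavior the table entry should spell out.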
---