Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22121#discussion_r210922729

    --- Diff: docs/avro-data-source-guide.md ---
    @@ -0,0 +1,267 @@
    +---
    +layout: global
    +title: Avro Data Source Guide
    +---
    +
    +Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides support for reading and writing Avro data.
    +
    +## Deploying
    +The <code>spark-avro</code> module is external and not included in `spark-submit` or `spark-shell` by default.
    +
    +As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
    +and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
    +
    +    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
    +
    +    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
    +
    +See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
    +
    +## Examples
    +
    +Since `spark-avro` module is external, there is not such API as <code>.avro</code> in
    +<code>DataFrameReader</code> or <code>DataFrameWriter</code>.
    +To load/save data in Avro format, you need to specify the data source option <code>format</code> as short name <code>avro</code> or full name <code>org.apache.spark.sql.avro</code>.
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +
    +val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +
    +Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
    +usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
    +df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="r" markdown="1">
    +{% highlight r %}
    +
    +df <- read.df("examples/src/main/resources/users.avro", "avro")
    +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
    +
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Configuration
    --- End diff --

    Space after headings like this