This is an automated email from the ASF dual-hosted git repository.
meng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new fbc7942 [SPARK-27472] add user guide for binary file data source
fbc7942 is described below
commit fbc794268340bec868a0abcae3516e4ae3714286
Author: Xiangrui Meng <[email protected]>
AuthorDate: Mon Apr 29 08:58:56 2019 -0700
[SPARK-27472] add user guide for binary file data source
## What changes were proposed in this pull request?
Add user guide for binary file data source.
<img width="826" alt="Screen Shot 2019-04-28 at 10 21 26 PM" src="https://user-images.githubusercontent.com/829644/56877594-0488d300-6a04-11e9-9064-5047dfedd913.png">
Closes #24484 from mengxr/SPARK-27472.
Authored-by: Xiangrui Meng <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
---
docs/sql-data-sources-binaryFile.md | 80 +++++++++++++++++++++++++++++++++++++
docs/sql-data-sources.md | 1 +
2 files changed, 81 insertions(+)
diff --git a/docs/sql-data-sources-binaryFile.md b/docs/sql-data-sources-binaryFile.md
new file mode 100644
index 0000000..d861a24
--- /dev/null
+++ b/docs/sql-data-sources-binaryFile.md
@@ -0,0 +1,80 @@
+---
+layout: global
+title: Binary File Data Source
+displayTitle: Binary File Data Source
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+Since Spark 3.0, Spark supports a binary file data source,
+which reads binary files and converts each file into a single record that contains the raw content
+and metadata of the file.
+It produces a DataFrame with the following columns and possibly partition columns:
+* `path`: StringType
+* `modificationTime`: TimestampType
+* `length`: LongType
+* `content`: BinaryType
+
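+As a quick, hedged sketch of that schema (the input path below is a placeholder and an
+active `spark` session is assumed), loading a directory and printing the schema shows
+the columns listed above:
+
+{% highlight scala %}
+
+// Sketch only: "/path/to/data" stands in for a directory of binary files.
+val df = spark.read.format("binaryFile").load("/path/to/data")
+df.printSchema()
+// Expect path (string), modificationTime (timestamp), length (long),
+// content (binary), plus any discovered partition columns.
+
+{% endhighlight %}
+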
+It supports the following read option:
+<table class="table">
+ <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
+ <tr>
+ <td><code>pathGlobFilter</code></td>
+ <td>none (accepts all)</td>
+ <td>
+ An optional glob pattern to only include files with paths matching the pattern.
+ The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
+ It does not change the behavior of partition discovery.
+ </td>
+ </tr>
+</table>
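+
+Because the glob filters leaf file names and leaves partition discovery untouched, a
+hedged sketch of reading a partitioned layout (the directory structure and the `year`
+partition column below are purely illustrative) looks like this:
+
+{% highlight scala %}
+
+// Illustrative layout:
+//   /path/to/data/year=2019/a.png
+//   /path/to/data/year=2020/b.png
+// The glob only filters file names; the `year` partition column is still discovered.
+val pngs = spark.read.format("binaryFile")
+  .option("pathGlobFilter", "*.png")
+  .load("/path/to/data")
+pngs.select("path", "length", "year").show()
+
+{% endhighlight %}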
+
+To read whole binary files, you need to specify the data source `format` as `binaryFile`.
+For example, the following code reads all PNG files from the input directory:
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+
+spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+
+spark.read().format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data");
+
+{% endhighlight %}
+</div>
+<div data-lang="python" markdown="1">
+{% highlight python %}
+
+spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")
+
+{% endhighlight %}
+</div>
+<div data-lang="r" markdown="1">
+{% highlight r %}
+
+read.df("/path/to/data", source = "binaryFile", pathGlobFilter = "*.png")
+
+{% endhighlight %}
+</div>
+</div>
+
+The binary file data source does not support writing a DataFrame back to the original files.
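+
+As a hedged sketch of what to expect if you try (the exact exception type and message may
+vary by release, and the paths are placeholders):
+
+{% highlight scala %}
+
+import scala.util.Try
+
+// Reading works, but writing back in the same format is expected to fail
+// because the source is read-only.
+val df = spark.read.format("binaryFile").load("/path/to/data")
+val attempt = Try(df.write.format("binaryFile").save("/path/to/output"))
+println(attempt.isFailure)  // expected: true
+
+{% endhighlight %}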
diff --git a/docs/sql-data-sources.md b/docs/sql-data-sources.md
index d908aac..079c540 100644
--- a/docs/sql-data-sources.md
+++ b/docs/sql-data-sources.md
@@ -54,4 +54,5 @@ goes into specific options that are available for the built-in data sources.
* [Compatibility with Databricks spark-avro](sql-data-sources-avro.html#compatibility-with-databricks-spark-avro)
* [Supported types for Avro -> Spark SQL conversion](sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion)
* [Supported types for Spark SQL -> Avro conversion](sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion)
+* [Whole Binary Files](sql-data-sources-binaryFile.html)
* [Troubleshooting](sql-data-sources-troubleshooting.html)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]