This is an automated email from the ASF dual-hosted git repository.
meng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new fbc7942 [SPARK-27472] add user guide for binary file data source
fbc7942 is described below
commit fbc794268340bec868a0abcae3516e4ae3714286
Author: Xiangrui Meng <[email protected]>
AuthorDate: Mon Apr 29 08:58:56 2019 -0700
[SPARK-27472] add user guide for binary file data source
## What changes were proposed in this pull request?
Add user guide for binary file data source.
<img width="826" alt="Screen Shot 2019-04-28 at 10 21 26 PM" src="https://user-images.githubusercontent.com/829644/56877594-0488d300-6a04-11e9-9064-5047dfedd913.png">
Closes #24484 from mengxr/SPARK-27472.
Authored-by: Xiangrui Meng <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
---
docs/sql-data-sources-binaryFile.md | 80 +++++++++++++++++++++++++++++++++++++
docs/sql-data-sources.md | 1 +
2 files changed, 81 insertions(+)
diff --git a/docs/sql-data-sources-binaryFile.md b/docs/sql-data-sources-binaryFile.md
new file mode 100644
index 0000000..d861a24
--- /dev/null
+++ b/docs/sql-data-sources-binaryFile.md
@@ -0,0 +1,80 @@
+---
+layout: global
+title: Binary File Data Source
+displayTitle: Binary File Data Source
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+Since Spark 3.0, Spark supports a binary file data source,
+which reads binary files and converts each file into a single record that contains the raw content
+and metadata of the file.
+It produces a DataFrame with the following columns and possibly partition columns:
+* `path`: StringType
+* `modificationTime`: TimestampType
+* `length`: LongType
+* `content`: BinaryType
+
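+As a quick, hedged sketch of that schema (the input path below is a placeholder and an
+active `spark` session is assumed), loading a directory and printing the schema shows
+the columns listed above:
+
+{% highlight scala %}
+
+// Sketch only: "/path/to/data" stands in for a directory of binary files.
+val df = spark.read.format("binaryFile").load("/path/to/data")
+df.printSchema()
+// Expect path (string), modificationTime (timestamp), length (long),
+// content (binary), plus any discovered partition columns.
+
+{% endhighlight %}
+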
+It supports the following read option:
+<table class="table">
+ <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
+ <tr>
+ <td><code>pathGlobFilter</code></td>
+ <td>none (accepts all)</td>
+ <td>
+ An optional glob pattern to only include files with paths matching the pattern.
+ The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
+ It does not change the behavior of partition discovery.
+ </td>
+ </tr>
+</table>
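+
+Because the glob filters leaf file names and leaves partition discovery untouched, a
+hedged sketch of reading a partitioned layout (the directory structure and the `year`
+partition column below are purely illustrative) looks like this:
+
+{% highlight scala %}
+
+// Illustrative layout:
+//   /path/to/data/year=2019/a.png
+//   /path/to/data/year=2020/b.png
+// The glob only filters file names; the `year` partition column is still discovered.
+val pngs = spark.read.format("binaryFile")
+  .option("pathGlobFilter", "*.png")
+  .load("/path/to/data")
+pngs.select("path", "length", "year").show()
+
+{% endhighlight %}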
+
+To read whole binary files, you need to specify the data source `format` as `binaryFile`.
+For example, the following code reads all PNG files from the input directory:
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+
+spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+
+spark.read().format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data");
+
+{% endhighlight %}
+</div>
+<div data-lang="python" markdown="1">
+{% highlight python %}
+
+spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load("/path/to/data")
+
+{% endhighlight %}
+</div>
+<div data-lang="r" markdown="1">
+{% highlight r %}
+
+read.df("/path/to/data", source = "binaryFile", pathGlobFilter = "*.png")
+
+{% endhighlight %}
+</div>
+</div>
+
+The binary file data source does not support writing a DataFrame back to the original files.
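+
+As a hedged sketch of what to expect if you try (the exact exception type and message may
+vary by release, and the paths are placeholders):
+
+{% highlight scala %}
+
+import scala.util.Try
+
+// Reading works, but writing back in the same format is expected to fail
+// because the source is read-only.
+val df = spark.read.format("binaryFile").load("/path/to/data")
+val attempt = Try(df.write.format("binaryFile").save("/path/to/output"))
+println(attempt.isFailure)  // expected: true
+
+{% endhighlight %}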
diff --git a/docs/sql-data-sources.md b/docs/sql-data-sources.md
index d908aac..079c540 100644
--- a/docs/sql-data-sources.md
+++ b/docs/sql-data-sources.md
@@ -54,4 +54,5 @@ goes into specific options that are available for the built-in data sources.
* [Compatibility with Databricks spark-avro](sql-data-sources-avro.html#compatibility-with-databricks-spark-avro)
* [Supported types for Avro -> Spark SQL conversion](sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion)
* [Supported types for Spark SQL -> Avro conversion](sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion)
+* [Whole Binary Files](sql-data-sources-binaryFile.html)
* [Troubleshooting](sql-data-sources-troubleshooting.html)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]