Repository: spark
Updated Branches:
  refs/heads/branch-2.4 d5e694805 -> a20660b6f


[SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide

## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?
Scala:
<img width="1022" alt="1" 
src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png";>

Java:
<img width="1019" alt="2" 
src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png";>

Python:
<img width="1022" alt="3" 
src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png";>

R:
<img width="1024" alt="4" 
src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png";>

Closes #22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6540c2f8f31bbde4df57e48698f46bb1815740ff)
Signed-off-by: Wenchen Fan <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a20660b6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a20660b6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a20660b6

Branch: refs/heads/branch-2.4
Commit: a20660b6fcee2d436ef182337255cea5e2eb7216
Parents: d5e6948
Author: WeichenXu <[email protected]>
Authored: Thu Oct 25 23:03:16 2018 +0800
Committer: Wenchen Fan <[email protected]>
Committed: Thu Oct 25 23:04:06 2018 +0800

----------------------------------------------------------------------
 docs/_data/menu-ml.yaml                         |   2 +
 docs/ml-datasource.md                           | 108 +++++++++++++++++++
 .../spark/ml/source/image/ImageDataSource.scala |  17 +--
 3 files changed, 120 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/a20660b6/docs/_data/menu-ml.yaml
----------------------------------------------------------------------
diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml
index b5a6641..8e366f7 100644
--- a/docs/_data/menu-ml.yaml
+++ b/docs/_data/menu-ml.yaml
@@ -1,5 +1,7 @@
 - text: Basic statistics
   url: ml-statistics.html
+- text: Data sources
+  url: ml-datasource
 - text: Pipelines
   url: ml-pipeline.html
 - text: Extracting, transforming and selecting features

http://git-wip-us.apache.org/repos/asf/spark/blob/a20660b6/docs/ml-datasource.md
----------------------------------------------------------------------
diff --git a/docs/ml-datasource.md b/docs/ml-datasource.md
new file mode 100644
index 0000000..1508332
--- /dev/null
+++ b/docs/ml-datasource.md
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data source in ML to load data.
+Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also 
provide some specific data sources for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory, it can 
load compressed image (jpeg, png, etc.) into raw image representation via 
`ImageIO` in Java library.
+The loaded DataFrame has one `StructType` column: "image", containing image 
data stored as image schema.
+The schema of the `image` column is:
+ - origin: `StringType` (represents the file path of the image)
+ - height: `IntegerType` (height of the image)
+ - width: `IntegerType` (width of the image)
+ - nChannels: `IntegerType` (number of image channels)
+ - mode: `IntegerType` (OpenCV-compatible type)
+ - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in 
most cases)
+
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
+implements a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight scala %}
+scala> val df = spark.read.format("image").option("dropInvalid", 
true).load("data/mllib/images/origin/kittens")
+df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: 
int ... 4 more fields>]
+
+scala> df.select("image.origin", "image.width", 
"image.height").show(truncate=false)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 
|width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  
|311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  
|313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  
|200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  
|296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
+implements Spark SQL data source API for loading image data as DataFrame.
+
+{% highlight java %}
+Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", 
true).load("data/mllib/images/origin/kittens");
+imageDF.select("image.origin", "image.width", "image.height").show(false);
+/*
+Will output:
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 
|width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  
|311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  
|313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  
|200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  
|296   |
++-----------------------------------------------------------------------+-----+------+
+*/
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+In PySpark we provide Spark SQL data source API for loading image data as 
DataFrame.
+
+{% highlight python %}
+>>> df = spark.read.format("image").option("dropInvalid", 
true).load("data/mllib/images/origin/kittens")
+>>> df.select("image.origin", "image.width", 
"image.height").show(truncate=False)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 
|width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  
|311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  
|313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  
|200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  
|296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="r" markdown="1">
+In SparkR we provide Spark SQL data source API for loading image data as 
DataFrame.
+
+{% highlight r %}
+> df = read.df("data/mllib/images/origin/kittens", "image")
+> head(select(df, df$image.origin, df$image.width, df$image.height))
+
+1               file:///spark/data/mllib/images/origin/kittens/54893.jpg
+2            file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
+3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
+4            file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
+  width height
+1   300    311
+2   199    313
+3   300    200
+4   300    296
+
+{% endhighlight %}
+</div>
+
+
+</div>

http://git-wip-us.apache.org/repos/asf/spark/blob/a20660b6/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala 
b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
index a111c95..d4d7408 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
@@ -19,14 +19,17 @@ package org.apache.spark.ml.source.image
 
 /**
  * `image` package implements Spark SQL data source API for loading image data 
as `DataFrame`.
- * The loaded `DataFrame` has one `StructType` column: `image`.
+ * It can load compressed image (jpeg, png, etc.) into raw image 
representation via `ImageIO`
+ * in Java library.
+ * The loaded `DataFrame` has one `StructType` column: `image`, containing 
image data stored
+ * as image schema.
  * The schema of the `image` column is:
- *  - origin: String (represents the file path of the image)
- *  - height: Int (height of the image)
- *  - width: Int (width of the image)
- *  - nChannels: Int (number of the image channels)
- *  - mode: Int (OpenCV-compatible type)
- *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR 
in most cases)
+ *  - origin: `StringType` (represents the file path of the image)
+ *  - height: `IntegerType` (height of the image)
+ *  - width: `IntegerType` (width of the image)
+ *  - nChannels: `IntegerType` (number of image channels)
+ *  - mode: `IntegerType` (OpenCV-compatible type)
+ *  - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR 
in most cases)
  *
  * To use image data source, you need to set "image" as the format in 
`DataFrameReader` and
  * optionally specify the data source options, for example:


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to