Repository: spark
Updated Branches:
  refs/heads/master 002f9c169 -> 6540c2f8f

[SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide

## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?

Scala:

<img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">

Java:

<img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">

Python:

<img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">

R:

<img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">

Closes #22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6540c2f8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6540c2f8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6540c2f8

Branch: refs/heads/master
Commit: 6540c2f8f31bbde4df57e48698f46bb1815740ff
Parents: 002f9c1
Author: WeichenXu <[email protected]>
Authored: Thu Oct 25 23:03:16 2018 +0800
Committer: Wenchen Fan <[email protected]>
Committed: Thu Oct 25 23:03:16 2018 +0800

----------------------------------------------------------------------
 docs/_data/menu-ml.yaml                         |   2 +
 docs/ml-datasource.md                           | 108 +++++++++++++++++++
 .../spark/ml/source/image/ImageDataSource.scala |  17 +--
 3 files changed, 120 insertions(+), 7 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/6540c2f8/docs/_data/menu-ml.yaml
----------------------------------------------------------------------
diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml
index b5a6641..8e366f7 100644
--- a/docs/_data/menu-ml.yaml
+++ b/docs/_data/menu-ml.yaml
@@ -1,5 +1,7 @@
 - text: Basic statistics
   url: ml-statistics.html
+- text: Data sources
+  url: ml-datasource.html
 - text: Pipelines
   url: ml-pipeline.html
 - text: Extracting, transforming and selecting features

http://git-wip-us.apache.org/repos/asf/spark/blob/6540c2f8/docs/ml-datasource.md
----------------------------------------------------------------------
diff --git a/docs/ml-datasource.md b/docs/ml-datasource.md
new file mode 100644
index 0000000..1508332
--- /dev/null
+++ b/docs/ml-datasource.md
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data sources in ML to load data.
+Besides some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory. It can load compressed images (jpeg, png, etc.) into a raw image representation via `ImageIO` in the Java library.
+The loaded DataFrame has one `StructType` column: "image", containing image data stored as image schema.
+The schema of the `image` column is:
+ - origin: `StringType` (represents the file path of the image)
+ - height: `IntegerType` (height of the image)
+ - width: `IntegerType` (width of the image)
+ - nChannels: `IntegerType` (number of image channels)
+ - mode: `IntegerType` (OpenCV-compatible type)
+ - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
+
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
+implements a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight scala %}
+scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
+df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]
+
+scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
+implements a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight java %}
+Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
+imagesDF.select("image.origin", "image.width", "image.height").show(false);
+/*
+Will output:
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+*/
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+In PySpark we provide a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight python %}
+>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
+>>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="r" markdown="1">
+In SparkR we provide a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight r %}
+> df = read.df("data/mllib/images/origin/kittens", "image")
+> head(select(df, df$image.origin, df$image.width, df$image.height))
+
+1               file:///spark/data/mllib/images/origin/kittens/54893.jpg
+2            file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
+3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
+4            file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
+  width height
+1   300    311
+2   199    313
+3   300    200
+4   300    296
+
+{% endhighlight %}
+</div>
+
+
+</div>

http://git-wip-us.apache.org/repos/asf/spark/blob/6540c2f8/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
index a111c95..d4d7408 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
@@ -19,14 +19,17 @@ package org.apache.spark.ml.source.image
 
 /**
  * `image` package implements Spark SQL data source API for loading image data as `DataFrame`.
- * The loaded `DataFrame` has one `StructType` column: `image`.
+ * It can load compressed images (jpeg, png, etc.) into a raw image representation via `ImageIO`
+ * in the Java library.
+ * The loaded `DataFrame` has one `StructType` column: `image`, containing image data stored
+ * as image schema.
  * The schema of the `image` column is:
- * - origin: String (represents the file path of the image)
- * - height: Int (height of the image)
- * - width: Int (width of the image)
- * - nChannels: Int (number of the image channels)
- * - mode: Int (OpenCV-compatible type)
- * - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
+ * - origin: `StringType` (represents the file path of the image)
+ * - height: `IntegerType` (height of the image)
+ * - width: `IntegerType` (width of the image)
+ * - nChannels: `IntegerType` (number of image channels)
+ * - mode: `IntegerType` (OpenCV-compatible type)
+ * - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
  *
  * To use image data source, you need to set "image" as the format in `DataFrameReader` and
  * optionally specify the data source options, for example:
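
A note on the `data` field described in the schema above: the following is a minimal sketch, not part of this commit, of how a single pixel could be located in a row-wise, channel-interleaved byte buffer using the `height`, `width` and `nChannels` fields. It assumes one unsigned byte per channel (the `mode` field is not interpreted here), and the helper name `pixel_at` and the sample bytes are hypothetical, chosen only for illustration.

```python
from typing import Tuple


def pixel_at(data: bytes, height: int, width: int, n_channels: int,
             row: int, col: int) -> Tuple[int, ...]:
    """Return the channel values of the pixel at (row, col).

    Assumption: one unsigned byte per channel, stored row by row with the
    channels of each pixel interleaved (B, G, R for 3-channel images).
    """
    expected = height * width * n_channels
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    if not (0 <= row < height and 0 <= col < width):
        raise IndexError(f"pixel ({row}, {col}) is outside {height}x{width}")
    # Skip `row` full rows, then `col` pixels within the row.
    offset = (row * width + col) * n_channels
    return tuple(data[offset:offset + n_channels])


# Hypothetical 2x2 image with 3 channels; bytes 0..11 stand in for pixel data.
sample = bytes(range(12))
bottom_left = pixel_at(sample, height=2, width=2, n_channels=3, row=1, col=0)
blue, green, red = bottom_left  # BGR order, as described in the schema
```

In practice the arguments would come from one row of the `image` column (`image.data`, `image.height`, `image.width`, `image.nChannels`); images whose `mode` denotes a different channel depth would need a different decoding step.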
