GitHub user imatiach-msft opened a pull request:

    https://github.com/apache/spark/pull/19439

    [SPARK-21866][ML][PySpark] Adding spark image reader

    ## What changes were proposed in this pull request?
    Adds a Spark image reader: an implementation of a schema for representing 
images in Spark DataFrames.
    
    The code is taken from the Spark package located here:
    (https://github.com/Microsoft/spark-images)
    
    Please see the JIRA for more information 
(https://issues.apache.org/jira/browse/SPARK-21866)
    
    Please see the mailing list for SPIP vote and approval information:
    
(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)
    
    # Background and motivation
    As Apache Spark sees wider use in industry, new use cases are emerging 
for data formats beyond the traditional SQL types 
or the numerical types (vectors and matrices). Deep Learning applications 
commonly deal with image processing. A number of projects add some Deep 
Learning capabilities to Spark (see list below), but they struggle to 
communicate with each other or with MLlib pipelines because there is no 
standard way to represent an image in Spark DataFrames. We propose to federate 
efforts for representing images in Spark by defining a representation that 
caters to the most common needs of users and library developers.
    This SPIP proposes a specification to represent images in Spark DataFrames 
and Datasets (based on existing industrial standards), and an interface for 
loading sources of images. It is not meant to be a full-fledged image 
processing library, but rather the core description that other libraries and 
users can rely on. Several packages already offer various processing facilities 
for transforming images or doing more complex operations, and each has various 
design tradeoffs that make them better as standalone solutions.
    This project is a joint collaboration between Microsoft and Databricks, 
which have been testing this design in two open source packages: MMLSpark and 
Deep Learning Pipelines.
    The proposed image format is an in-memory, decompressed representation that 
targets low-level applications. It uses significantly more memory than 
compressed image representations such as JPEG or PNG, but it allows easy 
communication with popular image processing libraries and has no decoding 
overhead.
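    To make the memory tradeoff concrete, here is a minimal sketch of a decompressed image record in plain Python. The `RawImage` class, its field names, and the `make_blank` helper are hypothetical and chosen only for illustration; they are not the schema or API added by this patch. The sketch assumes one byte per channel value, so the buffer size is height × width × channels.

```python
from dataclasses import dataclass


@dataclass
class RawImage:
    """Hypothetical in-memory image record (illustrative names only)."""
    origin: str      # where the image came from, e.g. a file path
    height: int      # number of pixel rows
    width: int       # number of pixel columns
    n_channels: int  # 1 for grayscale, 3 for color, 4 for color plus alpha
    data: bytes      # decompressed pixels, assuming one byte per channel value

    def expected_size(self) -> int:
        # Size in bytes of the decompressed buffer under the assumption above.
        return self.height * self.width * self.n_channels


def make_blank(origin: str, height: int, width: int, n_channels: int) -> RawImage:
    # Build an all-zero image whose buffer matches the declared dimensions.
    return RawImage(origin, height, width, n_channels,
                    bytes(height * width * n_channels))


img = make_blank("example.png", 480, 640, 3)
print(len(img.data))  # 921600 bytes for a 640x480 three-channel image
```

Under these assumptions a 640x480 color image always occupies about 0.9 MB in memory, whatever its compressed size on disk. In exchange, a library can read the buffer directly with no decoding step.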
    
    ## How was this patch tested?
    
    Unit tests in Scala (ImageSchemaSuite) and unit tests in Python.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/imatiach-msft/spark ilmat/spark-images

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19439.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19439
    
----
commit 22baf022b2f109bb1f5eba0b13ea34de894cd14c
Author: Ilya Matiach <il...@microsoft.com>
Date:   2017-10-04T21:10:26Z

    [SPARK-21866][ML][PySpark] Adding spark image reader

----


---

