Thank you, everyone, for the comments and the votes. We will follow up shortly with a pull request.
On Wed, Sep 27, 2017 at 6:32 PM, Joseph Bradley <jos...@databricks.com> wrote:

This vote passes with 11 +1s (4 binding) and no +0s or -1s.

+1:
Sean Owen (binding)
Holden Karau
Denny Lee
Reynold Xin (binding)
Joseph Bradley (binding)
Noman Khan
Weichen Xu
Yanbo Liang
Dongjoon Hyun
Matei Zaharia (binding)
Vaquar Khan

Thanks everyone!
Joseph

On Sat, Sep 23, 2017 at 4:23 PM, vaquar khan <vaquar.k...@gmail.com> wrote:

+1, looks good.

Regards,
Vaquar Khan

On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

+1; we should consider something similar for multi-dimensional tensors too.

Matei

On Sep 23, 2017, at 7:27 AM, Yanbo Liang <yblia...@gmail.com> wrote:

+1

On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan <nomanbp...@live.com> wrote:

+1

Regards,
Noman

From: Denny Lee <denny.g....@gmail.com>
Sent: Friday, September 22, 2017 2:59:33 AM
To: Apache Spark Dev; Sean Owen; Tim Hunter
Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

+1

On Thu, Sep 21, 2017 at 11:15, Sean Owen <so...@cloudera.com> wrote:

Am I right that this doesn't mean other packages would have to use this representation, only that they could?

The representation looked fine to me with respect to what DL frameworks need.

My previous comment was that this is actually quite lightweight. It's kind of like how I/O support is provided for CSV and JSON, so it makes enough sense to add to Spark. It doesn't really preclude other solutions.

For those reasons I think it's fine. +1

On Thu, Sep 21, 2017 at 6:32 PM, Tim Hunter <timhun...@databricks.com> wrote:

Hello community,

I would like to call for a vote on SPARK-21866. It is a short proposal with important applications for image processing and deep learning. Joseph Bradley has offered to be the shepherd.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf

Background and motivation

As Apache Spark is used more and more in industry, new use cases are emerging for data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep learning applications commonly deal with image processing. A number of projects add deep learning capabilities to Spark (see the list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on.
Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has design tradeoffs that make it better suited as a standalone solution.

This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open-source packages: MMLSpark and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that targets low-level applications. It uses significantly more memory than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.

Target users and personas:

Data scientists, data engineers, library developers.

The following libraries define primitives for loading and representing images, and will gain from a common interchange format (in alphabetical order):
• BigDL
• DeepLearning4J
• Deep Learning Pipelines
• MMLSpark
• TensorFlow (Spark connector)
• TensorFlowOnSpark
• TensorFrames
• Thunder

Goals:
• A simple representation of images in Spark DataFrames, based on pre-existing industrial standards (OpenCV).
• This format should eventually allow the development of high-performance integration points with image processing libraries such as libOpenCV, Google TensorFlow, CNTK, and other C libraries.
• The reader should be able to read popular image formats from distributed sources.

Non-goals:

Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly aims at the most common use case in industry today: multi-channel matrices of binary, int32, int64, float, or double data that can fit comfortably in the heap of the JVM:
• the total size of an image should be restricted to less than roughly 2 GB;
• the meaning of color channels is application-specific and is not mandated by the standard (in line with the OpenCV standard);
• specialized formats used in meteorology, the medical field, etc. are not supported;
• this format is specialized to images and does not attempt to solve the more general problem of representing n-dimensional tensors in Spark.

Proposed API changes

We propose to add a new package in the package structure, under the MLlib project:

org.apache.spark.image

Data format

We propose to add the following structure:

imageSchema = StructType([

StructField("mode", StringType(), False)
• The exact representation of the data.
• The values follow the OpenCV convention. The type encodes both the depth and the number of channels: for example, type "CV_8UC3" means "3-channel unsigned bytes", and BGRA format would be CV_8UC4 (value 32 in the OpenCV table), with the channel order specified by convention.
• The exact channel ordering and meaning of each channel are dictated by convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
• If the image fails to load, the value is the empty string "".

StructField("origin", StringType(), True)
• Information about the origin of the image. The content of this field is application-specific.
• When the image is loaded from files, users should expect to find the file name in this field.

StructField("height", IntegerType(), False)
• The height of the image, in pixels.
• If the image fails to load, the value is -1.

StructField("width", IntegerType(), False)
• The width of the image, in pixels.
• If the image fails to load, the value is -1.

StructField("nChannels", IntegerType(), False)
• The number of channels in this image: typically 1 (B&W), 3 (RGB), or 4 (BGRA).
• If the image fails to load, the value is -1.

StructField("data", BinaryType(), False)
• Packed array content. Due to an implementation limitation, it cannot currently store more than 2 billion pixels.
• The data is stored in pixel-by-pixel, row-wise BGR order. This follows the OpenCV convention.
• If the image fails to load, this array is empty.

])
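To make the intended shape concrete, here is a minimal Scala sketch of this schema as it might be declared. The value name imageSchema mirrors the pseudocode above; the exact declaration site is an illustrative assumption, not part of the proposal:

    import org.apache.spark.sql.types._

    // Sketch of the proposed image schema; comments note the
    // per-field sentinel values used when an image fails to load.
    val imageSchema: StructType = StructType(Seq(
      StructField("mode", StringType, nullable = false),       // OpenCV type string, e.g. "CV_8UC3"; "" on failure
      StructField("origin", StringType, nullable = true),      // application-specific, e.g. the source file name
      StructField("height", IntegerType, nullable = false),    // in pixels; -1 on failure
      StructField("width", IntegerType, nullable = false),     // in pixels; -1 on failure
      StructField("nChannels", IntegerType, nullable = false), // 1 (B&W), 3 (RGB), or 4 (BGRA); -1 on failure
      StructField("data", BinaryType, nullable = false)        // packed pixels, row-wise BGR; empty on failure
    ))

    // Assuming dense, row-major packing with no row padding, channel c of
    // the pixel at (row, col) would live at element index:
    //   (row * width + col) * nChannels + c
    // (multiply by the per-channel byte width for depths beyond 8 bits)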
For more information about image types, see the OpenCV guide on types: http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates

The reference implementation provides functions to convert popular formats (JPEG, PNG, etc.) to the image specification above, and functions to verify that an image is valid.

Image ingest API

We propose the following function to load images from a remote distributed source as a DataFrame. Here is the signature in Scala; the Python interface is similar. For compatibility with Java, this function should be made available through a builder pattern or through the DataSource API. The exact mechanics can be discussed during implementation; the goal of the proposal below is to specify the behavior and the options:

    def readImages(
        path: String,
        session: SparkSession = null,
        recursive: Boolean = false,
        numPartitions: Int = 0,
        dropImageFailures: Boolean = false,
        // Experimental options
        sampleRatio: Double = 1.0): DataFrame

The type of the returned DataFrame should be the structure type above, with the expectation that all the file names are filled in.

Mandatory parameters:
• path: a directory on a file system that contains images.

Optional parameters:
• session (SparkSession, default null): the Spark session to use to create the DataFrame. If not provided, the current default Spark session is used via SparkSession.getOrCreate().
• recursive (bool, default false): whether to take only the top-level images or to look into directories recursively.
• numPartitions (int, default 0): the number of partitions of the final DataFrame. By default, Spark's default number of partitions is used.
• dropImageFailures (bool, default false): drop the files that failed to load. If false (do not drop), invalid images are kept, with the sentinel values described above.

Parameters that are experimental and may be quickly deprecated. These would be useful to have but are not critical for a first cut:
• sampleRatio (float, in (0,1], default 1): if less than 1, returns a fraction of the data. There is no statistical guarantee about how the sampling is performed. This proved to be very helpful for fast prototyping. Marked as experimental since it should eventually be pushed into Spark core.
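As a usage illustration, and under the assumption that readImages is exposed from the proposed org.apache.spark.image package (the exact entry point is to be settled during implementation), loading a directory of images could look like this:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    // Assumed import location for readImages; the SPIP proposes the
    // org.apache.spark.image package, but the entry point is TBD.
    import org.apache.spark.image._

    val spark = SparkSession.builder().getOrCreate()

    // Recursively load images, dropping files that fail to decode,
    // and repartition the result for downstream work.
    val images: DataFrame = readImages(
      "hdfs:///data/images",
      session = spark,
      recursive = true,
      numPartitions = 64,
      dropImageFailures = true)

    images.select("origin", "height", "width", "nChannels").show(5)

Had dropImageFailures been left false, rows for invalid images would remain and could be filtered out by hand using the -1 height/width sentinels.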
The implementation is expected to be in Scala for performance, with a wrapper for Python.

This function should be lazy to the extent possible: it should not trigger access to the data when called. Ideally, any file system supported by Spark should be supported when loading images. There may be restrictions for some options, such as zip files.

The reference implementation also has some experimental options (undocumented here).

Reference implementation

A reference implementation is available as an open-source Spark package in this repository (Apache 2.0 license):

https://github.com/Microsoft/spark-images

This Spark package will also be published in binary form on spark-packages.org.

Comments about the API should be addressed on the JIRA ticket.

Optional: rejected designs

The use of User-Defined Types was considered. It adds some burden to the implementation in the various languages and does not provide significant advantages.