Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
Thank you everyone for the comments and the votes. We will follow up shortly with a pull request.

On Wed, Sep 27, 2017 at 6:32 PM, Joseph Bradley wrote:
> This vote passes with 11 +1s (4 binding) and no +0s or -1s.
>
> +1:
> Sean Owen (binding)
> Holden Karau
> Denny Lee
> Reynold Xin (binding)
> Joseph Bradley (binding)
> Noman Khan
> Weichen Xu
> Yanbo Liang
> Dongjoon Hyun
> Matei Zaharia (binding)
> Vaquar Khan
>
> Thanks everyone!
> Joseph
Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
This vote passes with 11 +1s (4 binding) and no +0s or -1s.

+1:
Sean Owen (binding)
Holden Karau
Denny Lee
Reynold Xin (binding)
Joseph Bradley (binding)
Noman Khan
Weichen Xu
Yanbo Liang
Dongjoon Hyun
Matei Zaharia (binding)
Vaquar Khan

Thanks everyone!
Joseph

On Sat, Sep 23, 2017 at 4:23 PM, vaquar khan wrote:
> +1 looks good,
>
> Regards,
> Vaquar khan
Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
+1 looks good,

Regards,
Vaquar khan

On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia wrote:
> +1; we should consider something similar for multi-dimensional tensors too.
>
> Matei
Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
+1; we should consider something similar for multi-dimensional tensors too.

Matei

> On Sep 23, 2017, at 7:27 AM, Yanbo Liang wrote:
>
> +1
Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
+1

On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan wrote:
> +1
>
> Regards
> Noman
Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
+1

Regards
Noman

From: Denny Lee
Sent: Friday, September 22, 2017 2:59:33 AM
To: Apache Spark Dev; Sean Owen; Tim Hunter
Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

+1
Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
+1

On Thu, Sep 21, 2017 at 11:15 Sean Owen wrote:
> Am I right that this doesn't mean other packages would use this
> representation, but that they could?
>
> The representation looked fine to me w.r.t. what DL frameworks need.
>
> My previous comment was that this is actually quite lightweight. It's kind
> of like how I/O support is provided for CSV and JSON, so makes enough sense
> to add to Spark. It doesn't really preclude other solutions.
>
> For those reasons I think it's fine. +1
Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
Am I right that this doesn't mean other packages would use this representation, but that they could?

The representation looked fine to me w.r.t. what DL frameworks need.

My previous comment was that this is actually quite lightweight. It's kind of like how I/O support is provided for CSV and JSON, so makes enough sense to add to Spark. It doesn't really preclude other solutions.

For those reasons I think it's fine. +1

On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter wrote:
> Hello community,
>
> I would like to call for a vote on SPARK-21866. It is a short proposal that has important applications for image processing and deep learning. Joseph Bradley has offered to be the shepherd.
>
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
> PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>
> Background and motivation
>
> As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
>
> This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.
>
> This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
>
> The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.
>
> Targets users and personas:
>
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing images, and will gain from a common interchange format (in alphabetical order):
>
> - BigDL
> - DeepLearning4J
> - Deep Learning Pipelines
> - MMLSpark
> - TensorFlow (Spark connector)
> - TensorFlowOnSpark
> - TensorFrames
> - Thunder
>
> Goals:
>
> - Simple representation of images in Spark DataFrames, based on pre-existing industrial standards (OpenCV)
> - This format should eventually allow the development of high-performance integration points with image processing libraries such as libOpenCV, Google TensorFlow, CNTK, and other C libraries.
> - The reader should be able to read popular formats of images from distributed sources.
>
> Non-Goals:
>
> Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly aims at the most common use case in the industry currently: multi-channel matrices of binary, int32, int64, float or double data that can fit comfortably in the heap of the JVM:
>
> - the total size of an image should be restricted to less than 2GB (roughly)
> - the meaning of color channels is application-specific and is not mandated by the standard (in line with the OpenCV standard)
> - specialized formats used in meteorology, the medical field, etc. are not supported
> - this format is specialized to images and does not attempt to solve the more general problem of representing n-dimensional tensors in Spark
>
> Proposed API changes
>
> We propose to add a new package in the package structure, under the MLlib project:
> org.apache.spark.image
>
> Data format
>
> We propose to add the following structure:
>
> imageSchema = StructType([
>
> - StructField("mode", StringType(), False),
>   - The exact representation of the data.
>   - The values are described in the following OpenCV convention. Basically, the type has both "depth" and "number of channels" info: in particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 (value 32 in the table) with the channel order specified by convention.
>   - The
[VOTE][SPIP] SPARK-21866 Image support in Apache Spark
Hello community,

I would like to call for a vote on SPARK-21866. It is a short proposal that has important applications for image processing and deep learning. Joseph Bradley has offered to be the shepherd.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf

Background and motivation

As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.

This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.

Targets users and personas:

Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing images, and will gain from a common interchange format (in alphabetical order):

- BigDL
- DeepLearning4J
- Deep Learning Pipelines
- MMLSpark
- TensorFlow (Spark connector)
- TensorFlowOnSpark
- TensorFrames
- Thunder

Goals:

- Simple representation of images in Spark DataFrames, based on pre-existing industrial standards (OpenCV)
- This format should eventually allow the development of high-performance integration points with image processing libraries such as libOpenCV, Google TensorFlow, CNTK, and other C libraries.
- The reader should be able to read popular formats of images from distributed sources.

Non-Goals:

Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly aims at the most common use case in the industry currently: multi-channel matrices of binary, int32, int64, float or double data that can fit comfortably in the heap of the JVM:

- the total size of an image should be restricted to less than 2GB (roughly)
- the meaning of color channels is application-specific and is not mandated by the standard (in line with the OpenCV standard)
- specialized formats used in meteorology, the medical field, etc. are not supported
- this format is specialized to images and does not attempt to solve the more general problem of representing n-dimensional tensors in Spark

Proposed API changes

We propose to add a new package in the package structure, under the MLlib project:
org.apache.spark.image

Data format

We propose to add the following structure:

imageSchema = StructType([

- StructField("mode", StringType(), False),
  - The exact representation of the data.
  - The values are described in the following OpenCV convention. Basically, the type has both "depth" and "number of channels" info: in particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 (value 32 in the table) with the channel order specified by convention.
  - The exact channel ordering and meaning of each channel is dictated by convention. By default, the order is RGB (3 channels) and BGRA (4 channels). If the image failed to load, the value is the empty string "".
- StructField("origin", StringType(), True),
  - Some information about the origin of the image. The content of this is application-specific.
  - When the image is loaded from files, users should expect to find the file name in this field.
- StructField("height", IntegerType(), False),
  - the height of the image, pixels
  - If the image fails to load, the value is -1.
- StructField("width", Int
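
Editorial illustration (not part of the original message): the "mode" strings described above pack a depth and a channel count, and together with the height and width they determine how much memory a decompressed image needs. The sketch below is a minimal, plain-Python illustration of that reasoning. It assumes OpenCV's usual type encoding (depth code plus (channels - 1) shifted left by 3); the helper names parse_mode, opencv_type_code and decompressed_size_bytes are hypothetical and do not come from the proposal, and the byte limit is an assumed stand-in for the "less than 2GB (roughly)" restriction.

```python
import re

# OpenCV numeric depth codes (CV_8U = 0 ... CV_64F = 6).
DEPTH_CODES = {"8U": 0, "8S": 1, "16U": 2, "16S": 3, "32S": 4, "32F": 5, "64F": 6}
# Bytes used by one element of each depth.
DEPTH_BYTES = {"8U": 1, "8S": 1, "16U": 2, "16S": 2, "32S": 4, "32F": 4, "64F": 8}

# Assumption for this sketch: treat "less than 2GB (roughly)" as the
# largest signed 32-bit value, which is also the JVM array length limit.
ASSUMED_MAX_BYTES = 2**31 - 1

MODE_PATTERN = re.compile(r"^CV_(8U|8S|16U|16S|32S|32F|64F)C(\d+)$")


def parse_mode(mode):
    """Split a mode string such as "CV_8UC3" into (depth, channels)."""
    match = MODE_PATTERN.match(mode)
    if match is None:
        raise ValueError("unrecognized mode: %r" % mode)
    channels = int(match.group(2))
    if channels < 1:
        raise ValueError("channel count must be at least 1")
    return match.group(1), channels


def opencv_type_code(mode):
    """Integer type code under OpenCV's depth + ((channels - 1) << 3) rule."""
    depth, channels = parse_mode(mode)
    return DEPTH_CODES[depth] + ((channels - 1) << 3)


def decompressed_size_bytes(mode, height, width):
    """Size of the raw pixel buffer: height * width * channels * bytes per element."""
    depth, channels = parse_mode(mode)
    return height * width * channels * DEPTH_BYTES[depth]


if __name__ == "__main__":
    size = decompressed_size_bytes("CV_8UC3", 1080, 1920)
    print(opencv_type_code("CV_8UC3"), size, size <= ASSUMED_MAX_BYTES)
```

Under this encoding CV_8UC3 evaluates to 16 and CV_8UC4 to 24. The table the proposal refers to is not reproduced in this message, so the numbering it uses may differ from the OpenCV formula assumed here.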