[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817080#comment-16817080 ]
Robert Joseph Evans commented on SPARK-24579:
---------------------------------------------
The SPIP in SPARK-27396 covers a superset of the functionality described here.
> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> ----------------------------------------------------------------------------
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
> Issue Type: Epic
> Components: ML, PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
> Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange
> between Apache Spark and DL/AI Frameworks.pdf
>
>
> (see the attached SPIP PDF for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark
> as a unified analytics engine and the rise of AI frameworks like TensorFlow
> and Apache MXNet (incubating). Both big data and AI are indispensable to
> driving business innovation, and there have been multiple attempts from both
> communities to bring them together.
> We have seen efforts from the AI community to implement data solutions for AI
> frameworks, such as tf.data and tf.Transform. However, with 50+ data sources
> and built-in SQL, DataFrames, and Streaming features, Spark remains the
> community's choice for big data. This is why there have been many efforts to
> integrate DL/AI frameworks with Spark to leverage its power, for example the
> TFRecords data source for Spark, TensorFlowOnSpark, TensorFrames, etc. As
> part of Project Hydrogen, this SPIP takes a different angle on Spark + AI
> unification.
> None of these integrations are possible without exchanging data between Spark
> and external DL/AI frameworks, and performance matters. However, there is no
> standard way to exchange data, so implementation and performance optimization
> are fragmented. For example, TensorFlowOnSpark uses Hadoop
> InputFormat/OutputFormat for TensorFlow's TFRecords to load and save data and
> passes the RDD records to TensorFlow in Python, while TensorFrames converts
> Spark DataFrame Rows to/from TensorFlow Tensors using TensorFlow's Java API.
> How can we reduce the complexity?
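> To make the fragmentation concrete, a row-at-a-time hand-off in PySpark
> typically looks like the following sketch; the feature columns and the
> conversion function are hypothetical, not any of these projects' actual APIs:
>
> import numpy as np
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("row-handoff-sketch").getOrCreate()
>
> # Hypothetical feature DataFrame with two numeric columns f1 and f2.
> df = spark.range(0, 1000).selectExpr("CAST(id AS DOUBLE) AS f1",
>                                      "rand() AS f2")
>
> def rows_to_arrays(rows):
>     # Each Row is pickled, shipped to the Python worker, and converted by
>     # hand before the DL/AI framework ever sees it.
>     for row in rows:
>         yield np.array([row.f1, row.f2], dtype=np.float32)
>
> arrays = df.rdd.mapPartitions(rows_to_arrays)
> # Every integration (TFRecords InputFormat, TensorFrames, ...) re-implements
> # its own variant of this per-record, row-based conversion.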
> The proposal here is to standardize the data exchange interface (or format)
> between Spark and DL/AI frameworks and to optimize data conversion from/to
> this interface. DL/AI frameworks can then leverage Spark to load data from
> virtually anywhere without spending extra effort building complex data
> solutions, for example reading features from a production data warehouse or
> running streaming model inference. Spark users can use DL/AI frameworks
> without learning the specific data APIs implemented there. And developers on
> both sides can work on performance optimizations independently, given that
> the interface itself does not introduce significant overhead.
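> A minimal sketch of what such a standardized, columnar exchange can look like
> from the user's side, assuming PySpark with Arrow-based conversion enabled
> (the exact config key depends on the Spark version) and a hypothetical
> framework-side consumer:
>
> import numpy as np
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .appName("arrow-exchange-sketch")
>          # "spark.sql.execution.arrow.enabled" on Spark 2.3/2.4,
>          # "spark.sql.execution.arrow.pyspark.enabled" on Spark 3.x
>          .config("spark.sql.execution.arrow.pyspark.enabled", "true")
>          .getOrCreate())
>
> # Hypothetical feature table; any of Spark's 50+ data sources could back it.
> df = spark.range(0, 100000).selectExpr("id", "rand() AS feature")
>
> # With Arrow enabled, toPandas() transfers data to Python in columnar batches
> # instead of pickling row by row, which is the kind of optimized conversion
> # the SPIP proposes to standardize behind one interface.
> pdf = df.toPandas()
> features = np.asarray(pdf["feature"], dtype=np.float32)
> # features can now feed a DL/AI framework's input pipeline, e.g.
> # tf.data.Dataset.from_tensor_slices(features) in TensorFlow.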