[ 
https://issues.apache.org/jira/browse/SPARK-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hollin Wilkins updated SPARK-9084:
----------------------------------
    Description: 
Spark ML currently provides excellent support for feature manipulation, model 
selection, and prediction on large datasets. The models can all be easily 
serialized, but a fitted model cannot be used without a DataFrame, which means 
these models are only suited to batch processing. To support realtime ML 
pipelines, I propose adding three new methods to the Transformer class:

def transform(row: StructuredRow): StructuredRow
def transform(row: StructuredRow, paramMap: ParamMap): StructuredRow
def transform(row: StructuredRow, firstParamPair: ParamPair[_], 
otherParamPairs: ParamPair[_]*): StructuredRow

Here, a StructuredRow is a case class that combines an 
org.apache.spark.sql.Row with an org.apache.spark.sql.types.StructType. An 
alternative would be to change the transform method signature to take two 
objects, a StructType and a Row.
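
The issue names StructuredRow but does not spell out its fields, so the 
following is only a minimal sketch of what such a case class might look like. 
The field names (schema, row) and the helper methods get and withColumn are 
assumptions made for illustration; only Row, StructType, and StructField are 
real Spark SQL classes. It assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical: the proposal does not define the fields of StructuredRow.
case class StructuredRow(schema: StructType, row: Row) {
  require(schema.length == row.length,
    "schema and row must have the same number of fields")

  // Look a value up by column name, using the schema to find its position.
  def get[T](name: String): T = row.getAs[T](schema.fieldIndex(name))

  // Return a new StructuredRow with one extra column appended (assumed helper).
  def withColumn(field: StructField, value: Any): StructuredRow =
    StructuredRow(
      StructType(schema.fields :+ field),
      Row.fromSeq(row.toSeq :+ value))
}

object StructuredRowExample extends App {
  val schema = StructType(Seq(
    StructField("userId", StringType, nullable = false),
    StructField("amount", DoubleType, nullable = false)))

  // A single record, as it might arrive in one request.
  val input = StructuredRow(schema, Row("u-1", 12.5))

  // Append a derived column without touching a DataFrame.
  val output = input.withColumn(
    StructField("amountDoubled", DoubleType, nullable = false),
    input.get[Double]("amount") * 2)

  println(output.schema.fieldNames.mkString(", "))
  println(output.row)
}
```

Bundling the schema with the row lets a transformer resolve columns by name, 
which a bare Row created with Row(...) cannot do on its own.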

This change requires adding the new transform methods to each implementor of 
the Transformer class.

Following this change, it would be straightforward to include the Spark JARs 
in an API server, deserialize an ML PipelineModel object, take incoming data 
from users, convert it into a StructuredRow, and feed it into the 
PipelineModel to get a realtime result.
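
As a rough illustration of the serving flow described above, the sketch below 
chains row-level stages the way a fitted pipeline would and applies them to a 
single incoming record. StructuredRow, RowTransformer, Scale, and RowPipeline 
are all hypothetical stand-ins invented for this example; they are not part of 
Spark, and deserializing a real PipelineModel is deliberately left out. It 
assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical stand-ins: none of the types defined below exist in Spark.
case class StructuredRow(schema: StructType, row: Row)

trait RowTransformer {
  def transform(in: StructuredRow): StructuredRow
}

// Toy stage: appends outputCol = inputCol * factor to the row and schema.
class Scale(inputCol: String, outputCol: String, factor: Double)
    extends RowTransformer {
  def transform(in: StructuredRow): StructuredRow = {
    val value = in.row.getDouble(in.schema.fieldIndex(inputCol))
    StructuredRow(
      StructType(in.schema.fields :+
        StructField(outputCol, DoubleType, nullable = false)),
      Row.fromSeq(in.row.toSeq :+ value * factor))
  }
}

// Plays the role of a fitted pipeline: stages are applied in order.
class RowPipeline(stages: Seq[RowTransformer]) extends RowTransformer {
  def transform(in: StructuredRow): StructuredRow =
    stages.foldLeft(in)((current, stage) => stage.transform(current))
}

object ServeOneRequest extends App {
  val schema = StructType(Seq(StructField("x", DoubleType, nullable = false)))
  val pipeline = new RowPipeline(Seq(
    new Scale("x", "x2", 2.0),
    new Scale("x2", "x4", 2.0)))

  // One incoming request, converted to a StructuredRow and transformed
  // without creating a DataFrame.
  val result = pipeline.transform(StructuredRow(schema, Row(3.0)))
  println(result.schema.fieldNames.mkString(", "))
  println(result.row)
}
```

Each stage receives the output schema of the previous one, which mirrors how 
DataFrame-based stages see the columns added earlier in the pipeline.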

  was:
Currently ML provides excellent support for feature manipulation, model 
selection, and prediction for large datasets. The models can all be easily 
serialized but currently it is not possible to use the fitted models without a 
DataFrame. This means that these models are only good for batch processing. In 
order to support realtime ML pipelines, I propose adding in three new methods 
to the Transformer class:

def transform(row: StructuredRow): StructuredRow
def transform(row: StructuredRow, paramMap: ParamMap): StructuredRow
def transform(row: StructuredRow, firstParamPair: ParamPair[_], 
otherParamPairs: ParamPair[_]*): StructuredRow

Where a StructuredRow is a case class that is the combination of an 
org.apache.spark.sql.Row and an org.apache.spark.sql.types.StructType

This change necessitates the addition of the new transform method to each 
implementor of the Transformer class.

Following this change, it would be trivial to include the spark jars in an API 
server, deserialize an ML PipelineModel object, take incoming data from users, 
convert it into a StructuredRow and feed it into the PipelineModel to get a 
realtime result.


> Add in support for realtime data predictions using ML PipelineModel
> -------------------------------------------------------------------
>
>                 Key: SPARK-9084
>                 URL: https://issues.apache.org/jira/browse/SPARK-9084
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Hollin Wilkins
>            Priority: Critical
>
> Spark ML currently provides excellent support for feature manipulation, 
> model selection, and prediction on large datasets. The models can all be 
> easily serialized, but a fitted model cannot be used without a DataFrame, 
> which means these models are only suited to batch processing. To support 
> realtime ML pipelines, I propose adding three new methods to the 
> Transformer class:
>
> def transform(row: StructuredRow): StructuredRow
> def transform(row: StructuredRow, paramMap: ParamMap): StructuredRow
> def transform(row: StructuredRow, firstParamPair: ParamPair[_], 
> otherParamPairs: ParamPair[_]*): StructuredRow
>
> Here, a StructuredRow is a case class that combines an 
> org.apache.spark.sql.Row with an org.apache.spark.sql.types.StructType. An 
> alternative would be to change the transform method signature to take two 
> objects, a StructType and a Row.
>
> This change requires adding the new transform methods to each implementor 
> of the Transformer class.
>
> Following this change, it would be straightforward to include the Spark 
> JARs in an API server, deserialize an ML PipelineModel object, take 
> incoming data from users, convert it into a StructuredRow, and feed it into 
> the PipelineModel to get a realtime result.



