GitHub user husseinhazimeh opened a pull request:
https://github.com/apache/spark/pull/14101
[SPARK-16431] [ML] Add a unified method that accepts single instances to
feature transformers and predictors
## What changes were proposed in this pull request?
Current feature transformers in spark.ml can only operate on DataFrames and
don't have a method that accepts single instances. A typical transformer defines
a user-defined function (udf) in its `transform` method that applies a set of
operations to the features of a single instance:
```
val columnOperation = udf { features: FeaturesType => /* operations on a single instance */ }
```
Adding a new method, `transformInstance`, that operates directly on single
instances, and calling it from the udf instead, can be useful:
```
def transformInstance(features: FeaturesType): OutputType = {
  /* operations on a single instance */
}
val columnOperation = udf { transformInstance _ }
```
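As a concrete illustration of the pattern, here is a minimal, self-contained
sketch (`UnitNormalizerSketch` is a hypothetical transformer used only for
illustration, not one of the classes touched by this patch):
```
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical transformer showing the proposed split: the per-instance logic
// lives in transformInstance, and transform only wraps it in a udf.
class UnitNormalizerSketch(val inputCol: String, val outputCol: String) {

  // Pure per-instance operation: scale a feature vector to unit L2 norm.
  def transformInstance(features: Vector): Vector = {
    val values = features.toArray
    val norm = math.sqrt(values.map(v => v * v).sum)
    if (norm == 0.0) features else Vectors.dense(values.map(_ / norm))
  }

  // The DataFrame path reuses exactly the same per-instance logic inside a udf.
  def transform(dataset: DataFrame): DataFrame = {
    val normalize = udf { features: Vector => transformInstance(features) }
    dataset.withColumn(outputCol, normalize(col(inputCol)))
  }
}
```
Because both paths share the same per-instance code, a caller can invoke
`transformInstance` directly for low-latency use while batch jobs keep using
`transform`.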
Predictors also don't have a public method that makes predictions on single
instances. `transformInstance` can be easily added to predictors as a thin
wrapper around the internal `predict` method (which takes features as input).
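A minimal sketch of what that wrapper looks like, using a simplified stand-in
for Spark's `PredictionModel` (the class below is illustrative only, not the
exact code in the patch):
```
// Illustrative stand-in for a prediction model; not an actual Spark class.
abstract class PredictionModelSketch[FeaturesType] {

  // Prediction models already have an internal method that maps features to a
  // prediction; in Spark it is the protected predict method.
  protected def predict(features: FeaturesType): Double

  // Proposed public single-instance entry point: a thin wrapper around predict.
  def transformInstance(features: FeaturesType): Double = predict(features)
}
```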
Note: the proposed method in this change is added to all predictors and
feature transformers except OneHotEncoder, VectorSlicer, and Word2Vec, which
might require bigger changes due to dependencies on the dataset's schema (they
can be handled with simple workarounds, but this needs to be discussed).
## Benefits
1. Providing a low-latency transformation/prediction method to support
machine learning applications that require real-time predictions. The current
`transform` method has relatively high latency when transforming single
instances or small batches because of the overhead introduced by DataFrame
operations. I measured the latency required to classify a single instance from
the 20 Newsgroups dataset using the current `transform` method and the proposed
`transformInstance`. The ML pipeline contains a tokenizer, stop-word remover,
TF hasher, IDF, scaler, and logistic regression. The table below shows the
latency percentiles in milliseconds over 700 classified documents.
Transformation Method | P50 | P90 | P99 | Max
--------------------- | --- | --- | --- | ---
transform | 31.44 | 39.43 | 67.75 | 126.97
transformInstance | 0.16 | 0.38 | 1.16 | 3.2
`transformInstance` is roughly 200 times faster at the median and can classify
a document in under a millisecond. Profiling `transform` shows that every
transformer in the pipeline spends about 5 milliseconds on average on
DataFrame-related operations when transforming a single instance, which implies
that the latency grows linearly with the pipeline size and can become
problematic. (A sketch of one way to collect such per-document timings appears
after this list.)
2. Increasing code readability and allowing easier debugging, since the
per-row operations are now factored into a function that can be tested
independently of the higher-level `transform` method.
3. Adding flexibility to create new models: for example, check this
[comment](https://github.com/apache/spark/pull/8883#issuecomment-215559305) on
supporting new ensemble methods.
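For reference, here is a rough sketch of how such per-document latencies can be
collected for the DataFrame path (the helper names and the single-row-DataFrame
setup are assumptions for illustration, not the exact benchmark code behind the
table above):
```
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame

// Time a single call in milliseconds.
def timeMs[T](block: => T): Double = {
  val start = System.nanoTime()
  block
  (System.nanoTime() - start) / 1e6
}

// Simple percentile over a collection of per-document latencies.
def percentile(times: Seq[Double], p: Double): Double = {
  val sorted = times.sorted
  sorted(math.min(sorted.length - 1, (p * sorted.length).toInt))
}

// DataFrame path: run the fitted pipeline on one single-row DataFrame per document.
def transformLatencies(model: PipelineModel, singleRowDocs: Seq[DataFrame]): Seq[Double] =
  singleRowDocs.map(doc => timeMs(model.transform(doc).collect()))

// Example report (fittedPipeline and docs are assumed to exist):
// val t = transformLatencies(fittedPipeline, docs)
// println((percentile(t, 0.50), percentile(t, 0.90), percentile(t, 0.99), t.max))
```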
## How was this patch tested?
The existing tests for transformers and predictors, which now exercise
`transformInstance` internally through `transform`, all pass.
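A direct unit-style test of the new entry point could look like the sketch
below (the `transformInstance` signature on `Tokenizer` is assumed from the
pattern above and may differ from the actual patch):
```
import org.apache.spark.ml.feature.Tokenizer

// Sketch only: exercises the proposed single-instance path without a DataFrame.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

// Assumed signature: String => Seq[String], mirroring Tokenizer's column types.
// assert(tokenizer.transformInstance("Logistic regression in Spark") ==
//   Seq("logistic", "regression", "in", "spark"))
```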
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/husseinhazimeh/spark lowlatency
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14101.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14101
----
commit e8b3de1e599225fa71fecc17aaa34998863fb38b
Author: Hussein Hazimeh <[email protected]>
Date: 2016-07-07T20:50:22Z
Add transformInstance method to predictors and transformers
commit ca213e338bde7da2e308b2ffd9c3fa1b5d26122e
Author: Hussein Hazimeh <[email protected]>
Date: 2016-07-07T21:03:46Z
Update LogisticRegression.scala
commit 1fe5b18a0519d324ed53108ddd809a421a811f50
Author: Hussein Hazimeh <[email protected]>
Date: 2016-07-07T21:21:45Z
Update HashingTF.scala
----