Very nice!

Can we use this to build the text analytics pipeline Nethaji is working on?
It could then serve as the PoC.

Thanks
Srinath

On Wed, Sep 30, 2015 at 10:27 AM, Nirmal Fernando <[email protected]> wrote:

> spark.apache.org/docs/latest/ml-guide.html#pipeline [1]
>
> This would let us plug in any Spark transformation easily and dynamically,
> without needing to know the RDD types.
>
> The same could be used to build ensemble support.
>
> [1]
> Pipeline
>
> In machine learning, it is common to run a sequence of algorithms to
> process and learn from data. E.g., a simple text document processing
> workflow might include several stages:
>
>    - Split each document’s text into words.
>    - Convert each document’s words into a numerical feature vector.
>    - Learn a prediction model using the feature vectors and labels.
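
A minimal plain-Python sketch of the first two steps listed above: splitting
text into words, then turning the words into a fixed-length count vector with
the hashing trick. This is not Spark code. The names tokenize and hashing_tf
and the value num_features=16 are made up for illustration, and CRC32 is only
a stand-in for Spark's own hash function.

```python
import zlib


def tokenize(text):
    """Split a document into lowercase words on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Map words to a fixed-length term-frequency vector (hashing trick).

    Each word is hashed to a bucket index; the bucket counts occurrences.
    Different words may collide in the same bucket.
    """
    vec = [0] * num_features
    for word in words:
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(word.encode("utf-8")) % num_features
        vec[idx] += 1
    return vec


words = tokenize("Spark ML makes pipelines simple")
features = hashing_tf(words)
```

The vector length is fixed up front, so no vocabulary has to be built before
featurizing. The third step (learning a model) would consume these vectors
together with the labels.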
>
> Spark ML represents such a workflow as a Pipeline, which consists of a
> sequence of PipelineStages (Transformers and Estimators) to be run in a
> specific order. We will use this simple workflow as a running example in
> this section.
>
> How it works <http://spark.apache.org/docs/latest/ml-guide.html#how-it-works>
>
> A Pipeline is specified as a sequence of stages, and each stage is either
> a Transformer or an Estimator. These stages are run in order, and the
> input DataFrame is transformed as it passes through each stage. For
> Transformer stages, the transform() method is called on the DataFrame.
> For Estimator stages, the fit() method is called to produce a Transformer
> (which becomes part of the PipelineModel, or fitted Pipeline), and that
> Transformer’s transform() method is called on the DataFrame.
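
The fit/transform flow described above can be sketched in plain Python as
follows. This is a toy model of the pattern, not the Spark ML API: rows are
plain dicts instead of a DataFrame, and WordCounter, ThresholdEstimator and
ThresholdModel are hypothetical stages invented for this example. As a
simplification, this sketch also transforms the data after the final stage,
which a real implementation may skip.

```python
class Transformer:
    """A stage that maps rows to rows."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """A stage that learns from rows and returns a Transformer."""
    def fit(self, rows):
        raise NotImplementedError


class Tokenizer(Transformer):
    def transform(self, rows):
        # Add a "words" column derived from the "text" column.
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class WordCounter(Transformer):  # hypothetical stage
    def transform(self, rows):
        return [dict(r, n_words=len(r["words"])) for r in rows]


class ThresholdModel(Transformer):  # hypothetical fitted model
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(r, prediction=1.0 if r["n_words"] > self.threshold else 0.0)
                for r in rows]


class ThresholdEstimator(Estimator):  # hypothetical estimator
    def fit(self, rows):
        # "Learn" the mean word count and use it as the decision threshold.
        mean = sum(r["n_words"] for r in rows) / len(rows)
        return ThresholdModel(mean)


class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # Estimator -> fitted Transformer
            fitted.append(stage)
            rows = stage.transform(rows)  # feed the output to the next stage
        return PipelineModel(fitted)


train = [{"text": "a b c d"}, {"text": "a"}]
model = Pipeline([Tokenizer(), WordCounter(), ThresholdEstimator()]).fit(train)
test_out = model.transform([{"text": "x y z"}])
```

Because the fitted PipelineModel holds only Transformers, the same stages
that processed the training rows are replayed unchanged on new rows.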
>
> We illustrate this for the simple text document workflow. The figure below
> is for the *training time* usage of a Pipeline.
>
> [image: Spark ML Pipeline Example]
>
> Above, the top row represents a Pipeline with three stages. The first two
> (Tokenizer and HashingTF) are Transformers (blue), and the third
> (LogisticRegression) is an Estimator (red). The bottom row represents data
> flowing through the pipeline, where cylinders indicate DataFrames. The
> Pipeline.fit() method is called on the original DataFrame, which has raw
> text documents and labels. The Tokenizer.transform() method splits the
> raw text documents into words, adding a new column with words to the
> DataFrame. The HashingTF.transform() method converts the words column
> into feature vectors, adding a new column with those vectors to the
> DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline
> first calls LogisticRegression.fit() to produce a LogisticRegressionModel.
> If the Pipeline had more stages, it would call the LogisticRegressionModel’s
> transform() method on the DataFrame before passing the DataFrame to the
> next stage.
>
> A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs,
> it produces a PipelineModel, which is a Transformer. This PipelineModel
> is used at *test time*; the figure below illustrates this usage.
>
> [image: Spark ML PipelineModel Example]
>
> In the figure above, the PipelineModel has the same number of stages as
> the original Pipeline, but all Estimators in the original Pipeline have
> become Transformers. When the PipelineModel’s transform() method is
> called on a test dataset, the data are passed through the fitted pipeline
> in order. Each stage’s transform() method updates the dataset and passes
> it to the next stage.
>
> Pipelines and PipelineModels help to ensure that training and test data
> go through identical feature processing steps.
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
