Very nice! Can we do Text analytics pipeline Nethaji is building using this? So it can serve as the PoC?
Thanks Srinath On Wed, Sep 30, 2015 at 10:27 AM, Nirmal Fernando <[email protected]> wrote: > spark.apache.org/docs/latest/ml-guide.html#pipeline [1] > > This would make us plug any Spark transformation easily and dynamically > without the need of knowing RDD types. > > Same could be used to build ensemble support. > > [1] > Pipeline > > In machine learning, it is common to run a sequence of algorithms to > process and learn from data. E.g., a simple text document processing > workflow might include several stages: > > - Split each document’s text into words. > - Convert each document’s words into a numerical feature vector. > - Learn a prediction model using the feature vectors and labels. > > Spark ML represents such a workflow as a Pipeline, which consists of a > sequence of PipelineStages (Transformers and Estimators) to be run in a > specific order. We will use this simple workflow as a running example in > this section. > <http://spark.apache.org/docs/latest/ml-guide.html#how-it-works>How it > works > > A Pipeline is specified as a sequence of stages, and each stage is either > a Transformer or an Estimator. These stages are run in order, and the > input DataFrame is transformed as it passes through each stage. For > Transformer stages, the transform() method is called on the DataFrame. > For Estimator stages, the fit() method is called to produce a Transformer > (which becomes part of the PipelineModel, or fitted Pipeline), and that > Transformer’s transform() method is called on the DataFrame. > > We illustrate this for the simple text document workflow. The figure below > is for the *training time* usage of a Pipeline. > > [image: Spark ML Pipeline Example] > > Above, the top row represents a Pipeline with three stages. The first two > (Tokenizer and HashingTF) are Transformers (blue), and the third ( > LogisticRegression) is an Estimator (red). The bottom row represents data > flowing through the pipeline, where cylinders indicate DataFrames. The > Pipeline.fit() method is called on the original DataFrame, which has raw > text documents and labels. The Tokenizer.transform() method splits the > raw text documents into words, adding a new column with words to the > DataFrame. The HashingTF.transform() method converts the words column > into feature vectors, adding a new column with those vectors to the > DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline > first calls LogisticRegression.fit() to produce a LogisticRegressionModel. > If the Pipeline had more stages, it would call the LogisticRegressionModel’s > transform() method on the DataFrame before passing the DataFrame to the > next stage. > > A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs, > it produces a PipelineModel, which is a Transformer. This PipelineModel > is used at *test time*; the figure below illustrates this usage. > > [image: Spark ML PipelineModel Example] > > In the figure above, the PipelineModel has the same number of stages as > the original Pipeline, but all Estimators in the original Pipeline have > become Transformers. When the PipelineModel’s transform() method is > called on a test dataset, the data are passed through the fitted pipeline > in order. Each stage’s transform() method updates the dataset and passes > it to the next stage. > > Pipelines and PipelineModels help to ensure that training and test data > go through identical feature processing steps. > > -- > > Thanks & regards, > Nirmal > > Team Lead - WSO2 Machine Learner > Associate Technical Lead - Data Technologies Team, WSO2 Inc. > Mobile: +94715779733 > Blog: http://nirmalfdo.blogspot.com/ > > > -- ============================ Blog: http://srinathsview.blogspot.com twitter:@srinath_perera Site: http://people.apache.org/~hemapani/ Photos: http://www.flickr.com/photos/hemapani/ Phone: 0772360902
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
