incubator-predictionio#d8ee0c8ffdd27d3f2bbe9560b229bc36ee966f9d

git-site-role Thu, 05 Oct 2017 22:30:15 -0700

http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/3897c890/demo/textclassification/index.html
----------------------------------------------------------------------
diff --git a/demo/textclassification/index.html 
b/demo/textclassification/index.html
new file mode 100644
index 0000000..daf64d7
--- /dev/null
+++ b/demo/textclassification/index.html
@@ -0,0 +1,555 @@
+<!DOCTYPE html><html><head><title>Text Classification Engine 
Tutorial</title><meta charset="utf-8"/><meta content="IE=edge,chrome=1" 
http-equiv="X-UA-Compatible"/><meta name="viewport" 
content="width=device-width, initial-scale=1.0"/><meta class="swiftype" 
name="title" data-type="string" content="Text Classification Engine 
Tutorial"/><link rel="canonical" 
href="https://predictionio.incubator.apache.org/demo/textclassification/"/><link
 href="/images/favicon/normal-b330020a.png" rel="shortcut icon"/><link 
href="/images/favicon/apple-c0febcf2.png" rel="apple-touch-icon"/><link 
href="//fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800"
 rel="stylesheet"/><link 
href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" 
rel="stylesheet"/><link href="/stylesheets/application-3a3867f7.css" 
rel="stylesheet" type="text/css"/><script 
src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.2/html5shiv.min.js"></script><script 
src="//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script
 src="//use.typekit.net/pqo0itb.js"></script><script>try{Typekit.load({ async: 
true });}catch(e){}</script></head><body><div id="global"><header><div 
class="container" id="header-wrapper"><div class="row"><div 
class="col-sm-12"><div id="logo-wrapper"><span id="drawer-toggle"></span><a 
href="#"></a><a href="http://predictionio.incubator.apache.org/"><img 
alt="PredictionIO" id="logo" 
src="/images/logos/logo-ee2b9bb3.png"/></a></div><div id="menu-wrapper"><div 
id="pill-wrapper"><a class="pill left" 
href="/gallery/template-gallery">TEMPLATES</a> <a class="pill right" 
href="//github.com/apache/incubator-predictionio/">OPEN 
SOURCE</a></div></div><img class="mobile-search-bar-toggler hidden-md 
hidden-lg" 
src="/images/icons/search-glass-704bd4ff.png"/></div></div></div></header><div 
id="search-bar-row-wrapper"><div class="container-fluid" 
id="search-bar-row"><div class="row"><div 
class="col-md-9 col-sm-11 col-xs-11"><div class="hidden-md hidden-lg" 
id="mobile-page-heading-wrapper"><p>PredictionIO Docs</p><h4>Text 
Classification Engine Tutorial</h4></div><h4 class="hidden-sm 
hidden-xs">PredictionIO Docs</h4></div><div class="col-md-3 col-sm-1 col-xs-1 
hidden-md hidden-lg"><img id="left-menu-indicator" 
src="/images/icons/down-arrow-dfe9f7fe.png"/></div><div class="col-md-3 
col-sm-12 col-xs-12 swiftype-wrapper"><div class="swiftype"><form 
class="search-form"><img class="search-box-toggler hidden-xs hidden-sm" 
src="/images/icons/search-glass-704bd4ff.png"/><div class="search-box"><img 
src="/images/icons/search-glass-704bd4ff.png"/><input type="text" 
id="st-search-input" class="st-search-input" placeholder="Search 
Doc..."/></div><img class="swiftype-row-hider hidden-md hidden-lg" 
src="/images/icons/drawer-toggle-active-fcbef12a.png"/></form></div></div><div 
class="mobile-left-menu-toggler hidden-md 
hidden-lg"></div></div></div></div><div id="page" class="container-fluid"><div 
class="row"><div id="left-menu-wrapper" class="col-md-3"><nav 
id="nav-main"><ul><li class="level-1"><a class="expandible" 
href="/"><span>Apache PredictionIO (incubating) Documentation</span></a><ul><li 
class="level-2"><a class="final" href="/"><span>Welcome to Apache PredictionIO 
(incubating)</span></a></li></ul></li><li class="level-1"><a class="expandible" 
href="#"><span>Getting Started</span></a><ul><li class="level-2"><a 
class="final" href="/start/"><span>A Quick Intro</span></a></li><li 
class="level-2"><a class="final" href="/install/"><span>Installing Apache 
PredictionIO (incubating)</span></a></li><li class="level-2"><a class="final" 
href="/start/download/"><span>Downloading an Engine Template</span></a></li><li 
class="level-2"><a class="final" href="/start/deploy/"><span>Deploying Your 
First Engine</span></a></li><li class="level-2"><a class="final" 
href="/start/customize/"><span>Customizing the 
Engine</span></a></li></ul></li><li class="level-1"><a class="expandible" 
href="#"><span>Integrating with Your App</span></a><ul><li class="level-2"><a 
class="final" href="/appintegration/"><span>App Integration 
Overview</span></a></li><li class="level-2"><a class="expandible" 
href="/sdk/"><span>List of SDKs</span></a><ul><li class="level-3"><a 
class="final" href="/sdk/java/"><span>Java & Android SDK</span></a></li><li 
class="level-3"><a class="final" href="/sdk/php/"><span>PHP 
SDK</span></a></li><li class="level-3"><a class="final" 
href="/sdk/python/"><span>Python SDK</span></a></li><li class="level-3"><a 
class="final" href="/sdk/ruby/"><span>Ruby SDK</span></a></li><li 
class="level-3"><a class="final" href="/sdk/community/"><span>Community Powered 
SDKs</span></a></li></ul></li></ul></li><li class="level-1"><a 
class="expandible" href="#"><span>Deploying an Engine</span></a><ul><li 
class="level-2"><a class="final" href="/deploy/"><span>Deploying as a Web 
Service</span></a></li><li class="level-2"><a class="final" 
href="/batchpredict/"><span>Batch Predictions
 </span></a></li><li class="level-2"><a class="final" 
href="/deploy/monitoring/"><span>Monitoring Engine</span></a></li><li 
class="level-2"><a class="final" href="/deploy/engineparams/"><span>Setting 
Engine Parameters</span></a></li><li class="level-2"><a class="final" 
href="/deploy/enginevariants/"><span>Deploying Multiple Engine 
Variants</span></a></li><li class="level-2"><a class="final" 
href="/deploy/plugin/"><span>Engine Server Plugin</span></a></li></ul></li><li 
class="level-1"><a class="expandible" href="#"><span>Customizing an 
Engine</span></a><ul><li class="level-2"><a class="final" 
href="/customize/"><span>Learning DASE</span></a></li><li class="level-2"><a 
class="final" href="/customize/dase/"><span>Implement DASE</span></a></li><li 
class="level-2"><a class="final" 
href="/customize/troubleshooting/"><span>Troubleshooting Engine 
Development</span></a></li><li class="level-2"><a class="final" 
href="/api/current/#package"><span>Engine Scala 
APIs</span></a></li></ul></li><li 
class="level-1"><a class="expandible" href="#"><span>Collecting and Analyzing 
Data</span></a><ul><li class="level-2"><a class="final" 
href="/datacollection/"><span>Event Server Overview</span></a></li><li 
class="level-2"><a class="final" 
href="/datacollection/eventapi/"><span>Collecting Data with 
REST/SDKs</span></a></li><li class="level-2"><a class="final" 
href="/datacollection/eventmodel/"><span>Events Modeling</span></a></li><li 
class="level-2"><a class="final" 
href="/datacollection/webhooks/"><span>Unifying Multichannel Data with 
Webhooks</span></a></li><li class="level-2"><a class="final" 
href="/datacollection/channel/"><span>Channel</span></a></li><li 
class="level-2"><a class="final" 
href="/datacollection/batchimport/"><span>Importing Data in 
Batch</span></a></li><li class="level-2"><a class="final" 
href="/datacollection/analytics/"><span>Using Analytics 
Tools</span></a></li><li class="level-2"><a class="final" 
href="/datacollection/plugin/"><span>Event Server Plugin</span></a>
 </li></ul></li><li class="level-1"><a class="expandible" 
href="#"><span>Choosing an Algorithm(s)</span></a><ul><li class="level-2"><a 
class="final" href="/algorithm/"><span>Built-in Algorithm 
Libraries</span></a></li><li class="level-2"><a class="final" 
href="/algorithm/switch/"><span>Switching to Another 
Algorithm</span></a></li><li class="level-2"><a class="final" 
href="/algorithm/multiple/"><span>Combining Multiple 
Algorithms</span></a></li><li class="level-2"><a class="final" 
href="/algorithm/custom/"><span>Adding Your Own 
Algorithms</span></a></li></ul></li><li class="level-1"><a class="expandible" 
href="#"><span>ML Tuning and Evaluation</span></a><ul><li class="level-2"><a 
class="final" href="/evaluation/"><span>Overview</span></a></li><li 
class="level-2"><a class="final" 
href="/evaluation/paramtuning/"><span>Hyperparameter Tuning</span></a></li><li 
class="level-2"><a class="final" 
href="/evaluation/evaluationdashboard/"><span>Evaluation 
Dashboard</span></a></li><li 
class="level-2"><a class="final" href="/evaluation/metricchoose/"><span>Choosing 
Evaluation Metrics</span></a></li><li class="level-2"><a class="final" 
href="/evaluation/metricbuild/"><span>Building Evaluation 
Metrics</span></a></li></ul></li><li class="level-1"><a class="expandible" 
href="#"><span>System Architecture</span></a><ul><li class="level-2"><a 
class="final" href="/system/"><span>Architecture Overview</span></a></li><li 
class="level-2"><a class="final" href="/system/anotherdatastore/"><span>Using 
Another Data Store</span></a></li></ul></li><li class="level-1"><a 
class="expandible" href="#"><span>PredictionIO Official 
Templates</span></a><ul><li class="level-2"><a class="final" 
href="/templates/"><span>Intro</span></a></li><li class="level-2"><a 
class="expandible" href="#"><span>Recommendation</span></a><ul><li 
class="level-3"><a class="final" 
href="/templates/recommendation/quickstart/"><span>Quick 
Start</span></a></li><li class="level-3"><a class="final" 
href="/templates/recommendation/dase/"><span>DASE</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/recommendation/evaluation/"><span>Evaluation 
Explained</span></a></li><li class="level-3"><a class="final" 
href="/templates/recommendation/how-to/"><span>How-To</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/recommendation/reading-custom-events/"><span>Read Custom 
Events</span></a></li><li class="level-3"><a class="final" 
href="/templates/recommendation/customize-data-prep/"><span>Customize Data 
Preparator</span></a></li><li class="level-3"><a class="final" 
href="/templates/recommendation/customize-serving/"><span>Customize 
Serving</span></a></li><li class="level-3"><a class="final" 
href="/templates/recommendation/training-with-implicit-preference/"><span>Train 
with Implicit Preference</span></a></li><li class="level-3"><a class="final" 
href="/templates/recommendation/blacklist-items/"><span>Filter Recommended 
Items by Blacklist in Query</span></a></li><li class="level-3"><a 
class="final" 
href="/templates/recommendation/batch-evaluator/"><span>Batch Persistable 
Evaluator</span></a></li></ul></li><li class="level-2"><a class="expandible" 
href="#"><span>E-Commerce Recommendation</span></a><ul><li class="level-3"><a 
class="final" href="/templates/ecommercerecommendation/quickstart/"><span>Quick 
Start</span></a></li><li class="level-3"><a class="final" 
href="/templates/ecommercerecommendation/dase/"><span>DASE</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/ecommercerecommendation/how-to/"><span>How-To</span></a></li><li
 class="level-3"><a class="final" 
href="/templates/ecommercerecommendation/train-with-rate-event/"><span>Train 
with Rate Event</span></a></li><li class="level-3"><a class="final" 
href="/templates/ecommercerecommendation/adjust-score/"><span>Adjust 
Score</span></a></li></ul></li><li class="level-2"><a class="expandible" 
href="#"><span>Similar Product</span></a><ul><li class="level-3"><a 
class="final" 
href="/templates/similarproduct/quickstart/"><span>Quick Start</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/similarproduct/dase/"><span>DASE</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/similarproduct/how-to/"><span>How-To</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/similarproduct/multi-events-multi-algos/"><span>Multiple 
Events and Multiple Algorithms</span></a></li><li class="level-3"><a 
class="final" 
href="/templates/similarproduct/return-item-properties/"><span>Returns Item 
Properties</span></a></li><li class="level-3"><a class="final" 
href="/templates/similarproduct/train-with-rate-event/"><span>Train with Rate 
Event</span></a></li><li class="level-3"><a class="final" 
href="/templates/similarproduct/rid-user-set-event/"><span>Get Rid of Events 
for Users</span></a></li><li class="level-3"><a class="final" 
href="/templates/similarproduct/recommended-user/"><span>Recommend 
Users</span></a></li></ul></li><li class="level-2"><a 
class="expandible" href="#"><span>Classification</span></a><ul><li 
class="level-3"><a class="final" 
href="/templates/classification/quickstart/"><span>Quick 
Start</span></a></li><li class="level-3"><a class="final" 
href="/templates/classification/dase/"><span>DASE</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/classification/how-to/"><span>How-To</span></a></li><li 
class="level-3"><a class="final" 
href="/templates/classification/add-algorithm/"><span>Use Alternative 
Algorithm</span></a></li><li class="level-3"><a class="final" 
href="/templates/classification/reading-custom-properties/"><span>Read Custom 
Properties</span></a></li></ul></li></ul></li><li class="level-1"><a 
class="expandible" href="#"><span>Engine Template Gallery</span></a><ul><li 
class="level-2"><a class="final" 
href="/gallery/template-gallery/"><span>Browse</span></a></li><li 
class="level-2"><a class="final" 
href="/community/submit-template/"><span>Submit your Engine as a 
Template</span></a></li></ul></li><li class="level-1"><a class="expandible" 
href="#"><span>Demo 
Tutorials</span></a><ul><li class="level-2"><a class="final" 
href="/demo/tapster/"><span>Comics Recommendation Demo</span></a></li><li 
class="level-2"><a class="final" href="/demo/community/"><span>Community 
Contributed Demo</span></a></li><li class="level-2"><a class="final active" 
href="/demo/textclassification/"><span>Text Classification Engine 
Tutorial</span></a></li></ul></li><li class="level-1"><a class="expandible" 
href="/community/"><span>Getting Involved</span></a><ul><li class="level-2"><a 
class="final" href="/community/contribute-code/"><span>Contribute 
Code</span></a></li><li class="level-2"><a class="final" 
href="/community/contribute-documentation/"><span>Contribute 
Documentation</span></a></li><li class="level-2"><a class="final" 
href="/community/contribute-sdk/"><span>Contribute a SDK</span></a></li><li 
class="level-2"><a class="final" 
href="/community/contribute-webhook/"><span>Contribute a 
 Webhook</span></a></li><li class="level-2"><a class="final" 
href="/community/projects/"><span>Community 
Projects</span></a></li></ul></li><li class="level-1"><a class="expandible" 
href="#"><span>Getting Help</span></a><ul><li class="level-2"><a class="final" 
href="/resources/faq/"><span>FAQs</span></a></li><li class="level-2"><a 
class="final" href="/support/"><span>Support</span></a></li></ul></li><li 
class="level-1"><a class="expandible" 
href="#"><span>Resources</span></a><ul><li class="level-2"><a class="final" 
href="/cli/"><span>Command-line Interface</span></a></li><li class="level-2"><a 
class="final" href="/resources/release/"><span>Release 
Cadence</span></a></li><li class="level-2"><a class="final" 
href="/resources/intellij/"><span>Developing Engines with IntelliJ 
IDEA</span></a></li><li class="level-2"><a class="final" 
href="/resources/upgrade/"><span>Upgrade Instructions</span></a></li><li 
class="level-2"><a class="final" 
href="/resources/glossary/"><span>Glossary</span></a>
 </li></ul></li><li class="level-1"><a class="expandible" href="#"><span>Apache 
Software Foundation</span></a><ul><li class="level-2"><a class="final" 
href="https://www.apache.org/"><span>Apache Homepage</span></a></li><li 
class="level-2"><a class="final" 
href="https://www.apache.org/licenses/"><span>License</span></a></li><li 
class="level-2"><a class="final" 
href="https://www.apache.org/foundation/sponsorship.html"><span>Sponsorship</span></a></li><li
 class="level-2"><a class="final" 
href="https://www.apache.org/foundation/thanks.html"><span>Thanks</span></a></li><li
 class="level-2"><a class="final" 
href="https://www.apache.org/security/"><span>Security</span></a></li></ul></li></ul></nav></div><div
 class="col-md-9 col-sm-12"><div class="content-header hidden-md 
hidden-lg"><div id="breadcrumbs" class="hidden-sm hidden xs"><ul><li><a 
href="#">Demo Tutorials</a><span class="spacer">&gt;</span></li><li><span 
class="last">Text Classification Engine Tutorial</span></li></ul></div><div id
 ="page-title"><h1>Text Classification Engine Tutorial</h1></div></div><div 
id="table-of-content-wrapper"><h5>On this page</h5><aside 
id="table-of-contents"><ul> <li> <a href="#introduction">Introduction</a> </li> 
<li> <a href="#prerequisites">Prerequisites</a> </li> <li> <a 
href="#engine-overview">Engine Overview</a> </li> <li> <a 
href="#quick-start">Quick Start</a> </li> </ul> </li> <li> <a 
href="#detailed-explanation-of-dase">Detailed Explanation of DASE</a> <ul> <li> 
<a href="#importing-data">Importing Data</a> </li> <li> <a 
href="#data-source-reading-event-data">Data Source: Reading Event Data</a> 
</li> <li> <a href="#preparator-data-processing-with-dase">Preparator : Data 
Processing With DASE</a> </li> <li> <a href="#algorithm-component">Algorithm 
Component</a> </li> <li> <a 
href="#serving-delivering-the-final-prediction">Serving: Delivering the Final 
Prediction</a> </li> <li> <a 
href="#evaluation-model-assessment-and-selection">Evaluation: Model Assessment 
and Selection</a> </li> <li> <a href="#engine-deployment">Engine Deployment</a> 
</li> </ul> 
</aside><hr/><a id="edit-page-link" 
href="https://github.com/apache/incubator-predictionio/tree/livedoc/docs/manual/source/demo/textclassification.html.md.erb"><img
 src="/images/icons/edit-pencil-d6c1bb3d.png"/>Edit this page</a></div><div 
class="content-header hidden-sm hidden-xs"><div id="breadcrumbs" 
class="hidden-sm hidden xs"><ul><li><a href="#">Demo Tutorials</a><span 
class="spacer">&gt;</span></li><li><span class="last">Text Classification 
Engine Tutorial</span></li></ul></div><div id="page-title"><h1>Text 
Classification Engine Tutorial</h1></div></div><div class="content"> 
<p>(Updated for Text Classification Template version 3.1)</p><h2 
id='introduction' class='header-anchors'>Introduction</h2><p>In the real world, 
there are many applications that collect text as data. For example, spam 
detectors take email and header content to automatically determine what is or 
is not spam; applications can gauge the general sentiment in a geographical 
area by analyzing Twitter data; and news articles can be automatically 
categorized based solely on the text content. There is a wide array of machine 
learning models you can use to 
create, or train, a predictive model to assign an incoming article, or query, 
to an existing category. Before you can use these techniques you must first 
transform the text data (in this case the set of news articles) into numeric 
vectors, or feature vectors, that can be used to train your model.</p><p>The 
purpose of this tutorial is to illustrate how you can go about doing this using 
PredictionIO&#39;s platform. The advantages of using this platform include: a 
dynamic engine that responds to queries in real-time; <a 
href="http://en.wikipedia.org/wiki/Separation_of_concerns">separation of 
concerns</a>, which offers code re-use and maintainability; and distributed 
computing capabilities for scalability and efficiency. Moreover, it is easy to 
incorporate non-trivial data modeling tasks into the DASE architecture, 
allowing data scientists to focus on tasks related to modeling. This tutorial 
illustrates some of these ideas by 
guiding you through PredictionIO&#39;s <a 
href="/gallery/template-gallery/#natural-language-processing">text 
classification template</a>.</p><h2 id='prerequisites' 
class='header-anchors'>Prerequisites</h2><p>Before getting started, please make 
sure that you have the latest version of Apache PredictionIO (incubating) <a 
href="http://predictionio.incubator.apache.org/install/">installed</a>. We 
emphasize here that this is an engine template written in 
<strong>Scala</strong> and can be more generally thought of as an SBT project 
containing all the necessary components.</p><p>You should also download the 
engine template named Text Classification Engine that accompanies this tutorial 
by cloning the template repository:</p><div class="highlight shell"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align:
  right"><pre class="lineno">1</pre></td><td class="code"><pre>git clone 
https://github.com/apache/incubator-predictionio-template-text-classifier.git 
&lt; Your new engine directory &gt;
+</pre></td></tr></tbody></table> </div> <h2 id='engine-overview' 
class='header-anchors'>Engine Overview</h2><p>The engine follows the DASE 
architecture which we briefly review here. As a user, you are tasked with 
collecting data for your website or application, and importing it into 
PredictionIO&#39;s Event Server. Once the data is in the server, it can be read 
and processed by the engine via the Data Source and Preparation components, 
respectively. The Algorithm component uses the processed, or prepared, data to 
train a set of predictive models. Once you have trained these models, you are 
ready to deploy your engine and respond to real-time queries via the Serving 
component which combines the results from different fitted models. The 
Evaluation component is used to compute an appropriate metric to test the 
performance of a fitted model, as well as aid in the tuning of model 
hyperparameters.</p><p>This engine template is meant to handle text 
classification, so you will be working with text data. A query, or newly 
observed document, will 
be of the form:</p><p><code>{text : String}</code>.</p><p>In the running 
example, a query would be an incoming news article. Once the engine is deployed 
it can process the query, and then return a Predicted Result of the 
form</p><p><code>{category : String, confidence : Double}</code>.</p><p>Here 
category is the model&#39;s class assignment for this new text document (i.e. 
the best guess for this article&#39;s categorization), and confidence is a 
value between 0 and 1 representing the confidence in the category prediction (0 
meaning no confidence in the prediction). The Actual Result is of the 
form</p><p><code>{category : String}</code>.</p><p>This is used in the 
evaluation stage when estimating the performance of your predictive model (how 
well does the model predict categories). Please refer to the <a 
href="https://predictionio.incubator.apache.org/customize/">following 
tutorial</a> for a more detailed explanation of how your engine will interact 
with your web application, as well as an in-depth overview of DASE.</p><h2 
id='quick-start' 
class='header-anchors'>Quick Start</h2><p>This is a quick start guide in case 
you want to start using the engine right away. Sample email data for spam 
classification will be used. For more detailed information, read the subsequent 
sections.</p><h3 id='1.-create-a-new-application.' class='header-anchors'>1. 
Create a new application.</h3><p>After the application is created, you will be 
given an access key and application ID for the application.</p><div 
class="highlight shell"><table style="border-spacing: 0"><tbody><tr><td 
class="gutter gl" style="text-align: right"><pre class="lineno">1</pre></td><td 
class="code"><pre><span class="gp">$ </span>pio app new MyTextApp
+</pre></td></tr></tbody></table> </div> <h3 id='2.-import-the-tutorial-data.' 
class='header-anchors'>2. Import the tutorial data.</h3><p>There are three 
different data sets available, each giving a different use case for this 
engine. Please refer to the <strong>Data Source: Reading Event Data</strong> 
section to see how to appropriately modify the <code>DataSource</code> class for 
use with each respective data set. The default data set is an e-mail spam data 
set.</p><p>These data sets have already been processed and are ready for <a 
href="/datacollection/batchimport/">batch import</a>. Replace <code>***</code> 
with your actual application ID.</p><div class="highlight shell"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1
+2
+3</pre></td><td class="code"><pre><span class="gp">$ </span>pio import --appid 
<span class="k">***</span> --input data/stopwords.json
+
+<span class="gp">$ </span>pio import --appid <span class="k">***</span> 
--input data/emails.json
+</pre></td></tr></tbody></table> </div> <h3 
id='3.-set-the-engine-parameters-in-the-file-<code>engine.json</code>.' 
class='header-anchors' >3. Set the engine parameters in the file 
<code>engine.json</code>.</h3><p>The default settings are shown below. By 
default, the engine uses the algorithm named &quot;lr&quot;, which is logistic 
regression. Please see the later sections for a more detailed explanation of 
the engine.json settings.</p><p>Make sure the &quot;appName&quot; is the same 
as the app you created in step 1.</p><div class="highlight shell"><table 
style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
+25
+26</pre></td><td class="code"><pre><span class="o">{</span>
+  <span class="s2">"id"</span>: <span class="s2">"default"</span>,
+  <span class="s2">"description"</span>: <span class="s2">"Default 
settings"</span>,
+  <span class="s2">"engineFactory"</span>: <span 
class="s2">"org.template.textclassification.TextClassificationEngine"</span>,
+  <span class="s2">"datasource"</span>: <span class="o">{</span>
+    <span class="s2">"params"</span>: <span class="o">{</span>
+      <span class="s2">"appName"</span>: <span class="s2">"MyTextApp"</span>
+    <span class="o">}</span>
+  <span class="o">}</span>,
+  <span class="s2">"preparator"</span>: <span class="o">{</span>
+    <span class="s2">"params"</span>: <span class="o">{</span>
+      <span class="s2">"nGram"</span>: 1,
+      <span class="s2">"numFeatures"</span>: 500,
+      <span class="s2">"SPPMI"</span>: <span class="nb">false</span>
+    <span class="o">}</span>
+  <span class="o">}</span>,
+  <span class="s2">"algorithms"</span>: <span class="o">[</span>
+    <span class="o">{</span>
+      <span class="s2">"name"</span>: <span class="s2">"lr"</span>,
+      <span class="s2">"params"</span>: <span class="o">{</span>
+        <span class="s2">"regParam"</span>: 0.00000005
+      <span class="o">}</span>
+    <span class="o">}</span>
+  <span class="o">]</span>
+<span class="o">}</span>
+
+</pre></td></tr></tbody></table> </div> <h3 id='4.-build-your-engine.' 
class='header-anchors'>4. Build your engine.</h3><div class="highlight 
shell"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" 
style="text-align: right"><pre class="lineno">1</pre></td><td 
class="code"><pre><span class="gp">$ </span>pio build --verbose
+</pre></td></tr></tbody></table> </div> <p>This command may take a few 
minutes the first time; all subsequent builds should take less than a minute. 
You can also run it without <code>--verbose</code> if you don&#39;t want to see 
all the log messages.</p><p>Upon a successful build, you should see a console 
message similar to the following:</p><div class="highlight shell"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1
+2</pre></td><td class="code"><pre><span class="o">[</span>INFO] <span 
class="o">[</span>RegisterEngine<span class="nv">$]</span> Registering engine 
6wxDy2hxLbvaMJra927ahFdQHDIVXeQz 266bae678c570dee58154b2338cef7aa1646e0d3
+<span class="o">[</span>INFO] <span class="o">[</span>Console<span 
class="nv">$]</span> Your engine is ready <span class="k">for </span>training.
+</pre></td></tr></tbody></table> </div> <h3 
id='5.a.-train-your-model-and-deploy.' class='header-anchors'>5.a. Train your 
model and deploy.</h3><div class="highlight shell"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1</pre></td><td class="code"><pre><span class="gp">$ 
</span>pio train
+</pre></td></tr></tbody></table> </div> <p>When your engine is trained 
successfully, you should see a console message similar to the 
following.</p><div class="highlight shell"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1</pre></td><td class="code"><pre><span class="o">[</span>INFO] 
<span class="o">[</span>CoreWorkflow<span class="nv">$]</span> Training 
completed successfully.
+</pre></td></tr></tbody></table> </div> <p>Now your engine is ready to deploy. 
Run:</p><div class="highlight shell"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1</pre></td><td class="code"><pre><span class="gp">$ </span>pio 
deploy
+</pre></td></tr></tbody></table> </div> <p>When the engine is deployed 
successfully and running, you should see a console message similar to the 
following:</p><div class="highlight shell"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1
+2</pre></td><td class="code"><pre><span class="o">[</span>INFO] <span 
class="o">[</span>HttpListener] Bound to /0.0.0.0:8000
+<span class="o">[</span>INFO] <span class="o">[</span>MasterActor] Engine is 
deployed and running. Engine API is live at http://0.0.0.0:8000.
+</pre></td></tr></tbody></table> </div> <p>Now you can send a query to the 
engine. Open another terminal and send the following HTTP request to the 
deployed engine:</p><div class="highlight shell"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1</pre></td><td class="code"><pre><span class="gp">$ </span>curl 
-H <span class="s2">"Content-Type: application/json"</span> -d <span 
class="s1">'{ "text":"I like speed and fast motorcycles." }'</span> 
http://localhost:8000/queries.json
+</pre></td></tr></tbody></table> </div> <p>You should see the following output 
returned by the engine:</p><div class="highlight shell"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1</pre></td><td class="code"><pre><span 
class="o">{</span><span class="s2">"category"</span>:<span class="s2">"not 
spam"</span>,<span class="s2">"confidence"</span>:0.852619510921587<span 
class="o">}</span>
+</pre></td></tr></tbody></table> </div> <p>Try another query:</p><div 
class="highlight shell"><table style="border-spacing: 0"><tbody><tr><td 
class="gutter gl" style="text-align: right"><pre class="lineno">1</pre></td><td 
class="code"><pre><span class="gp">$ </span>curl -H <span 
class="s2">"Content-Type: application/json"</span> -d <span class="s1">'{ 
"text":"Earn extra cash!" }'</span> http://localhost:8000/queries.json
+</pre></td></tr></tbody></table> </div> <p>You should see the following output 
returned by the engine:</p><div class="highlight shell"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1</pre></td><td class="code"><pre><span 
class="o">{</span><span class="s2">"category"</span>:<span 
class="s2">"spam"</span>,<span 
class="s2">"confidence"</span>:0.5268770133242983<span class="o">}</span>
+</pre></td></tr></tbody></table> </div> <h3 
id='5.b.evaluate-your-training-model-and-tune-parameters.' 
class='header-anchors'>5.b. Evaluate your training model and tune 
parameters.</h3><div class="highlight shell"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1</pre></td><td class="code"><pre><span class="gp">$ </span>pio 
<span class="nb">eval </span>org.template.textclassification.AccuracyEvaluation 
org.template.textclassification.EngineParamsList
+</pre></td></tr></tbody></table> </div> <p><strong>Note:</strong> Training and 
evaluation are generally separate stages of engine development. 
Evaluation is there to help you choose the best <a 
href="/evaluation/paramtuning/">algorithm parameters</a> to use for training an 
engine that is to be deployed as a web service.</p><p>Depending on your needs, 
in steps (5.x.) above, you can configure your Spark settings by typing a 
command of the form:</p><div class="highlight shell"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1</pre></td><td class="code"><pre><span class="gp">$ 
</span>pio <span class="nb">command </span>command_parameters -- --master url 
--driver-memory <span class="o">{</span>0<span class="o">}</span>G 
--executor-memory <span class="o">{</span>1<span class="o">}</span>G --conf 
spark.akka.frameSize<span class="o">={</span>2<span class="o">}</span> 
--total-executor-cores <span class="o">{</span>3<span
  class="o">}</span>
+</pre></td></tr></tbody></table> </div> <p>Only these options are listed 
because they are among the more commonly modified values. See the <a 
href="https://spark.apache.org/docs/latest/spark-standalone.html">Spark 
documentation</a> and the <a 
href="http://predictionio.incubator.apache.org/resources/faq/">PredictionIO 
FAQ</a> for more information.</p><p><strong>Note:</strong> We recommend 
you set your driver memory to <code>1G</code> or <code>2G</code> as the data 
size when dealing with text can be very large.</p><h1 
id='detailed-explanation-of-dase' class='header-anchors'>Detailed Explanation 
of DASE</h1><h2 id='importing-data' class='header-anchors'>Importing 
Data</h2><p>In the quick start, email spam classification is used. This 
template can easily be modified for other types of text classification.</p><p>If 
you want to import different sets of data, follow the Quick Start instructions 
to import data from different files. Make sure that the Data Source is modified 
accordingly to match the <code>event</code>, <code>entityType</code>, and 
<code>properties</code> fields set for the specific dataset. The following 
section explains this in more detail.</p><h2 
id='data-source:-reading-event-data' class='header-anchors'>Data Source: 
Reading Event Data</h2><p>Now that the data has been imported into 
PredictionIO&#39;s Event Server, it needs to be read from storage to be used by 
the engine. This is precisely what the DataSource engine component is for, 
which is implemented in the template file <code>DataSource.scala</code>. The 
class <code>Observation</code> serves as a wrapper for storing the information 
about a document needed to train a model. The attribute <code>label</code> refers to 
the label of the category a document belongs to, and <code>text</code> stores the actual 
document content as a string. The class <code>TrainingData</code> is used to store an RDD of 
<code>Observation</code> objects along with the set of stop words.</p><p>The class 
<code>DataSourceParams</code> is used to specify the parameters needed to 
read and prepare the data for processing. This class is 
initialized with two parameters, <code>appName</code> and <code>evalK</code>. 
The first parameter specifies your application name (e.g. MyTextApp), which is 
needed so that the DataSource component knows where to pull the event data 
from. The second parameter is used for model evaluation and specifies the 
number of folds to use in <a 
href="http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29">cross-validation</a>
 when estimating a model performance metric.</p><p>The final and most important 
ingredient is the DataSource class. This is initialized with its corresponding 
parameter class, and extends <code>PDataSource</code>. This 
<strong>must</strong> implement the method <code>readTraining</code>, which 
returns an instance of type <code>TrainingData</code>. This method relies entirely on the 
private methods <code>readEventData</code> and <code>readStopWords</code>. Both of these 
functions read data observations as Event instances, create an RDD containing 
these events, and finally transform the RDD of events 
into an object of the appropriate type as seen below:</p><div class="highlight 
scala"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" 
style="text-align: right"><pre class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
+25
+26
+27
+28
+29
+30
+31
+32
+33
+34
+35</pre></td><td class="code"><pre><span class="o">...</span>
+<span class="k">private</span> <span class="k">def</span> <span 
class="n">readEventData</span><span class="o">(</span><span 
class="n">sc</span><span class="k">:</span> <span 
class="kt">SparkContext</span><span class="o">)</span> <span class="k">:</span> 
<span class="kt">RDD</span><span class="o">[</span><span 
class="kt">Observation</span><span class="o">]</span> <span class="k">=</span> 
<span class="o">{</span>
+    <span class="c1">//Get RDD of Events.
+</span>    <span class="nc">PEventStore</span><span class="o">.</span><span 
class="n">find</span><span class="o">(</span>
+      <span class="n">appName</span> <span class="k">=</span> <span 
class="n">dsp</span><span class="o">.</span><span class="n">appName</span><span 
class="o">,</span>
+      <span class="n">entityType</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span 
class="s">"content"</span><span class="o">),</span> <span class="c1">// specify 
data entity type
+</span>      <span class="n">eventNames</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span class="nc">List</span><span 
class="o">(</span><span class="s">"e-mail"</span><span class="o">))</span> 
<span class="c1">// specify data event name
+</span>
+      <span class="c1">// Convert collected RDD of events to an RDD of 
Observation
+</span>      <span class="c1">// objects.
+</span>    <span class="o">)(</span><span class="n">sc</span><span 
class="o">).</span><span class="n">map</span><span class="o">(</span><span 
class="n">e</span> <span class="k">=&gt;</span> <span class="o">{</span>
+      <span class="k">val</span> <span class="n">label</span> <span 
class="k">:</span> <span class="kt">String</span> <span class="o">=</span> 
<span class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">String</span><span 
class="o">](</span><span class="s">"label"</span><span class="o">)</span>
+      <span class="nc">Observation</span><span class="o">(</span>
+        <span class="k">if</span> <span class="o">(</span><span 
class="n">label</span> <span class="o">==</span> <span 
class="s">"spam"</span><span class="o">)</span> <span class="mf">1.0</span> 
<span class="k">else</span> <span class="mf">0.0</span><span class="o">,</span>
+        <span class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">String</span><span 
class="o">](</span><span class="s">"text"</span><span class="o">),</span>
+        <span class="n">label</span>
+      <span class="o">)</span>
+    <span class="o">}).</span><span class="n">cache</span>
+  <span class="o">}</span>
+
+  <span class="c1">// Helper function used to read stop words from the
+</span>  <span class="c1">// event server.
+</span>  <span class="k">private</span> <span class="k">def</span> <span 
class="n">readStopWords</span><span class="o">(</span><span class="n">sc</span> 
<span class="k">:</span> <span class="kt">SparkContext</span><span 
class="o">)</span> <span class="k">:</span> <span class="kt">Set</span><span 
class="o">[</span><span class="kt">String</span><span class="o">]</span> <span 
class="k">=</span> <span class="o">{</span>
+    <span class="nc">PEventStore</span><span class="o">.</span><span 
class="n">find</span><span class="o">(</span>
+      <span class="n">appName</span> <span class="k">=</span> <span 
class="n">dsp</span><span class="o">.</span><span class="n">appName</span><span 
class="o">,</span>
+      <span class="n">entityType</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span 
class="s">"resource"</span><span class="o">),</span>
+      <span class="n">eventNames</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span class="nc">List</span><span 
class="o">(</span><span class="s">"stopwords"</span><span class="o">))</span>
+
+    <span class="c1">//Convert collected RDD of strings to a string set.
+</span>    <span class="o">)(</span><span class="n">sc</span><span 
class="o">)</span>
+      <span class="o">.</span><span class="n">map</span><span 
class="o">(</span><span class="n">e</span> <span class="k">=&gt;</span> <span 
class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">String</span><span 
class="o">](</span><span class="s">"word"</span><span class="o">))</span>
+      <span class="o">.</span><span class="n">collect</span>
+      <span class="o">.</span><span class="n">toSet</span>
+  <span class="o">}</span>
+<span class="o">...</span>
+</pre></td></tr></tbody></table> </div> <p>Note that 
<code>readEventData</code> and <code>readStopWords</code> use different entity 
types and event names, but use the same application name. This is because the 
sample import script imports two different data types, documents and stop 
words. These field distinctions are required for distinguishing between the 
two. The method <code>readEval</code> is used to prepare the different 
cross-validation folds needed for evaluating your model and tuning 
hyperparameters.</p><p>Now, the default dataset used for training is contained in 
the file <code>data/emails.json</code> and contains a set of e-mail spam data. 
If we want to switch over to one of the other data sets we must make sure that 
the <code>eventNames</code> and <code>entityType</code> fields are changed 
accordingly.</p><p>In the <code>data/</code> directory, you will find different sets of data 
files for different types of text classification applications. The following shows 
one observation from each of the provided data files:</p> <ul> <li><code>emails.json</code>:</li> 
</ul> <div class="highlight shell"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1
+2</pre></td><td class="code"><pre><span class="o">{</span><span 
class="s2">"eventTime"</span>: <span 
class="s2">"2015-06-08T16:45:00.590+0000"</span>, <span 
class="s2">"entityId"</span>: 1, <span class="s2">"properties"</span>: <span 
class="o">{</span><span class="s2">"text"</span>: <span class="s2">"Subject: 
dobmeos with hgh my energy level has gone up ! stukm</span><span 
class="se">\n</span><span class="s2">introducing</span><span 
class="se">\n</span><span class="s2">doctor - formulated</span><span 
class="se">\n</span><span class="s2">hgh</span><span class="se">\n</span><span 
class="s2">human growth hormone - also called hgh</span><span 
class="se">\n</span><span class="s2">is referred to in medical science as the 
master hormone . it is very plentiful</span><span class="se">\n</span><span 
class="s2">when we are young , but near the age of twenty - one our bodies 
begin to produce</span><span class="se">\n</span><span class="s2">less of it . 
by the time we are forty nearly everyone is 
deficient in hgh ,</span><span class="se">\n</span><span class="s2">and at 
eighty our production has normally diminished at least 90 - 95 % .</span><span 
class="se">\n</span><span class="s2">advantages of hgh :</span><span 
class="se">\n</span><span class="s2">- increased muscle strength</span><span 
class="se">\n</span><span class="s2">- loss in body fat</span><span 
class="se">\n</span><span class="s2">- increased bone density</span><span 
class="se">\n</span><span class="s2">- lower blood pressure</span><span 
class="se">\n</span><span class="s2">- quickens wound healing</span><span 
class="se">\n</span><span class="s2">- reduces cellulite</span><span 
class="se">\n</span><span class="s2">- improved vision</span><span 
class="se">\n</span><span class="s2">- wrinkle disappearance</span><span 
class="se">\n</span><span class="s2">- increased skin thickness 
texture</span><span class="se">\n</span><span class="s2">- increased energy 
levels</span><span class="se">\n</span><span class="s2">- improved 
sleep and emotional stability</span><span class="se">\n</span><span 
class="s2">- improved memory and mental alertness</span><span 
class="se">\n</span><span class="s2">- increased sexual potency</span><span 
class="se">\n</span><span class="s2">- resistance to common illness</span><span 
class="se">\n</span><span class="s2">- strengthened heart muscle</span><span 
class="se">\n</span><span class="s2">- controlled cholesterol</span><span 
class="se">\n</span><span class="s2">- controlled mood swings</span><span 
class="se">\n</span><span class="s2">- new hair growth and color 
restore</span><span class="se">\n</span><span class="s2">read</span><span 
class="se">\n</span><span class="s2">more at this website</span><span 
class="se">\n</span><span class="s2">unsubscribe</span><span 
class="se">\n</span><span class="s2">"</span>, <span class="s2">"label"</span>: 
<span class="s2">"spam"</span><span class="o">}</span>, <span 
class="s2">"event"</span>: <span class="s2">"e-mail"</span>, <spa
 n class="s2">"entityType"</span>: <span class="s2">"content"</span><span 
class="o">}</span>
+
+</pre></td></tr></tbody></table> </div> <ul> 
<li><code>20newsgroups.json</code>:</li> </ul> <div class="highlight 
shell"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" 
style="text-align: right"><pre class="lineno">1</pre></td><td 
class="code"><pre><span class="o">{</span><span class="s2">"entityType"</span>: 
<span class="s2">"source"</span>, <span class="s2">"eventTime"</span>: <span 
class="s2">"2015-06-08T18:01:55.003+0000"</span>, <span 
class="s2">"event"</span>: <span class="s2">"documents"</span>, <span 
class="s2">"entityId"</span>: 1, <span class="s2">"properties"</span>: <span 
class="o">{</span><span class="s2">"category"</span>: <span 
class="s2">"sci.crypt"</span>, <span class="s2">"text"</span>: <span 
class="s2">"From: [email protected] (Rob deFriesse)</span><span 
class="se">\n</span><span class="s2">Subject: Can DES code be shipped to 
Canada?</span><span class="se">\n</span><span class="s2">Article-I.D.: 
fripp.1993Apr22.125402.27561</span><span class="se">\n</span><span 
class="s2">Reply-To: [email protected]</span><span 
class="se">\n</span><span class="s2">Organization: Cadre Technologies 
Inc.</span><span class="se">\n</span><span class="s2">Lines: 13</span><span 
class="se">\n</span><span class="s2">Nntp-Posting-Host: 
192.9.200.19</span><span class="se">\n\n</span><span class="s2">Someone in 
Canada asked me to send him some public domain DES file</span><span 
class="se">\n</span><span class="s2">encryption code I have.  Is it legal for 
me to send it?</span><span class="se">\n\n</span><span 
class="s2">Thanx.</span><span class="se">\n</span><span 
class="s2">--</span><span class="se">\n</span><span class="s2">Eschew 
Obfuscation</span><span class="se">\n\n</span><span class="s2">Rob deFriesse    
                Mail:  [email protected]</span><span class="se">\n</span><span 
class="s2">Cadre Technologies Inc.          Phone:  (401) 351-5950</span><span 
class="se">\n</span><span class="s2">222 Richmond St.                 Fax:    
(401) 351-7380</span><span class="se">\n</span><span 
class="s2">Providence, RI  
02903</span><span class="se">\n\n</span><span class="s2">I don't speak for my 
employer.</span><span class="se">\n</span><span class="s2">"</span>, <span 
class="s2">"label"</span>: 11.0<span class="o">}}</span>
+</pre></td></tr></tbody></table> </div> <ul> 
<li><code>sentimentanalysis.json</code>:</li> </ul> <div class="highlight 
shell"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" 
style="text-align: right"><pre class="lineno">1</pre></td><td 
class="code"><pre><span class="o">{</span><span class="s2">"eventTime"</span>: 
<span class="s2">"2015-06-08T16:58:14.278+0000"</span>, <span 
class="s2">"entityId"</span>: 23714, <span class="s2">"entityType"</span>: 
<span class="s2">"source"</span>, <span class="s2">"properties"</span>: <span 
class="o">{</span><span class="s2">"phrase"</span>: <span class="s2">"Tosca 's 
intoxicating ardor"</span>, <span class="s2">"sentiment"</span>: 3<span 
class="o">}</span>, <span class="s2">"event"</span>: <span 
class="s2">"phrases"</span><span class="o">}</span>
+</pre></td></tr></tbody></table> </div> <p>Note that the 
<code>entityType</code>, <code>event</code>, and <code>properties</code> fields 
for the <code>20newsgroups.json</code> dataset differ from the default 
<code>emails.json</code> set. The default DataSource implementation reads from the 
<code>emails.json</code> data set. If you want to use another, such as the newsgroups 
data set, the engine&#39;s Data Source component must be modified accordingly. 
To do this, you need only modify the method <code>readEventData</code> as 
follows:</p><h3 id='modify-datasource-to-read-<code>20newsgroups.json</code>' 
class='header-anchors' >Modify DataSource to Read 
<code>20newsgroups.json</code></h3><div class="highlight scala"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18</pre></td><td class="code"><pre><span class="k">private</span> <span 
class="k">def</span> <span class="n">readEventData</span><span 
class="o">(</span><span class="n">sc</span><span class="k">:</span> <span 
class="kt">SparkContext</span><span class="o">)</span> <span class="k">:</span> 
<span class="kt">RDD</span><span class="o">[</span><span 
class="kt">Observation</span><span class="o">]</span> <span class="k">=</span> 
<span class="o">{</span>
+    <span class="c1">//Get RDD of Events.
+</span>    <span class="nc">PEventStore</span><span class="o">.</span><span 
class="n">find</span><span class="o">(</span>
+      <span class="n">appName</span> <span class="k">=</span> <span 
class="n">dsp</span><span class="o">.</span><span class="n">appName</span><span 
class="o">,</span>
+      <span class="n">entityType</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span 
class="s">"source"</span><span class="o">),</span> <span class="c1">// specify 
data entity type
+</span>      <span class="n">eventNames</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span class="nc">List</span><span 
class="o">(</span><span class="s">"documents"</span><span class="o">))</span> 
<span class="c1">// specify data event name
+</span>
+      <span class="c1">// Convert collected RDD of events to an RDD of 
Observation
+</span>      <span class="c1">// objects.
+</span>    <span class="o">)(</span><span class="n">sc</span><span 
class="o">).</span><span class="n">map</span><span class="o">(</span><span 
class="n">e</span> <span class="k">=&gt;</span> <span class="o">{</span>
+
+      <span class="nc">Observation</span><span class="o">(</span>
+        <span class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">Double</span><span 
class="o">](</span><span class="s">"label"</span><span class="o">),</span>
+        <span class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">String</span><span 
class="o">](</span><span class="s">"text"</span><span class="o">),</span>
+        <span class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">String</span><span 
class="o">](</span><span class="s">"category"</span><span class="o">)</span>
+      <span class="o">)</span>
+    <span class="o">}).</span><span class="n">cache</span>
+  <span class="o">}</span>
+</pre></td></tr></tbody></table> </div> <h3 
id='modify-datasource-to-read-<code>sentimentanalysis.json</code>' 
class='header-anchors' >Modify DataSource to Read 
<code>sentimentanalysis.json</code></h3><div class="highlight scala"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19</pre></td><td class="code"><pre><span class="k">private</span> <span 
class="k">def</span> <span class="n">readEventData</span><span 
class="o">(</span><span class="n">sc</span><span class="k">:</span> <span 
class="kt">SparkContext</span><span class="o">)</span> <span class="k">:</span> 
<span class="kt">RDD</span><span class="o">[</span><span 
class="kt">Observation</span><span class="o">]</span> <span class="k">=</span> 
<span class="o">{</span>
+    <span class="c1">//Get RDD of Events.
+</span>    <span class="nc">PEventStore</span><span class="o">.</span><span 
class="n">find</span><span class="o">(</span>
+      <span class="n">appName</span> <span class="k">=</span> <span 
class="n">dsp</span><span class="o">.</span><span class="n">appName</span><span 
class="o">,</span>
+      <span class="n">entityType</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span 
class="s">"source"</span><span class="o">),</span> <span class="c1">// specify 
data entity type
+</span>      <span class="n">eventNames</span> <span class="k">=</span> <span 
class="nc">Some</span><span class="o">(</span><span class="nc">List</span><span 
class="o">(</span><span class="s">"phrases"</span><span class="o">))</span> 
<span class="c1">// specify data event name
+</span>
+      <span class="c1">// Convert collected RDD of events to an RDD of 
Observation
+</span>      <span class="c1">// objects.
+</span>    <span class="o">)(</span><span class="n">sc</span><span 
class="o">).</span><span class="n">map</span><span class="o">(</span><span 
class="n">e</span> <span class="k">=&gt;</span> <span class="o">{</span>
+      <span class="k">val</span> <span class="n">label</span> <span 
class="k">=</span> <span class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">Double</span><span 
class="o">](</span><span class="s">"sentiment"</span><span class="o">)</span>
+
+      <span class="nc">Observation</span><span class="o">(</span>
+        <span class="n">label</span><span class="o">,</span>
+        <span class="n">e</span><span class="o">.</span><span 
class="n">properties</span><span class="o">.</span><span 
class="n">get</span><span class="o">[</span><span class="kt">String</span><span 
class="o">](</span><span class="s">"phrase"</span><span class="o">),</span>
+        <span class="n">label</span><span class="o">.</span><span 
class="n">toString</span>
+      <span class="o">)</span>
+    <span class="o">}).</span><span class="n">cache</span>
+  <span class="o">}</span>
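+
+// Illustrative sketch, not part of the template: in the method above the
+// category string comes from label.toString, so a sentiment of 3 ends up as
+// the category "3.0" (on the JVM). LabelToCategorySketch is a hypothetical
+// name used only for this example.
+object LabelToCategorySketch {
+  // Mirrors the label.toString call used when building an Observation.
+  def category(sentiment: Double): String = sentiment.toString
+
+  def main(args: Array[String]): Unit =
+    println(category(3)) // prints "3.0"
+}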
+</pre></td></tr></tbody></table> </div> <p>Note that the <code>event</code> field 
in the JSON file corresponds to the <code>eventNames</code> field in the 
<code>readEventData</code> method. When using this engine with a custom data 
set, you need to make sure that the respective JSON fields match the 
corresponding fields in the DataSource component. We have included a data 
sanity check with this engine component that lets you know if your data is 
actually being read in. If you have 0 observations being read, you should see 
the following output when your training process performs the Training Data 
sanity check:</p><p><code>Data set is empty, make sure event fields match 
imported data.</code></p><p>This data sanity check is a PredictionIO feature 
available for your <code>TrainingData</code> and <code>PreparedData</code> 
classes. The following code block demonstrates how the sanity check is 
implemented:</p><div class="highlight scala"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
+25
+26
+27</pre></td><td class="code"><pre><span class="k">class</span> <span 
class="nc">TrainingData</span><span class="o">(</span>
+  <span class="k">val</span> <span class="n">data</span> <span 
class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span 
class="kt">Observation</span><span class="o">],</span>
+  <span class="k">val</span> <span class="n">stopWords</span> <span 
class="k">:</span> <span class="kt">Set</span><span class="o">[</span><span 
class="kt">String</span><span class="o">]</span>
+<span class="o">)</span> <span class="k">extends</span> <span 
class="nc">Serializable</span> <span class="k">with</span> <span 
class="nc">SanityCheck</span> <span class="o">{</span>
+
+  <span class="c1">// Sanity check to make sure your data is being fed in 
correctly.
+</span>
+  <span class="k">def</span> <span class="n">sanityCheck</span> <span 
class="o">{</span>
+    <span class="k">try</span> <span class="o">{</span>
+      <span class="k">val</span> <span class="n">obs</span> <span 
class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span 
class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span 
class="n">data</span><span class="o">.</span><span 
class="n">takeSample</span><span class="o">(</span><span 
class="kc">false</span><span class="o">,</span> <span class="mi">5</span><span 
class="o">).</span><span class="n">map</span><span class="o">(</span><span 
class="k">_</span><span class="o">.</span><span class="n">label</span><span 
class="o">)</span>
+
+      <span class="n">println</span><span class="o">()</span>
+      <span class="o">(</span><span class="mi">0</span> <span 
class="n">until</span> <span class="mi">5</span><span class="o">).</span><span 
class="n">foreach</span><span class="o">(</span>
+        <span class="n">k</span> <span class="k">=&gt;</span> <span 
class="n">println</span><span class="o">(</span><span class="s">"Observation 
"</span> <span class="o">+</span> <span class="o">(</span><span 
class="n">k</span> <span class="o">+</span> <span class="mi">1</span><span 
class="o">)</span> <span class="o">+</span><span class="s">" label: "</span> 
<span class="o">+</span> <span class="n">obs</span><span 
class="o">(</span><span class="n">k</span><span class="o">))</span>
+      <span class="o">)</span>
+      <span class="n">println</span><span class="o">()</span>
+    <span class="o">}</span> <span class="k">catch</span> <span 
class="o">{</span>
+      <span class="k">case</span> <span class="o">(</span><span 
class="n">e</span> <span class="k">:</span> <span 
class="kt">ArrayIndexOutOfBoundsException</span><span class="o">)</span> <span 
class="k">=&gt;</span> <span class="o">{</span>
+        <span class="n">println</span><span class="o">()</span>
+        <span class="n">println</span><span class="o">(</span><span 
class="s">"Data set is empty, make sure event fields match imported 
data."</span><span class="o">)</span>
+        <span class="n">println</span><span class="o">()</span>
+      <span class="o">}</span>
+    <span class="o">}</span>
+
+  <span class="o">}</span>
+
+<span class="o">}</span>
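+
+// Illustrative sketch, not part of the template: why the catch branch fires.
+// Sampling an empty data set yields an empty array, so obs(0) throws
+// ArrayIndexOutOfBoundsException. The same branch is taken whenever fewer
+// than five observations are sampled. SanityCheckSketch and describeLabels
+// are hypothetical names; a plain Array stands in for the sampled labels.
+object SanityCheckSketch {
+  def describeLabels(obs: Array[Double]): String =
+    try {
+      // Range.map is strict, so an out-of-range index throws inside the try.
+      (0 until 5)
+        .map(k => "Observation " + (k + 1) + " label: " + obs(k))
+        .mkString("\n")
+    } catch {
+      case _: ArrayIndexOutOfBoundsException =>
+        "Data set is empty, make sure event fields match imported data."
+    }
+}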
+</pre></td></tr></tbody></table> </div> <h2 
id='preparator-:-data-processing-with-dase' class='header-anchors'>Preparator : 
Data Processing With DASE</h2><p>Recall that the Preparator stage is used for 
doing any prior data processing needed to fit a predictive model. In line with 
the separation of concerns, the Data Model implementation, PreparedData, is 
built to do the heavy lifting needed for this data processing. The Preparator 
must simply implement the prepare method which outputs an object of type 
PreparedData. This requires you to specify two n-gram window components, and 
two inverse i.d.f. window components (these terms will be defined in the 
following section). Therefore a custom class of parameters for the Preparator 
component, PreparatorParams, must be incorporated. The code defining the full 
Preparator component is given below:</p><div class="highlight scala"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22</pre></td><td class="code"><pre><span class="c1">// 1. Initialize 
Preparator parameters. Recall that for our data
+// representation we are only required to input the n-gram window
+// components.
+</span>
+<span class="k">case</span> <span class="k">class</span> <span 
class="nc">PreparatorParams</span><span class="o">(</span>
+  <span class="n">nGram</span><span class="k">:</span> <span 
class="kt">Int</span><span class="o">,</span>
+  <span class="n">numFeatures</span><span class="k">:</span> <span 
class="kt">Int</span> <span class="o">=</span> <span 
class="mi">5000</span><span class="o">,</span>
+  <span class="nc">SPPMI</span><span class="k">:</span> <span 
class="kt">Boolean</span>
+<span class="o">)</span> <span class="k">extends</span> <span 
class="nc">Params</span>
+
+
+
+<span class="c1">// 2. Initialize your Preparator class.
+</span>
+<span class="k">class</span> <span class="nc">Preparator</span><span 
class="o">(</span><span class="n">pp</span><span class="k">:</span> <span 
class="kt">PreparatorParams</span><span class="o">)</span> <span 
class="k">extends</span> <span class="nc">PPreparator</span><span 
class="o">[</span><span class="kt">TrainingData</span>, <span 
class="kt">PreparedData</span><span class="o">]</span> <span class="o">{</span>
+
+  <span class="c1">// Prepare your training data.
+</span>  <span class="k">def</span> <span class="n">prepare</span><span 
class="o">(</span><span class="n">sc</span> <span class="k">:</span> <span 
class="kt">SparkContext</span><span class="o">,</span> <span 
class="n">td</span><span class="k">:</span> <span 
class="kt">TrainingData</span><span class="o">)</span><span class="k">:</span> 
<span class="kt">PreparedData</span> <span class="o">=</span> <span 
class="o">{</span>
+    <span class="k">new</span> <span class="nc">PreparedData</span><span 
class="o">(</span><span class="n">td</span><span class="o">,</span> <span 
class="n">pp</span><span class="o">.</span><span class="n">nGram</span><span 
class="o">)</span>
+  <span class="o">}</span>
+<span class="o">}</span>
+
+</pre></td></tr></tbody></table> </div> <p>The simplicity of this stage 
implementation illustrates one of the benefits of using the PredictionIO 
platform. For developers, it is easy to incorporate different classes and tools 
into the DASE framework, which greatly simplifies the process of creating an 
engine and helps increase your productivity. For data scientists, the 
load of implementation details you need to worry about is minimized so that you 
can focus on what is important to you: training a good predictive 
model.</p><p>The following subsection explains the class PreparedData, which 
actually handles the transformation of text documents to feature 
vectors.</p><h3 id='prepareddata:-text-vectorization-and-feature-reduction' 
class='header-anchors'>PreparedData: Text Vectorization and Feature 
Reduction</h3><p>The Scala class PreparedData takes two parameters, td and 
nGram, where td is an object of class TrainingData and nGram sets the n-gram 
size, which is described shortly.</p><p>It will be easier to explain the 
preparation process with an example, so consider the document 
\(d\):</p><p><code>&quot;Hello, my name is Marco.&quot;</code></p><p>The first 
thing you need to do is break up \(d\) into an array of &quot;allowed 
tokens.&quot; You can think of a token as a self-contained sequence of characters 
that appears in a document (such as a word in a sentence). For example, the list 
of tokens that appear in \(d\) is:</p><div class="highlight scala"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1</pre></td><td class="code"><pre><span 
class="k">val</span> <span class="n">A</span> <span class="k">=</span> <span 
class="nc">Array</span><span class="o">(</span><span 
class="s">"Hello"</span><span class="o">,</span> <span 
class="s">","</span><span class="o">,</span> <span class="s">"my"</span><span 
class="o">,</span>  <span class="s">"name"</span><span class="o">,</span> 
<span class="s">"is"</span><span class="o">,</span> <span 
class="s">"Marco"</span><span class="o">,</span> <span 
class="s">"."</span><span class="o">)</span>
+</pre></td></tr></tbody></table> </div> <p>Recall that a set of stop words was 
also imported in the previous sections. This set of stop words contains all the 
words (or tokens) that you do not want to include once documents are tokenized. 
Those tokens that appear in \(d\) and are not contained in the set of stop 
words will be called allowed tokens. So, if the set of stop words is 
<code>{&quot;my&quot;, &quot;is&quot;}</code>, then the list of allowed tokens 
appearing in \(d\) is:</p><div class="highlight scala"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1</pre></td><td class="code"><pre><span 
class="k">val</span> <span class="n">A</span> <span class="k">=</span> <span 
class="nc">Array</span><span class="o">(</span><span 
class="s">"Hello"</span><span class="o">,</span> <span 
class="s">","</span><span class="o">,</span>  <span 
class="s">"name"</span><span class="o">,</span> <span 
class="s">"Marco"</span><span class="o">,</span> <span 
class="s">"."</span><span class="o">)</span>
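+
+// Illustrative sketch, not part of the template: the array above can be
+// derived by filtering the full token list against the stop-word set.
+// The names tokens, stopWords and allowed are hypothetical.
+val tokens = Array("Hello", ",", "my", "name", "is", "Marco", ".")
+val stopWords = Set("my", "is")
+val allowed = tokens.filterNot(stopWords.contains)
+// allowed has the same elements as A: "Hello", ",", "name", "Marco", "."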
+</pre></td></tr></tbody></table> </div> <p>The next step in the data 
representation is to take the array of allowed tokens and extract a set of 
n-grams and a corresponding value indicating the number of times a given n-gram 
appears. The set of n-grams for n equal to 1 and 2 in the running example is 
the set of elements of the form <code>[A(</code>\(i\)<code>)]</code> and 
<code>[A(</code>\(j\)<code>), A(</code>\(j + 1\)<code>)]</code>, respectively. 
In the general case, the set of n-grams extracted from an array of allowed 
tokens <code>A</code> will be of the form <code>[A(</code>\(i\)<code>), 
A(</code>\(i + 1\)<code>), ..., A(</code>\(i + n - 1\)<code>)]</code> for \(i = 
0, 1, 2, ...,\) <code>A.size</code> \(- n\). You can set <code>n</code> with 
the <code>nGram</code> parameter option in your 
<code>PreparatorParams</code>.</p><p>MLlib&#39;s <code>HashingTF</code> class 
converts text to term frequency vectors, as shown in the following method of 
the class <code>PreparedData</code>:</p><div class="highlight scala"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18</pre></td><td class="code"><pre><span class="o">...</span>
+   <span class="c1">// 1. Hashing function: Text -&gt; term frequency vector.
+</span>
+  <span class="k">private</span> <span class="k">val</span> <span 
class="n">hasher</span> <span class="k">=</span> <span class="k">new</span> 
<span class="nc">HashingTF</span><span class="o">()</span>
+
+  <span class="k">private</span> <span class="k">def</span> <span 
class="n">hashTF</span> <span class="o">(</span><span class="n">text</span> 
<span class="k">:</span> <span class="kt">String</span><span class="o">)</span> 
<span class="k">:</span> <span class="kt">Vector</span> <span 
class="o">=</span> <span class="o">{</span>
+    <span class="k">val</span> <span class="n">newList</span> <span 
class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span 
class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span 
class="n">text</span><span class="o">.</span><span class="n">split</span><span 
class="o">(</span><span class="s">" "</span><span class="o">)</span>
+    <span class="o">.</span><span class="n">sliding</span><span 
class="o">(</span><span class="n">nGram</span><span class="o">)</span>
+    <span class="o">.</span><span class="n">map</span><span 
class="o">(</span><span class="k">_</span><span class="o">.</span><span 
class="n">mkString</span><span class="o">)</span>
+    <span class="o">.</span><span class="n">toArray</span>
+
+    <span class="n">hasher</span><span class="o">.</span><span 
class="n">transform</span><span class="o">(</span><span 
class="n">newList</span><span class="o">)</span>
+  <span class="o">}</span>
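+
+  // Illustrative sketch, not part of the template: what the sliding and
+  // mkString steps in hashTF produce for the running example's allowed
+  // tokens with nGram = 2. The name exampleBigrams is hypothetical.
+  // Note that mkString joins the tokens of each window with no separator.
+  private val exampleBigrams: Array[String] =
+    Array("Hello", ",", "name", "Marco", ".").sliding(2).map(_.mkString).toArray
+  // exampleBigrams: Array("Hello,", ",name", "nameMarco", "Marco.")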
+
+  <span class="c1">// 2. Term frequency vector -&gt; t.f.-i.d.f. vector.
+</span>
+  <span class="k">val</span> <span class="n">idf</span> <span 
class="k">:</span> <span class="kt">IDFModel</span> <span class="o">=</span> 
<span class="k">new</span> <span class="nc">IDF</span><span 
class="o">().</span><span class="n">fit</span><span class="o">(</span><span 
class="n">td</span><span class="o">.</span><span class="n">data</span><span 
class="o">.</span><span class="n">map</span><span class="o">(</span><span 
class="n">e</span> <span class="k">=&gt;</span> <span 
class="n">hashTF</span><span class="o">(</span><span class="n">e</span><span 
class="o">.</span><span class="n">text</span><span class="o">)))</span>
+<span class="o">...</span>
+</pre></td></tr></tbody></table> </div> <p>Once all of the observations have 
been hashed, the next step is to collect all n-grams and compute their 
corresponding <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">t.f.-i.d.f. 
value</a>. The t.f.-i.d.f. transformation is defined for n-grams: it gives 
less weight to n-grams that appear in many documents and more weight to those 
that appear in few. This lets the model exploit n-grams that are rare overall 
but can make a big difference in the categorization of a given text document. 
This is implemented using MLlib&#39;s 
<code>IDF</code> and <code>IDFModel</code> classes:</p><div class="highlight 
scala"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" 
style="text-align: right"><pre class="lineno">1
+2
+3</pre></td><td class="code"><pre><span class="c1">// 2. Term frequency vector 
-&gt; t.f.-i.d.f. vector.
+</span>
+  <span class="k">val</span> <span class="n">idf</span> <span 
class="k">:</span> <span class="kt">IDFModel</span> <span class="o">=</span> 
<span class="k">new</span> <span class="nc">IDF</span><span 
class="o">().</span><span class="n">fit</span><span class="o">(</span><span 
class="n">td</span><span class="o">.</span><span class="n">data</span><span 
class="o">.</span><span class="n">map</span><span class="o">(</span><span 
class="n">e</span> <span class="k">=&gt;</span> <span 
class="n">hashTF</span><span class="o">(</span><span class="n">e</span><span 
class="o">.</span><span class="n">text</span><span class="o">)))</span>
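+
+  // Illustrative sketch, not part of the template: the per-term weight the
+  // fitted model applies, assuming the smoothed formula given in the Spark
+  // MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the
+  // number of documents and df the number of documents containing the term.
+  // The helper name idfWeight is hypothetical.
+  def idfWeight(numDocs: Long, docFreq: Long): Double =
+    math.log((numDocs + 1.0) / (docFreq + 1.0))
+  // A term that appears in every document gets weight log(1) = 0.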
+</pre></td></tr></tbody></table> </div> <p>The last two members to mention are 
the ones you will actually use for the data transformation. The method 
<code>transform</code> takes a document and outputs a sparse vector (MLlib 
implementation). The value <code>transformedData</code> holds the TrainingData 
input (a corpus of documents) converted into a set of labeled vectors that can 
be used for training. The method <code>transform</code> is used to vectorize 
both the training data and 
future queries.</p><div class="highlight scala"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14</pre></td><td class="code"><pre><span class="o">...</span>
+<span class="c1">// 3. Document Transformer: text =&gt; tf-idf vector.
+</span>
+  <span class="k">def</span> <span class="n">transform</span><span 
class="o">(</span><span class="n">text</span> <span class="k">:</span> <span 
class="kt">String</span><span class="o">)</span><span class="k">:</span> <span 
class="kt">Vector</span> <span class="o">=</span> <span class="o">{</span>
+    <span class="c1">// Map(n-gram -&gt; document tf)
+</span>    <span class="n">idf</span><span class="o">.</span><span 
class="n">transform</span><span class="o">(</span><span 
class="n">hashTF</span><span class="o">(</span><span class="n">text</span><span 
class="o">))</span>
+  <span class="o">}</span>
+
+
+  <span class="c1">// 4. Data Transformer: RDD[documents] =&gt; 
RDD[LabeledPoints]
+</span>
+  <span class="k">val</span> <span class="n">transformedData</span><span 
class="k">:</span> <span class="kt">RDD</span><span class="o">[(</span><span 
class="kt">LabeledPoint</span><span class="o">)]</span> <span 
class="k">=</span> <span class="o">{</span>
+    <span class="n">td</span><span class="o">.</span><span 
class="n">data</span><span class="o">.</span><span class="n">map</span><span 
class="o">(</span><span class="n">e</span> <span class="k">=&gt;</span> <span 
class="nc">LabeledPoint</span><span class="o">(</span><span 
class="n">e</span><span class="o">.</span><span class="n">label</span><span 
class="o">,</span> <span class="n">transform</span><span 
class="o">(</span><span class="n">e</span><span class="o">.</span><span 
class="n">text</span><span class="o">)))</span>
+  <span class="o">}</span>
+</pre></td></tr></tbody></table> </div> <p>The final member of this class is a 
Map whose keys are the class labels and whose values are the corresponding 
categories.</p><div class="highlight scala"><table 
style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: 
right"><pre class="lineno">1
+2</pre></td><td class="code"><pre> <span class="c1">// 5. Finally extract 
category map, associating label to category.
+</span>  <span class="k">val</span> <span class="n">categoryMap</span> <span 
class="k">=</span> <span class="n">td</span><span class="o">.</span><span 
class="n">data</span><span class="o">.</span><span class="n">map</span><span 
class="o">(</span><span class="n">e</span> <span class="k">=&gt;</span> <span 
class="o">(</span><span class="n">e</span><span class="o">.</span><span 
class="n">label</span><span class="o">,</span> <span class="n">e</span><span 
class="o">.</span><span class="n">category</span><span 
class="o">)).</span><span class="n">collectAsMap</span>
+</pre></td></tr></tbody></table> </div> <h2 id='algorithm-component' 
class='header-anchors'>Algorithm Component</h2><p>The algorithm components in 
this engine, <code>NBAlgorithm</code> and <code>LRAlgorithm</code>, follow a 
very general form. First, a parameter class must again be defined to pass in 
the corresponding model parameters. For example, NBAlgorithm takes an 
NBAlgorithmParams, which holds the additive smoothing parameter lambda for 
the Naive Bayes model.</p><p>The main 
class of interest in this component is the class that extends <a 
href="https://predictionio.incubator.apache.org/api/current/#org.apache.predictionio.controller.P2LAlgorithm">P2LAlgorithm</a>.
 This class must implement a method named train, which outputs your 
predictive model (a concrete object, implemented here as a Scala class). It 
must also implement a predict method that transforms a query into an 
appropriate feature vector and uses it to predict with the fitted model. The 
vectorization function is implemented by a PreparedData object, and the 
categorization (prediction) is handled by an instance of the NBModel 
implementation. Again, this demonstrates the ease with which different models 
can be incorporated into PredictionIO&#39;s DASE 
architecture.</p><p>The model class itself is discussed in the following 
section. First, turn your attention to the TextManipulationEngine object 
defined in the file <code>Engine.scala</code>. You can see here that the 
engine is initialized by specifying the DataSource, Preparator, and Serving 
classes, as well as a Map of algorithm names to Algorithm classes. This tells 
the engine which algorithms to run. In practice, you can have as many 
statistical learning models as you&#39;d like; you simply have to implement a 
new algorithm component for each one. This general design will persist, 
however, and most of the work should go into the implementation of your model 
class.</p><p>The following subsection will go over our Naive Bayes implementation in 
NBModel.</p><h3 id='naive-bayes-classification' class='header-anchors'>Naive 
Bayes Classification</h3><p>This model class uses only the 
Multinomial Naive Bayes <a 
href="https://spark.apache.org/docs/latest/mllib-naive-bayes.html">implementation</a>
 found in the Spark MLlib library. However, recall that the predicted results 
required in the specifications listed in the overview are of the 
form:</p><p><code>{category: String, confidence: Double}</code>.</p><p>The 
confidence value should be interpreted as the probability that a 
document belongs to a category given its vectorized form. Note that 
MLlib&#39;s Naive Bayes model has the class members pi (\(\pi\)) and theta 
(\(\theta\)). \(\pi\) is a vector of log prior class probabilities, which 
encode your prior beliefs regarding the probability that an arbitrary document 
belongs to a category. \(\theta\) is a C \(\times\) D matrix, where C is the 
number of classes and D is the number of features, giving the log probabilities 
that parametrize the Multinomial likelihood model assumed for each class. The 
multinomial model is easiest to think about as a problem of randomly throwing 
balls into bins, where the ball lands in each bin with a certain probability. 
The model treats each n-gram as a bin, and the corresponding t.f.-i.d.f. value 
as the number of balls thrown into it. The likelihood is the probability of 
observing a (vectorized) document given that it comes from a particular 
class.</p><p>Now, letting \(\mathbf{x}\) be a vectorized text document, it 
can be shown that the vector</p><p>$$ \frac{\exp\left(\pi + 
\theta\mathbf{x}\right)}{\left|\left|\exp\left(\pi + 
\theta\mathbf{x}\right)\right|\right|_1} $$</p><p>is a vector with C components 
that represent the posterior class membership probabilities for the document 
given \(\mathbf{x}\). Here \(\exp\) is applied componentwise and 
\(\left|\left|\cdot\right|\right|_1\) is the sum of the components. That is, 
the vector is the updated belief regarding what category this document 
belongs to after observing its vectorized form. This is the motivation behind 
defining the class NBModel, which uses Spark MLlib&#39;s NaiveBayesModel but 
implements a separate 
prediction method.</p><p>The private methods innerProduct and getScores are 
implemented to do the matrix computation above.</p><div class="highlight 
scala"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" 
style="text-align: right"><pre class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
+25
+26
+27
+28
+29
+30
+31
+32
+33</pre></td><td class="code"><pre><span class="o">...</span>
+ <span class="c1">// 2. Set up linear algebra framework.
+</span>
+  <span class="k">private</span> <span class="k">def</span> <span 
class="n">innerProduct</span> <span class="o">(</span><span class="n">x</span> 
<span class="k">:</span> <span class="kt">Array</span><span 
class="o">[</span><span class="kt">Double</span><span class="o">],</span> <span 
class="n">y</span> <span class="k">:</span> <span class="kt">Array</span><span 
class="o">[</span><span class="kt">Double</span><span class="o">])</span> <span 
class="k">:</span> <span class="kt">Double</span> <span class="o">=</span> 
<span class="o">{</span>
+    <span class="n">x</span><span class="o">.</span><span 
class="n">zip</span><span class="o">(</span><span class="n">y</span><span 
class="o">).</span><span class="n">map</span><span class="o">(</span><span 
class="n">e</span> <span class="k">=&gt;</span> <span class="n">e</span><span 
class="o">.</span><span class="n">_1</span> <span class="o">*</span> <span 
class="n">e</span><span class="o">.</span><span class="n">_2</span><span 
class="o">).</span><span class="n">sum</span>
+  <span class="o">}</span>
+
+  <span class="k">val</span> <span class="n">normalize</span> <span 
class="k">=</span> <span class="o">(</span><span class="n">u</span><span 
class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span 
class="kt">Double</span><span class="o">])</span> <span class="k">=&gt;</span> 
<span class="o">{</span>
+    <span class="k">val</span> <span class="n">uSum</span> <span 
class="k">=</span> <span class="n">u</span><span class="o">.</span><span 
class="n">sum</span>
+
+    <span class="n">u</span><span class="o">.</span><span 
class="n">map</span><span class="o">(</span><span class="n">e</span> <span 
class="k">=&gt;</span> <span class="n">e</span> <span class="o">/</span> <span 
class="n">uSum</span><span class="o">)</span>
+  <span class="o">}</span>
+
+
+
+  <span class="c1">// 3. Given a document string, return a vector of 
corresponding
+</span>  <span class="c1">// class membership probabilities.
+</span>
+  <span class="k">private</span> <span class="k">def</span> <span 
class="n">getScores</span><span class="o">(</span><span 
class="n">doc</span><span class="k">:</span> <span 
class="kt">String</span><span class="o">)</span><span class="k">:</span> <span 
class="kt">Array</span><span class="o">[</span><span 
class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span 
class="o">{</span>
+    <span class="c1">// Helper function used to normalize probability scores.
+</span>    <span class="c1">// Returns an object of type Array[Double]
+</span>
+    <span class="c1">// Vectorize query,
+</span>    <span class="k">val</span> <span class="n">x</span><span 
class="k">:</span> <span class="kt">Vector</span> <span class="o">=</span> 
<span class="n">pd</span><span class="o">.</span><span 
class="n">transform</span><span class="o">(</span><span 
class="n">doc</span><span class="o">)</span>
+
+    <span class="n">normalize</span><span class="o">(</span>
+      <span class="n">nb</span><span class="o">.</span><span 
class="n">pi</span>
+      <span class="o">.</span><span class="n">zip</span><span 
class="o">(</span><span class="n">nb</span><span class="o">.</span><span 
class="n">theta</span><span class="o">)</span>
+      <span class="o">.</span><span class="n">map</span><span 
class="o">(</span>
+      <span class="n">e</span> <span class="k">=&gt;</span> <span 
class="n">exp</span><span class="o">(</span><span 
class="n">innerProduct</span><span class="o">(</span><span 
class="n">e</span><span class="o">.</span><span class="n">_2</span><span 
class="o">,</span> <span class="n">x</span><span class="o">.</span><span 
class="n">toArray</span><span class="o">)</span> <span class="o">+</span> <span 
class="n">e</span><span class="o">.</span><span class="n">_1</span><span 
class="o">))</span>
+    <span class="o">)</span>
+  <span class="o">}</span>
+<span class="o">...</span>
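+
+// Illustrative sketch, not part of the template: the same posterior
+// computation on plain arrays with made-up numbers. The object name
+// PosteriorSketch and its members are hypothetical.
+object PosteriorSketch {
+  // exp(pi_c + theta_c . x) for each class c, rescaled to sum to 1.
+  def posterior(pi: Array[Double], theta: Array[Array[Double]], x: Array[Double]): Array[Double] = {
+    val scores = pi.zip(theta).map { case (logPrior, row) =&gt;
+      math.exp(logPrior + row.zip(x).map { case (t, v) =&gt; t * v }.sum)
+    }
+    val total = scores.sum
+    scores.map(_ / total)
+  }
+
+  // Two equally likely classes, two features, one occurrence of feature 0.
+  val p: Array[Double] = posterior(
+    Array(math.log(0.5), math.log(0.5)),
+    Array(Array(math.log(0.8), math.log(0.2)), Array(math.log(0.2), math.log(0.8))),
+    Array(1.0, 0.0))
+  // p is approximately Array(0.8, 0.2)
+}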
+</pre></td></tr></tbody></table> </div> <p>Once you have a vector of class 
probabilities, you can classify the text document to the category with highest 
posterior probability, and, finally, return both the category as well as the 
probability of belonging to that category (i.e. the confidence in the 
prediction) given the observed data. This is implemented in the method 
predict.</p><div class="highlight scala"><table style="border-spacing: 
0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre 
class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9</pre></td><td class="code"><pre><span class="o">...</span>
+  <span class="c1">// 4. Implement predict method for our model using
+</span>  <span class="c1">// the prediction rule given in tutorial.
+</span>
+  <span class="k">def</span> <span class="n">predict</span><span 
class="o">(</span><span class="n">doc</span> <span class="k">:</span> <span 
class="kt">String</span><span class="o">)</span> <span class="k">:</span> <span 
class="kt">PredictedResult</span> <span class="o">=</span> <span 
class="o">{</span>
+    <span class="k">val</span> <span class="n">x</span><span 
class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span 
class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span 
class="n">getScores</span><span class="o">(</span><span 
class="n">doc</span><span class="o">)</span>
+    <span class="k">val</span> <span class="n">y</span><span 
class="k">:</span> <span class="o">(</span><span class="kt">Double</span><span 
class="o">,</span> <span class="kt">Double</span><span class="o">)</span> <span 
class="k">=</span> <span class="o">(</span><span class="n">nb</span><span 
class="o">.</span><span class="n">labels</span> <span class="n">zip</span> 
<span class="n">x</span><span class="o">).</span><span 
class="n">maxBy</span><span class="o">(</span><span class="k">_</span><span 
class="o">.</span><span class="n">_2</span><span class="o">)</span>
+    <span class="nc">PredictedResult</span><span class="o">(</span><span 
class="n">pd</span><span class="o">.</span><span 
class="n">categoryMap</span><span class="o">.</span><span 
class="n">getOrElse</span><span class="o">(</span><span class="n">y</span><span 
class="o">.</span><span class="n">_1</span><span class="o">,</span> <span 
class="s">""</span><span class="o">),</span> <span class="n">y</span><span 
class="o">.</span><span class="n">_2</span><span class="o">)</span>
+  <span class="o">}</span>
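+
+  // Illustrative sketch, not part of the template: how zip and maxBy pick
+  // the most probable label, using made-up labels and probabilities. The
+  // names exampleLabels, exampleProbs and best are hypothetical.
+  val exampleLabels: Array[Double] = Array(0.0, 1.0)
+  val exampleProbs: Array[Double] = Array(0.2, 0.8)
+  val best: (Double, Double) = (exampleLabels zip exampleProbs).maxBy(_._2)
+  // best is (1.0, 0.8): label 1.0 with confidence 0.8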
+</pre></td></tr></tbody></table> </div> <h3 
id='logistic-regression-classification' class='header-anchors'>Logistic 
Regression Classification</h3><p>To use the alternative multinomial logistic 
regression algorithm, change your <code>engine.json</code> as follows:</p><div 
class="highlight json"><table style="border-spacing: 0"><tbody><tr><td 
class="gutter gl" style="text-align: right"><pre class="lineno">1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23</pre></td><td class="code"><pre><span class="w">  </span><span 
class="p">{</span><span class="w">
+  </span><span class="s2">"id"</span><span class="p">:</span><span class="w"> 
</span><span class="s2">"default"</span><span class="p">,</span><span class="w">
+  </span><span class="s2">"description"</span><span class="p">:</span><span 
class="w"> </span><span class="s2">"Default settings"</span><span 
class="p">,</span><span class="w">
+  </span><span class="s2">"engineFactory"</span><span class="p">:</span><span 
class="w"> </span><span 
class="s2">"org.template.textclassification.TextClassificationEngine"</span><span
 class="p">,</span><span class="w">
+  </span><span class="s2">"datasource"</span><span class="p">:</span><span 
class="w"> </span><span class="p">{</span><span class="w">
+    </span><span class="s2">"params"</span><span class="p">:</span><span 
class="w"> </span><span class="p">{</span><span class="w">
+      </span><span class="s2">"appName"</span><span class="p">:</span><span 
class="w"> </span><span class="s2">"MyTextApp"</span><span class="w">
+    </span><span class="p">}</span><span class="w">
+  </span><span class="p">},</span><span class="w">
+  </span><span class="s2">"preparator"</span><span class="p">:</span><span 
class="w"> </span><span class="p">{</span><span class="w">
+    </span><span class="s2">"params"</span><span class="p">:</span><span 
class="w"> </span><span class="p">{</span><span class="w">
+      </span><span class="s2">"nGram"</span><span class="p">:</span><span 
class="w"> </span><span class="mi">2</span><span class="w">
+    </span><span class="p">}</span><span class="w">
+  </span><span class="p">},</span><span class="w">
+  </span><span class="s2">"algorithms"</span><span class="p">:</span><span 
class="w"> </span><span class="p">[</span><span class="w">
+    </span><span class="p">{</span><span class="w">
+      </span><span class="s2">"name"</span><span class="p">:</span><span 
class="w"> </span><span class="s2">"lr"</span><span 
class="p">,</span><span class="w">
+      </span><span class="s2">"params"</span><span class="p">:</span><span 
class="w"> </span><span class="p">{</span><span class="w">
+        </span><span class="s2">"regParam"</span><span class="p">:</span><span 
class="w"> </span><span class="mf">0.1</span><span class="w">
+      </span><span class="p">}</span><span class="w">
+    </span><span class="p">}</span><span class="w">
+  </span><span class="p">]</span><span class="w">
+</span><span class="p">}</span><span class="w">
+</span></pre></td></tr></tbody></table> </div> <h2 
id='serving:-delivering-the-final-prediction' class='header-anchors'>Serving: 
Delivering the Final Prediction</h2><p>The serving component is the final stage 
in the engine and, in a sense, the most important: it is where you combine 
the results obtained from the different models you choose to run. The Serving 
class extends the <a 
href="https://predictionio.incubator.apache.org/api/current/#org.apache.predictionio.controller.LServing">LServing</a>
 class and must implement a method called serve. This method takes a query 
and the sequence of predicted results produced by the different algorithms 
implemented in your engine, and combines them to yield a final prediction. It 
is this final prediction that you will receive after sending a 
query.</p><p>For example, you could modify the implementation slightly to 
return class probabilities from a mixture of the models&#39; class probability 
estimates, or use any other technique for combining your results. By default, 
the engine returns the label from the model that predicts with the greatest 
confidence.</p><h2 
id='evaluation:-model-assessment-and-selection' 
class='header-anchors'>Evaluation: Model Assessment and Selection</h2><p> A 
predictive model needs to be evaluated to see how it will generalize to future 
observa


<TRUNCATED>
http://git-wip-us.apache.org/repos/asf/incubator-predictionio-site/blob/3897c890/demo/textclassification/index.html.gz
----------------------------------------------------------------------
diff --git a/demo/textclassification/index.html.gz 
b/demo/textclassification/index.html.gz
new file mode 100644
index 0000000..c73dbb7
Binary files /dev/null and b/demo/textclassification/index.html.gz differ

[36/51] [abbrv] [partial] incubator-predictionio-site git commit: Documentation based on apache/incubator-predictionio#d8ee0c8ffdd27d3f2bbe9560b229bc36ee966f9d
