http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-pmml-model-export.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-pmml-model-export.html b/site/docs/2.1.0/mllib-pmml-model-export.html index 30815e0..3f2fd91 100644 --- a/site/docs/2.1.0/mllib-pmml-model-export.html +++ b/site/docs/2.1.0/mllib-pmml-model-export.html @@ -307,8 +307,8 @@ <ul id="markdown-toc"> - <li><a href="#sparkmllib-supported-models" id="markdown-toc-sparkmllib-supported-models"><code>spark.mllib</code> supported models</a></li> - <li><a href="#examples" id="markdown-toc-examples">Examples</a></li> + <li><a href="#sparkmllib-supported-models"><code>spark.mllib</code> supported models</a></li> + <li><a href="#examples">Examples</a></li> </ul> <h2 id="sparkmllib-supported-models"><code>spark.mllib</code> supported models</h2> @@ -353,32 +353,31 @@ <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.clustering.KMeans"><code>KMeans</code> Scala docs</a> and <a href="api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$"><code>Vectors</code> Scala docs</a> for details on the API.</p> - <p>Here a complete example of building a KMeansModel and print it out in PMML format:</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.clustering.KMeans</span> -<span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> + <p>Here a complete example of building a KMeansModel and print it out in PMML format: +<div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.clustering.KMeans</span> +<span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span></p> -<span class="c1">// Load and parse the data</span> + <p><span class="c1">// Load and parse the data</span> <span class="k">val</span> <span class="n">data</span> <span 
class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"data/mllib/kmeans_data.txt"</span><span class="o">)</span> -<span class="k">val</span> <span class="n">parsedData</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">s</span> <span class="k">=></span> <span class="nc">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="sc">' '</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toDouble</span><span class="o">))).</span><span class="n">cache</span><span class="o">()</span> +<span class="k">val</span> <span class="n">parsedData</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">s</span> <span class="k">=></span> <span class="nc">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="sc">' '</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toDouble</span><span class="o">))).</span><span class="n">cache</span><span class="o">()</span></p> -<span class="c1">// Cluster the data into two classes using KMeans</span> + <p><span class="c1">// Cluster the data into two classes using KMeans</span> <span class="k">val</span> <span class="n">numClusters</span> <span class="k">=</span> <span class="mi">2</span> <span class="k">val</span> <span class="n">numIterations</span> <span class="k">=</span> <span 
class="mi">20</span> -<span class="k">val</span> <span class="n">clusters</span> <span class="k">=</span> <span class="nc">KMeans</span><span class="o">.</span><span class="n">train</span><span class="o">(</span><span class="n">parsedData</span><span class="o">,</span> <span class="n">numClusters</span><span class="o">,</span> <span class="n">numIterations</span><span class="o">)</span> +<span class="k">val</span> <span class="n">clusters</span> <span class="k">=</span> <span class="nc">KMeans</span><span class="o">.</span><span class="n">train</span><span class="o">(</span><span class="n">parsedData</span><span class="o">,</span> <span class="n">numClusters</span><span class="o">,</span> <span class="n">numIterations</span><span class="o">)</span></p> -<span class="c1">// Export to PMML to a String in PMML format</span> -<span class="n">println</span><span class="o">(</span><span class="s">"PMML Model:\n"</span> <span class="o">+</span> <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">)</span> + <p><span class="c1">// Export to PMML to a String in PMML format</span> +<span class="n">println</span><span class="o">(</span><span class="s">"PMML Model:\n"</span> <span class="o">+</span> <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">)</span></p> -<span class="c1">// Export the model to a local file in PMML format</span> -<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="s">"/tmp/kmeans.xml"</span><span class="o">)</span> + <p><span class="c1">// Export the model to a local file in PMML format</span> +<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="s">"/tmp/kmeans.xml"</span><span class="o">)</span></p> -<span class="c1">// Export the model to a directory on a distributed file system in PMML format</span> -<span 
class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">"/tmp/kmeans"</span><span class="o">)</span> + <p><span class="c1">// Export the model to a directory on a distributed file system in PMML format</span> +<span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">"/tmp/kmeans"</span><span class="o">)</span></p> -<span class="c1">// Export the model to the OutputStream in PMML format</span> + <p><span class="c1">// Export the model to the OutputStream in PMML format</span> <span class="n">clusters</span><span class="o">.</span><span class="n">toPMML</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="n">out</span><span class="o">)</span> -</pre></div> - <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala" in the Spark repo.</small></div> +</pre></div><div><small>Find full example code at “examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala” in the Spark repo.</small></div></p> <p>For unsupported models, either you will not find a <code>.toPMML</code> method or an <code>IllegalArgumentException</code> will be thrown.</p>
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-statistics.html ---------------------------------------------------------------------- diff --git a/site/docs/2.1.0/mllib-statistics.html b/site/docs/2.1.0/mllib-statistics.html index 4485ecf..f04924c 100644 --- a/site/docs/2.1.0/mllib-statistics.html +++ b/site/docs/2.1.0/mllib-statistics.html @@ -358,15 +358,15 @@ <ul id="markdown-toc"> - <li><a href="#summary-statistics" id="markdown-toc-summary-statistics">Summary statistics</a></li> - <li><a href="#correlations" id="markdown-toc-correlations">Correlations</a></li> - <li><a href="#stratified-sampling" id="markdown-toc-stratified-sampling">Stratified sampling</a></li> - <li><a href="#hypothesis-testing" id="markdown-toc-hypothesis-testing">Hypothesis testing</a> <ul> - <li><a href="#streaming-significance-testing" id="markdown-toc-streaming-significance-testing">Streaming Significance Testing</a></li> + <li><a href="#summary-statistics">Summary statistics</a></li> + <li><a href="#correlations">Correlations</a></li> + <li><a href="#stratified-sampling">Stratified sampling</a></li> + <li><a href="#hypothesis-testing">Hypothesis testing</a> <ul> + <li><a href="#streaming-significance-testing">Streaming Significance Testing</a></li> </ul> </li> - <li><a href="#random-data-generation" id="markdown-toc-random-data-generation">Random data generation</a></li> - <li><a href="#kernel-density-estimation" id="markdown-toc-kernel-density-estimation">Kernel density estimation</a></li> + <li><a href="#random-data-generation">Random data generation</a></li> + <li><a href="#kernel-density-estimation">Kernel density estimation</a></li> </ul> <p><code>\[ @@ -401,7 +401,7 @@ total count.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> 
<span class="nn">org.apache.spark.mllib.linalg.Vectors</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.</span><span class="o">{</span><span class="nc">MultivariateStatisticalSummary</span><span class="o">,</span> <span class="nc">Statistics</span><span class="o">}</span> <span class="k">val</span> <span class="n">observations</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span> @@ -430,7 +430,7 @@ total count.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html"><code>MultivariateStatisticalSummary</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vector</span><span class="o">;</span> @@ -463,19 +463,19 @@ total count.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> 
<span class="n">Statistics</span> <span class="n">mat</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">200.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">,</span> <span class="mf">300.0</span><span class="p">])]</span> -<span class="p">)</span> <span class="c"># an RDD of Vectors</span> +<span class="p">)</span> <span class="c1"># an RDD of Vectors</span> -<span class="c"># Compute column summary statistics.</span> +<span class="c1"># Compute column summary statistics.</span> <span class="n">summary</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">colStats</span><span class="p">(</span><span class="n">mat</span><span class="p">)</span> -<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span> <span class="c"># a dense vector containing the mean value for each column</span> -<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">variance</span><span class="p">())</span> <span class="c"># column-wise variance</span> -<span class="k">print</span><span 
class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">numNonzeros</span><span class="p">())</span> <span class="c"># number of nonzeros in each column</span> +<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span> <span class="c1"># a dense vector containing the mean value for each column</span> +<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">variance</span><span class="p">())</span> <span class="c1"># column-wise variance</span> +<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="o">.</span><span class="n">numNonzeros</span><span class="p">())</span> <span class="c1"># number of nonzeros in each column</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/summary_statistics_example.py" in the Spark repo.</small></div> </div> @@ -496,7 +496,7 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.Statistics$"><code>Statistics</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span> <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span> @@ -507,7 +507,7 @@ an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor <span class="c1">// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. 
If a</span> <span class="c1">// method is not specified, Pearson's method will be used by default.</span> <span class="k">val</span> <span class="n">correlation</span><span class="k">:</span> <span class="kt">Double</span> <span class="o">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="o">(</span><span class="n">seriesX</span><span class="o">,</span> <span class="n">seriesY</span><span class="o">,</span> <span class="s">"pearson"</span><span class="o">)</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"Correlation is: $correlation"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"Correlation is: </span><span class="si">$correlation</span><span class="s">"</span><span class="o">)</span> <span class="k">val</span> <span class="n">data</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Vector</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span> <span class="nc">Seq</span><span class="o">(</span> @@ -531,7 +531,7 @@ a <code>JavaRDD<Vector></code>, the output will be a <code>Double</code> o <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/Statistics.html"><code>Statistics</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaDoubleRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> @@ -577,23 +577,23 @@ an 
<code>RDD[Vector]</code>, the output will be a <code>Double</code> or the cor <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span> -<span class="n">seriesX</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span> <span class="c"># a series</span> -<span class="c"># seriesY must have the same number of partitions and cardinality as seriesX</span> +<span class="n">seriesX</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span> <span class="c1"># a series</span> +<span class="c1"># seriesY must have the same number of partitions and cardinality as seriesX</span> <span class="n">seriesY</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">11.0</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span> <span 
class="mf">33.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">555.0</span><span class="p">])</span> -<span class="c"># Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.</span> -<span class="c"># If a method is not specified, Pearson's method will be used by default.</span> -<span class="k">print</span><span class="p">(</span><span class="s">"Correlation is: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">seriesX</span><span class="p">,</span> <span class="n">seriesY</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"pearson"</span><span class="p">)))</span> +<span class="c1"># Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.</span> +<span class="c1"># If a method is not specified, Pearson's method will be used by default.</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"Correlation is: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">seriesX</span><span class="p">,</span> <span class="n">seriesY</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">"pearson"</span><span class="p">)))</span> <span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span 
class="mf">100.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">200.0</span><span class="p">]),</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">33.0</span><span class="p">,</span> <span class="mf">366.0</span><span class="p">])]</span> -<span class="p">)</span> <span class="c"># an RDD of Vectors</span> +<span class="p">)</span> <span class="c1"># an RDD of Vectors</span> -<span class="c"># calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.</span> -<span class="c"># If a method is not specified, Pearson's method will be used by default.</span> -<span class="k">print</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">"pearson"</span><span class="p">))</span> +<span class="c1"># calculate the correlation matrix using Pearson's method. 
Use "spearman" for Spearman's method.</span> +<span class="c1"># If a method is not specified, Pearson's method will be used by default.</span> +<span class="k">print</span><span class="p">(</span><span class="n">Statistics</span><span class="o">.</span><span class="n">corr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">"pearson"</span><span class="p">))</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/correlations_example.py" in the Spark repo.</small></div> </div> @@ -621,9 +621,9 @@ fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample size, whereas sampling with replacement requires two additional passes.</p> - <div class="highlight"><pre><span class="c1">// an RDD[(K, V)] of any key value pairs</span> + <div class="highlight"><pre><span></span><span class="c1">// an RDD[(K, V)] of any key value pairs</span> <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span> - <span class="nc">Seq</span><span class="o">((</span><span class="mi">1</span><span class="o">,</span> <span class="-Symbol">'a</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="-Symbol">'b</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">'c</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="-Symbol">'d</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span 
class="o">,</span> <span class="-Symbol">'e</span><span class="err">'</span><span class="o">),</span> <span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="-Symbol">'f</span><span class="err">'</span><span class="o">)))</span> + <span class="nc">Seq</span><span class="o">((</span><span class="mi">1</span><span class="o">,</span> <span class="sc">'a'</span><span class="o">),</span> <span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="sc">'b'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">'c'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">'d'</span><span class="o">),</span> <span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="sc">'e'</span><span class="o">),</span> <span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="sc">'f'</span><span class="o">)))</span> <span class="c1">// specify the exact fraction desired from each key</span> <span class="k">val</span> <span class="n">fractions</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="mi">1</span> <span class="o">-></span> <span class="mf">0.1</span><span class="o">,</span> <span class="mi">2</span> <span class="o">-></span> <span class="mf">0.6</span><span class="o">,</span> <span class="mi">3</span> <span class="o">-></span> <span class="mf">0.3</span><span class="o">)</span> @@ -643,7 +643,7 @@ fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K keys. 
Sampling without replacement requires one additional pass over the RDD to guarantee sample size, whereas sampling with replacement requires two additional passes.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.*</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.*</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">scala.Tuple2</span><span class="o">;</span> @@ -678,10 +678,10 @@ set of keys.</p> <p><em>Note:</em> <code>sampleByKeyExact()</code> is currently not supported in Python.</p> - <div class="highlight"><pre><span class="c"># an RDD of any key value pairs</span> -<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'a'</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'b'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'c'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'d'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">'e'</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s">'f'</span><span class="p">)])</span> + <div class="highlight"><pre><span></span><span class="c1"># an RDD of any key value pairs</span> +<span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">),</span> <span 
class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'d'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'e'</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'f'</span><span class="p">)])</span> -<span class="c"># specify the exact fraction desired from each key as a dictionary</span> +<span class="c1"># specify the exact fraction desired from each key as a dictionary</span> <span class="n">fractions</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">2</span><span class="p">:</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mi">3</span><span class="p">:</span> <span class="mf">0.3</span><span class="p">}</span> <span class="n">approxSample</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sampleByKey</span><span class="p">(</span><span class="bp">False</span><span class="p">,</span> <span class="n">fractions</span><span class="p">)</span> @@ -708,7 +708,7 @@ independence tests.</p> run Pearson’s chi-squared tests. 
The following example demonstrates how to run and interpret hypothesis tests.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg._</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.regression.LabeledPoint</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.test.ChiSqTestResult</span> @@ -722,7 +722,7 @@ hypothesis tests.</p> <span class="k">val</span> <span class="n">goodnessOfFitTestResult</span> <span class="k">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="o">(</span><span class="n">vec</span><span class="o">)</span> <span class="c1">// summary of the test including the p-value, degrees of freedom, test statistic, the method</span> <span class="c1">// used, and the null hypothesis.</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"$goodnessOfFitTestResult\n"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"</span><span class="si">$goodnessOfFitTestResult</span><span class="s">\n"</span><span class="o">)</span> <span class="c1">// a contingency matrix. 
Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))</span> <span class="k">val</span> <span class="n">mat</span><span class="k">:</span> <span class="kt">Matrix</span> <span class="o">=</span> <span class="nc">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="nc">Array</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">,</span> <span class="mf">5.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">4.0</span><span class="o">,</span> <span class="mf">6.0</span><span class="o">))</span> @@ -730,7 +730,7 @@ hypothesis tests.</p> <span class="c1">// conduct Pearson's independence test on the input contingency matrix</span> <span class="k">val</span> <span class="n">independenceTestResult</span> <span class="k">=</span> <span class="nc">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="o">(</span><span class="n">mat</span><span class="o">)</span> <span class="c1">// summary of the test including the p-value, degrees of freedom</span> -<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"$independenceTestResult\n"</span><span class="o">)</span> +<span class="n">println</span><span class="o">(</span><span class="s">s"</span><span class="si">$independenceTestResult</span><span class="s">\n"</span><span class="o">)</span> <span class="k">val</span> <span class="n">obs</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">LabeledPoint</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span> @@ -761,7 +761,7 @@ hypothesis tests.</p> <p>Refer to the <a 
href="api/java/org/apache/spark/mllib/stat/test/ChiSqTestResult.html"><code>ChiSqTestResult</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.linalg.Matrices</span><span class="o">;</span> @@ -793,9 +793,9 @@ hypothesis tests.</p> <span class="c1">// an RDD of labeled points</span> <span class="n">JavaRDD</span><span class="o"><</span><span class="n">LabeledPoint</span><span class="o">></span> <span class="n">obs</span> <span class="o">=</span> <span class="n">jsc</span><span class="o">.</span><span class="na">parallelize</span><span class="o">(</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span> - <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">)),</span> - <span class="k">new</span> <span class="nf">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)),</span> - <span class="k">new</span> <span class="nf">LabeledPoint</span><span 
class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="o">))</span> + <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="mf">3.0</span><span class="o">)),</span> + <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)),</span> + <span class="k">new</span> <span class="n">LabeledPoint</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="o">))</span> <span class="o">)</span> <span class="o">);</span> @@ -820,42 +820,42 @@ hypothesis tests.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span 
class="n">Matrices</span><span class="p">,</span> <span class="n">Vectors</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.linalg</span> <span class="kn">import</span> <span class="n">Matrices</span><span class="p">,</span> <span class="n">Vectors</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.regression</span> <span class="kn">import</span> <span class="n">LabeledPoint</span> <span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span> -<span class="n">vec</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">)</span> <span class="c"># a vector composed of the frequencies of events</span> +<span class="n">vec</span> <span class="o">=</span> <span class="n">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">)</span> <span class="c1"># a vector composed of the frequencies of events</span> -<span class="c"># compute the goodness of fit. If a second vector to test against</span> -<span class="c"># is not supplied as a parameter, the test runs against a uniform distribution.</span> +<span class="c1"># compute the goodness of fit. 
If a second vector to test against</span> +<span class="c1"># is not supplied as a parameter, the test runs against a uniform distribution.</span> <span class="n">goodnessOfFitTestResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">vec</span><span class="p">)</span> -<span class="c"># summary of the test including the p-value, degrees of freedom,</span> -<span class="c"># test statistic, the method used, and the null hypothesis.</span> -<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="se">\n</span><span class="s">"</span> <span class="o">%</span> <span class="n">goodnessOfFitTestResult</span><span class="p">)</span> +<span class="c1"># summary of the test including the p-value, degrees of freedom,</span> +<span class="c1"># test statistic, the method used, and the null hypothesis.</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="se">\n</span><span class="s2">"</span> <span class="o">%</span> <span class="n">goodnessOfFitTestResult</span><span class="p">)</span> -<span class="n">mat</span> <span class="o">=</span> <span class="n">Matrices</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])</span> <span class="c"># a contingency matrix</span> +<span class="n">mat</span> <span class="o">=</span> <span class="n">Matrices</span><span class="o">.</span><span 
class="n">dense</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])</span> <span class="c1"># a contingency matrix</span> -<span class="c"># conduct Pearson's independence test on the input contingency matrix</span> +<span class="c1"># conduct Pearson's independence test on the input contingency matrix</span> <span class="n">independenceTestResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">mat</span><span class="p">)</span> -<span class="c"># summary of the test including the p-value, degrees of freedom,</span> -<span class="c"># test statistic, the method used, and the null hypothesis.</span> -<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="se">\n</span><span class="s">"</span> <span class="o">%</span> <span class="n">independenceTestResult</span><span class="p">)</span> +<span class="c1"># summary of the test including the p-value, degrees of freedom,</span> +<span class="c1"># test statistic, the method used, and the null hypothesis.</span> +<span class="k">print</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="se">\n</span><span class="s2">"</span> <span class="o">%</span> <span class="n">independenceTestResult</span><span class="p">)</span> <span class="n">obs</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span> <span class="p">[</span><span 
class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">]),</span> <span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">]),</span> <span class="n">LabeledPoint</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="p">])]</span> -<span class="p">)</span> <span class="c"># LabeledPoint(label, features)</span> +<span class="p">)</span> <span class="c1"># LabeledPoint(label, features)</span> -<span class="c"># The contingency table is constructed from an RDD of LabeledPoint and used to conduct</span> -<span class="c"># the independence test. Returns an array containing the ChiSquaredTestResult for every feature</span> -<span class="c"># against the label.</span> +<span class="c1"># The contingency table is constructed from an RDD of LabeledPoint and used to conduct</span> +<span class="c1"># the independence test. 
Returns an array containing the ChiSquaredTestResult for every feature</span> +<span class="c1"># against the label.</span> <span class="n">featureTestResults</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">chiSqTest</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">featureTestResults</span><span class="p">):</span> - <span class="k">print</span><span class="p">(</span><span class="s">"Column </span><span class="si">%d</span><span class="s">:</span><span class="se">\n</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">result</span><span class="p">))</span> + <span class="k">print</span><span class="p">(</span><span class="s2">"Column </span><span class="si">%d</span><span class="s2">:</span><span class="se">\n</span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">result</span><span class="p">))</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/hypothesis_testing_example.py" in the Spark repo.</small></div> </div> @@ -879,7 +879,7 @@ and interpret the hypothesis tests.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.Statistics$"><code>Statistics</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span> + <div class="highlight"><pre><span></span><span 
class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span> <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span> <span class="k">val</span> <span class="n">data</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span><span class="nc">Seq</span><span class="o">(</span><span class="mf">0.1</span><span class="o">,</span> <span class="mf">0.15</span><span class="o">,</span> <span class="mf">0.2</span><span class="o">,</span> <span class="mf">0.3</span><span class="o">,</span> <span class="mf">0.25</span><span class="o">))</span> <span class="c1">// an RDD of sample data</span> @@ -906,7 +906,7 @@ and interpret the hypothesis tests.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/Statistics.html"><code>Statistics</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaDoubleRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.Statistics</span><span class="o">;</span> @@ -929,16 +929,16 @@ and interpret the hypothesis tests.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span> + <div class="highlight"><pre><span></span><span 
class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">Statistics</span> <span class="n">parallelData</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">])</span> -<span class="c"># run a KS test for the sample versus a standard normal distribution</span> -<span class="n">testResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">kolmogorovSmirnovTest</span><span class="p">(</span><span class="n">parallelData</span><span class="p">,</span> <span class="s">"norm"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> -<span class="c"># summary of the test including the p-value, test statistic, and null hypothesis</span> -<span class="c"># if our p-value indicates significance, we can reject the null hypothesis</span> -<span class="c"># Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with</span> -<span class="c"># a lambda to calculate the CDF is not made available in the Python API</span> +<span class="c1"># run a KS test for the sample versus a standard normal distribution</span> +<span class="n">testResult</span> <span class="o">=</span> <span class="n">Statistics</span><span class="o">.</span><span class="n">kolmogorovSmirnovTest</span><span class="p">(</span><span class="n">parallelData</span><span class="p">,</span> <span class="s2">"norm"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> +<span class="c1"># summary of the test 
including the p-value, test statistic, and null hypothesis</span> +<span class="c1"># if our p-value indicates significance, we can reject the null hypothesis</span> +<span class="c1"># Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with</span> +<span class="c1"># a lambda to calculate the CDF is not made available in the Python API</span> <span class="k">print</span><span class="p">(</span><span class="n">testResult</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/hypothesis_testing_kolmogorov_smirnov_test_example.py" in the Spark repo.</small></div> @@ -967,7 +967,7 @@ all prior batches.</li> <p><a href="api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest"><code>StreamingTest</code></a> provides streaming hypothesis testing.</p> - <div class="highlight"><pre><span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">textFileStream</span><span class="o">(</span><span class="n">dataDir</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span> + <div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">textFileStream</span><span class="o">(</span><span class="n">dataDir</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)</span> <span 
class="k">match</span> <span class="o">{</span> <span class="k">case</span> <span class="nc">Array</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span> <span class="k">=></span> <span class="nc">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">.</span><span class="n">toBoolean</span><span class="o">,</span> <span class="n">value</span><span class="o">.</span><span class="n">toDouble</span><span class="o">)</span> <span class="o">})</span> @@ -986,7 +986,7 @@ provides streaming hypothesis testing.</p> <p><a href="api/java/index.html#org.apache.spark.mllib.stat.test.StreamingTest"><code>StreamingTest</code></a> provides streaming hypothesis testing.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.BinarySample</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.BinarySample</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.StreamingTest</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.test.StreamingTestResult</span><span class="o">;</span> @@ -997,11 +997,11 @@ provides streaming hypothesis testing.</p> <span class="n">String</span><span class="o">[]</span> <span class="n">ts</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">","</span><span class="o">);</span> <span class="kt">boolean</span> <span class="n">label</span> <span class="o">=</span> <span class="n">Boolean</span><span class="o">.</span><span class="na">parseBoolean</span><span class="o">(</span><span class="n">ts</span><span class="o">[</span><span class="mi">0</span><span class="o">]);</span> <span 
class="kt">double</span> <span class="n">value</span> <span class="o">=</span> <span class="n">Double</span><span class="o">.</span><span class="na">parseDouble</span><span class="o">(</span><span class="n">ts</span><span class="o">[</span><span class="mi">1</span><span class="o">]);</span> - <span class="k">return</span> <span class="k">new</span> <span class="nf">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span> + <span class="k">return</span> <span class="k">new</span> <span class="n">BinarySample</span><span class="o">(</span><span class="n">label</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span> <span class="o">}</span> <span class="o">});</span> -<span class="n">StreamingTest</span> <span class="n">streamingTest</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">StreamingTest</span><span class="o">()</span> +<span class="n">StreamingTest</span> <span class="n">streamingTest</span> <span class="o">=</span> <span class="k">new</span> <span class="n">StreamingTest</span><span class="o">()</span> <span class="o">.</span><span class="na">setPeacePeriod</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span> <span class="o">.</span><span class="na">setWindowSize</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span> <span class="o">.</span><span class="na">setTestMethod</span><span class="o">(</span><span class="s">"welch"</span><span class="o">);</span> @@ -1028,7 +1028,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$"><code>RandomRDDs</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span 
class="nn">org.apache.spark.SparkContext</span> + <figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span></span><span class="k">import</span> <span class="nn">org.apache.spark.SparkContext</span> <span class="k">import</span> <span class="nn">org.apache.spark.mllib.random.RandomRDDs._</span> <span class="k">val</span> <span class="n">sc</span><span class="k">:</span> <span class="kt">SparkContext</span> <span class="o">=</span> <span class="o">...</span> @@ -1037,7 +1037,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p> <span class="c1">// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span> <span class="k">val</span> <span class="n">u</span> <span class="k">=</span> <span class="n">normalRDD</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="mi">1000000L</span><span class="o">,</span> <span class="mi">10</span><span class="o">)</span> <span class="c1">// Apply a transform to get a random double RDD following `N(1, 4)`.</span> -<span class="k">val</span> <span class="n">v</span> <span class="k">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">)</span></code></pre></div> +<span class="k">val</span> <span class="n">v</span> <span class="k">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">)</span></code></pre></figure> </div> @@ -1049,9 +1049,9 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p> 
<p>Refer to the <a href="api/java/org/apache/spark/mllib/random/RandomRDDs"><code>RandomRDDs</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span> + <figure class="highlight"><pre><code class="language-java" data-lang="java"><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaDoubleRDD</span><span class="o">;</span> -<span class="kn">import</span> <span class="nn">static</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">mllib</span><span class="o">.</span><span class="na">random</span><span class="o">.</span><span class="na">RandomRDDs</span><span class="o">.*;</span> +<span class="kn">import static</span> <span class="nn">org.apache.spark.mllib.random.RandomRDDs.*</span><span class="o">;</span> <span class="n">JavaSparkContext</span> <span class="n">jsc</span> <span class="o">=</span> <span class="o">...</span> @@ -1064,7 +1064,7 @@ distribution <code>N(0, 1)</code>, and then map it to <code>N(1, 4)</code>.</p> <span class="kd">public</span> <span class="n">Double</span> <span class="nf">call</span><span class="o">(</span><span class="n">Double</span> <span class="n">x</span><span class="o">)</span> <span class="o">{</span> <span class="k">return</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="o">;</span> <span class="o">}</span> - <span class="o">});</span></code></pre></div> + <span class="o">});</span></code></pre></figure> </div> @@ -1076,15 +1076,15 @@ distribution <code>N(0, 1)</code>, and then map it to 
<code>N(1, 4)</code>.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs"><code>RandomRDDs</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.mllib.random</span> <span class="kn">import</span> <span class="n">RandomRDDs</span> + <figure class="highlight"><pre><code class="language-python" data-lang="python"><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.random</span> <span class="kn">import</span> <span class="n">RandomRDDs</span> -<span class="n">sc</span> <span class="o">=</span> <span class="o">...</span> <span class="c"># SparkContext</span> +<span class="n">sc</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># SparkContext</span> -<span class="c"># Generate a random double RDD that contains 1 million i.i.d. values drawn from the</span> -<span class="c"># standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span> +<span class="c1"># Generate a random double RDD that contains 1 million i.i.d. 
values drawn from the</span> +<span class="c1"># standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.</span> <span class="n">u</span> <span class="o">=</span> <span class="n">RandomRDDs</span><span class="o">.</span><span class="n">normalRDD</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="il">1000000L</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> -<span class="c"># Apply a transform to get a random double RDD following `N(1, 4)`.</span> -<span class="n">v</span> <span class="o">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span></code></pre></div> +<span class="c1"># Apply a transform to get a random double RDD following `N(1, 4)`.</span> +<span class="n">v</span> <span class="o">=</span> <span class="n">u</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mf">1.0</span> <span class="o">+</span> <span class="mf">2.0</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span></code></pre></figure> </div> </div> @@ -1107,7 +1107,7 @@ to do so.</p> <p>Refer to the <a href="api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity"><code>KernelDensity</code> Scala docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span> <span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span> 
<span class="c1">// an RDD of sample data</span> @@ -1132,7 +1132,7 @@ to do so.</p> <p>Refer to the <a href="api/java/org/apache/spark/mllib/stat/KernelDensity.html"><code>KernelDensity</code> Java docs</a> for details on the API.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.mllib.stat.KernelDensity</span><span class="o">;</span> @@ -1143,7 +1143,7 @@ to do so.</p> <span class="c1">// Construct the density estimator with the sample data</span> <span class="c1">// and a standard deviation for the Gaussian kernels</span> -<span class="n">KernelDensity</span> <span class="n">kd</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">KernelDensity</span><span class="o">().</span><span class="na">setSample</span><span class="o">(</span><span class="n">data</span><span class="o">).</span><span class="na">setBandwidth</span><span class="o">(</span><span class="mf">3.0</span><span class="o">);</span> +<span class="n">KernelDensity</span> <span class="n">kd</span> <span class="o">=</span> <span class="k">new</span> <span class="n">KernelDensity</span><span class="o">().</span><span class="na">setSample</span><span class="o">(</span><span class="n">data</span><span class="o">).</span><span class="na">setBandwidth</span><span class="o">(</span><span class="mf">3.0</span><span class="o">);</span> <span class="c1">// Find density estimates for the given values</span> <span class="kt">double</span><span class="o">[]</span> <span class="n">densities</span> <span class="o">=</span> <span class="n">kd</span><span class="o">.</span><span 
class="na">estimate</span><span class="o">(</span><span class="k">new</span> <span class="kt">double</span><span class="o">[]{-</span><span class="mf">1.0</span><span class="o">,</span> <span class="mf">2.0</span><span class="o">,</span> <span class="mf">5.0</span><span class="o">});</span> @@ -1160,18 +1160,18 @@ to do so.</p> <p>Refer to the <a href="api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity"><code>KernelDensity</code> Python docs</a> for more details on the API.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">KernelDensity</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.mllib.stat</span> <span class="kn">import</span> <span class="n">KernelDensity</span> -<span class="c"># an RDD of sample data</span> +<span class="c1"># an RDD of sample data</span> <span class="n">data</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">,</span> <span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">,</span> <span class="mf">9.0</span><span class="p">,</span> <span class="mf">9.0</span><span class="p">])</span> -<span class="c"># Construct the density estimator with the sample data and a standard deviation for the Gaussian</span> -<span class="c"># kernels</span> +<span class="c1"># Construct the density estimator with the 
sample data and a standard deviation for the Gaussian</span> +<span class="c1"># kernels</span> <span class="n">kd</span> <span class="o">=</span> <span class="n">KernelDensity</span><span class="p">()</span> <span class="n">kd</span><span class="o">.</span><span class="n">setSample</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="n">kd</span><span class="o">.</span><span class="n">setBandwidth</span><span class="p">(</span><span class="mf">3.0</span><span class="p">)</span> -<span class="c"># Find density estimates for the given values</span> +<span class="c1"># Find density estimates for the given values</span> <span class="n">densities</span> <span class="o">=</span> <span class="n">kd</span><span class="o">.</span><span class="n">estimate</span><span class="p">([</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/mllib/kernel_density_estimation_example.py" in the Spark repo.</small></div>