spark git commit: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand
Repository: spark
Updated Branches:
  refs/heads/master 93f35569f -> 7d19b6ab7

[SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand

## What changes were proposed in this pull request?

`CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists:

1. Throw an exception if we don't want to ignore it.
2. Do some checks and adjust the schema if we want to append data.
3. Drop the table and create it again if we want to overwrite.

Steps 2 and 3 should be done by the analyzer, so that we can also apply them to Hive tables.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan

Closes #15996 from cloud-fan/append.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d19b6ab
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d19b6ab
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d19b6ab

Branch: refs/heads/master
Commit: 7d19b6ab7d75b95d9eb1c7e1f228d23fd482306e
Parents: 93f3556
Author: Wenchen Fan
Authored: Wed Dec 28 21:50:21 2016 -0800
Committer: Yin Huai
Committed: Wed Dec 28 21:50:21 2016 -0800

--
 .../org/apache/spark/sql/DataFrameWriter.scala   |  78 +
 .../command/createDataSourceTables.scala         | 167 +--
 .../spark/sql/execution/datasources/rules.scala  | 164 +-
 .../sql/hive/MetastoreDataSourcesSuite.scala     |   2 +-
 4 files changed, 213 insertions(+), 198 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/7d19b6ab/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 9c5660a..405f38a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -23,11 +23,12 @@ import scala.collection.JavaConverters._

 import org.apache.spark.annotation.InterfaceStability
 import org.apache.spark.sql.catalyst.TableIdentifier
-import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, UnresolvedRelation}
 import org.apache.spark.sql.catalyst.catalog.{BucketSpec, CatalogTable, CatalogTableType}
 import org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable
 import org.apache.spark.sql.execution.command.DDLUtils
-import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource}
+import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource, LogicalRelation}
+import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.sql.types.StructType

 /**
@@ -364,7 +365,11 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
       throw new AnalysisException("Cannot create hive serde table with saveAsTable API")
     }

-    val tableExists = df.sparkSession.sessionState.catalog.tableExists(tableIdent)
+    val catalog = df.sparkSession.sessionState.catalog
+    val tableExists = catalog.tableExists(tableIdent)
+    val db = tableIdent.database.getOrElse(catalog.getCurrentDatabase)
+    val tableIdentWithDB = tableIdent.copy(database = Some(db))
+    val tableName = tableIdentWithDB.unquotedString

     (tableExists, mode) match {
       case (true, SaveMode.Ignore) =>
@@ -373,39 +378,48 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
       case (true, SaveMode.ErrorIfExists) =>
         throw new AnalysisException(s"Table $tableIdent already exists.")

-      case _ =>
-        val existingTable = if (tableExists) {
-          Some(df.sparkSession.sessionState.catalog.getTableMetadata(tableIdent))
-        } else {
-          None
+      case (true, SaveMode.Overwrite) =>
+        // Get all input data source relations of the query.
+        val srcRelations = df.logicalPlan.collect {
+          case LogicalRelation(src: BaseRelation, _, _) => src
         }
-        val storage = if (tableExists) {
-          existingTable.get.storage
-        } else {
-          DataSource.buildStorageFormatFromOptions(extraOptions.toMap)
-        }
-        val tableType = if (tableExists) {
-          existingTable.get.tableType
-        } else if (storage.locationUri.isDefined) {
-          CatalogTableType.EXTERNAL
-        } else {
-          CatalogTableType.MANAGED
+        EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
+          // Only do the check if the table is a data source table (the relation is a BaseRelation).
+          case LogicalRelation(dest: BaseRelation, _, _) if srcRelations.contains(dest) =>
+            throw new AnalysisException(
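The modes handled above are the ones users reach through `DataFrameWriter.saveAsTable`. A minimal sketch of how each `SaveMode` behaves, assuming a Spark 2.1 session bound to `spark` (the table name `t` is illustrative):

``` scala
import org.apache.spark.sql.SaveMode

val df = spark.range(10).toDF("id")

df.write.saveAsTable("t")                          // default ErrorIfExists: throws if t exists
df.write.mode(SaveMode.Ignore).saveAsTable("t")    // silently does nothing, since t exists
df.write.mode(SaveMode.Append).saveAsTable("t")    // schema is checked/adjusted, then rows appended
df.write.mode(SaveMode.Overwrite).saveAsTable("t") // t is replaced by df's data
```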
spark git commit: [SPARK-16213][SQL] Reduce runtime overhead of a program that creates a primitive array in DataFrame
Repository: spark
Updated Branches:
  refs/heads/master 092c6725b -> 93f35569f

[SPARK-16213][SQL] Reduce runtime overhead of a program that creates a primitive array in DataFrame

## What changes were proposed in this pull request?

This PR reduces the runtime overhead of a program that creates a primitive array in DataFrame, using an approach similar to #15044. Generated code performs a boxing operation in the assignment from InternalRow to an `Object[]` temporary array (at lines 051 and 061 in the generated code without this PR). If we know that the type of the array elements is primitive, we apply the following optimizations:

1. Eliminate a pair of `isNullAt()` and a null assignment
2. Allocate a primitive array instead of `Object[]` (eliminating boxing operations)
3. Create `UnsafeArrayData` by using `UnsafeArrayWriter` to keep a primitive array in a row format, instead of doing non-lightweight operations in the constructor of `GenericArrayData`

The PR also does the same for `CreateMap`. Here are performance results of [DataFrame programs](https://github.com/kiszk/spark/blob/6bf54ec5e227689d69f6db991e9ecbc54e153d0a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/PrimitiveArrayBenchmark.scala#L83-L112), improved by up to 17.9x relative to the code without this PR.

```
Without SPARK-16043
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)

Read a primitive array in DataFrame:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------
Int                                         3805 / 4150        0.0     507308.9      1.0X
Double                                      3593 / 3852        0.0     479056.9      1.1X

With SPARK-16043

Read a primitive array in DataFrame:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------
Int                                          213 /  271        0.0      28387.5      1.0X
Double                                       204 /  223        0.0      27250.9      1.0X
```

Note: #15780 is enabled for these measurements.

A motivating example:

``` scala
val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
df.selectExpr("Array(value + 1.1d, value + 2.2d)").show
```

Generated code without this PR:

``` java
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private UnsafeRow serializefromobject_result;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
/* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
/* 012 */   private Object[] project_values;
/* 013 */   private UnsafeRow project_result;
/* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
/* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter;
/* 017 */
/* 018 */   public GeneratedIterator(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */     inputadapter_input = inputs[0];
/* 026 */     serializefromobject_result = new UnsafeRow(1);
/* 027 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
/* 028 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
/* 029 */     this.project_values = null;
/* 030 */     project_result = new UnsafeRow(1);
/* 031 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32);
/* 032 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1);
/* 033 */     this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 034 */
/* 035 */   }
/* 036 */
/* 037 */   protected void processNext() throws java.io.IOException {
/* 038 */     while (inputadapter_input.hasNext()) {
/* 039 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 040 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 041 */
/* 042 */       final boolean project_isNull = false;
/* 043 */       th
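To inspect whole-stage generated code like the (truncated) listing above for your own query, Spark ships a debug helper; a small sketch, assuming a 2.x spark-shell where `spark` and `sc` are predefined and `debugCodegen()` comes from the `org.apache.spark.sql.execution.debug` implicits:

``` scala
import org.apache.spark.sql.execution.debug._
import spark.implicits._

val df = sc.parallelize(Seq(0.0d, 1.0d), 1).toDF
// Prints the Java source produced by whole-stage codegen for this plan;
// with this patch applied, the Object[] boxing described above disappears.
df.selectExpr("Array(value + 1.1d, value + 2.2d)").debugCodegen()
```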
[01/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
Repository: spark-website
Updated Branches:
  refs/heads/asf-site ecf94f284 -> d2bcf1854

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/submitting-applications.html
--
diff --git a/site/docs/2.1.0/submitting-applications.html b/site/docs/2.1.0/submitting-applications.html
index fc18fa9..0c91739 100644
--- a/site/docs/2.1.0/submitting-applications.html
+++ b/site/docs/2.1.0/submitting-applications.html
@@ -151,14 +151,14 @@ packaging them into a .zip or .egg.
 This script takes care of setting up the classpath with Spark and its dependencies,
 and can support different cluster managers and deploy modes that Spark supports:

-./bin/spark-submit \
+./bin/spark-submit \
   --class <main-class> \
   --master <master-url> \
   --deploy-mode <deploy-mode> \
   --conf <key>=<value> \
-  ... # other options
+  ... # other options
   <application-jar> \
-  [application-arguments]
+  [application-arguments]

 Some of the commonly used options are:

@@ -194,23 +194,23 @@ you can also specify --supervise to make sure that the driver is au
 fails with non-zero exit code. To enumerate all such options available to spark-submit,
 run it with --help. Here are a few examples of common options:

-# Run application locally on 8 cores
+# Run application locally on 8 cores
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
-  --master local[8] \
+  --master local[8] \
   /path/to/examples.jar \
-  100
+  100

-# Run on a Spark standalone cluster in client deploy mode
+# Run on a Spark standalone cluster in client deploy mode
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master spark://207.184.161.138:7077 \
   --executor-memory 20G \
   --total-executor-cores 100 \
   /path/to/examples.jar \
-  1000
+  1000

-# Run on a Spark standalone cluster in cluster deploy mode with supervise
+# Run on a Spark standalone cluster in cluster deploy mode with supervise
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master spark://207.184.161.138:7077 \
@@ -219,26 +219,26 @@ run it with --help. Here are a few examples of common options:
   --executor-memory 20G \
   --total-executor-cores 100 \
   /path/to/examples.jar \
-  1000
+  1000

-# Run on a YARN cluster
-export HADOOP_CONF_DIR=XXX
+# Run on a YARN cluster
+export HADOOP_CONF_DIR=XXX
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master yarn \
-  --deploy-mode cluster \  # can be client for client mode
+  --deploy-mode cluster \  # can be client for client mode
   --executor-memory 20G \
   --num-executors 50 \
   /path/to/examples.jar \
-  1000
+  1000

-# Run a Python application on a Spark standalone cluster
+# Run a Python application on a Spark standalone cluster
 ./bin/spark-submit \
   --master spark://207.184.161.138:7077 \
   examples/src/main/python/pi.py \
-  1000
+  1000

-# Run on a Mesos cluster in cluster deploy mode with supervise
+# Run on a Mesos cluster in cluster deploy mode with supervise
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master mesos://207.184.161.138:7077 \
@@ -247,7 +247,7 @@ run it with --help. Here are a few examples of common options:
   --executor-memory 20G \
   --total-executor-cores 100 \
   http://path/to/examples.jar \
-  1000
+  1000

 Master URLs

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/tuning.html
--
diff --git a/site/docs/2.1.0/tuning.html b/site/docs/2.1.0/tuning.html
index ca4ad9f..33a6316 100644
--- a/site/docs/2.1.0/tuning.html
+++ b/site/docs/2.1.0/tuning.html
@@ -129,23 +129,23 @@

-  Data Serialization
-  Memory Tuning
-    Memory Management Overview
-    Determining Memory Consumption
-    Tuning Data Structures
-    Serialized RDD Storage
-    Garbage Collection Tuning
+  Data Serialization
+  Memory Tuning
+    Memory Management Overview
+    Determining Memory Consumption
+    Tuning Data Structures
+    Serialized RDD Storage
+    Garbage Collection Tuning

-  Other Considerations
-    Level of Parallelism
-    Memory Usage of Reduce Tasks
-    Broadcasting Large Variables
-    Data Locality
+  Other Considerations
+    Level of Parallelism
+    Memory Usage of Reduce Tasks
+    Broadcasting Large Variables
+    Data Locality

-  Summary
+  Summary

 Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked
@@ -194,9 +194,9 @@ in the AllScalaRegistrar from the https://github.com/twitter/chill Twi
 To register your own custom classes with Kryo, use the registerKryoClasses method.

-val conf = new SparkConf().setMaster(...).setAppNa
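The tuning-guide snippet is cut off above; a self-contained sketch of the Kryo registration it describes (the registered classes and master URL here are made up for illustration):

``` scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative classes to register; any serializable application classes work.
case class Point(x: Double, y: Double)
case class Segment(a: Point, b: Point)

val conf = new SparkConf()
  .setMaster("local[2]") // placeholder master URL
  .setAppName("KryoRegistrationExample")
// Registering classes up front lets Kryo write compact class identifiers
// instead of full class names in the serialized stream.
conf.registerKryoClasses(Array(classOf[Point], classOf[Segment]))

val sc = new SparkContext(conf)
```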
[25/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294

This version is built from the docs source code generated by applying https://github.com/apache/spark/pull/16294 to v2.1.0 (so, other changes in branch 2.1 will not affect the docs).

Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/d2bcf185
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/d2bcf185
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/d2bcf185

Branch: refs/heads/asf-site
Commit: d2bcf1854b0e0409495e2f1d3c6beaad923f6e6b
Parents: ecf94f2
Author: Yin Huai
Authored: Wed Dec 28 14:32:43 2016 -0800
Committer: Yin Huai
Committed: Wed Dec 28 14:32:43 2016 -0800

--
 site/docs/2.1.0/building-spark.html             |  46 +-
 site/docs/2.1.0/building-with-maven.html        |  14 +-
 site/docs/2.1.0/configuration.html              |  52 +-
 site/docs/2.1.0/ec2-scripts.html                | 174
 site/docs/2.1.0/graphx-programming-guide.html   | 198 ++---
 site/docs/2.1.0/hadoop-provided.html            |  14 +-
 .../img/structured-streaming-watermark.png      | Bin 0 -> 252000 bytes
 site/docs/2.1.0/img/structured-streaming.pptx   | Bin 1105413 -> 1113902 bytes
 site/docs/2.1.0/job-scheduling.html             |  40 +-
 site/docs/2.1.0/ml-advanced.html                |  10 +-
 .../2.1.0/ml-classification-regression.html     | 838 +-
 site/docs/2.1.0/ml-clustering.html              | 124 +--
 site/docs/2.1.0/ml-collaborative-filtering.html |  56 +-
 site/docs/2.1.0/ml-features.html                | 764
 site/docs/2.1.0/ml-migration-guides.html        |  16 +-
 site/docs/2.1.0/ml-pipeline.html                | 178 ++--
 site/docs/2.1.0/ml-tuning.html                  | 172 ++--
 site/docs/2.1.0/mllib-clustering.html           | 186 ++--
 .../2.1.0/mllib-collaborative-filtering.html    |  48 +-
 site/docs/2.1.0/mllib-data-types.html           | 208 ++---
 site/docs/2.1.0/mllib-decision-tree.html        |  94 +-
 .../2.1.0/mllib-dimensionality-reduction.html   |  28 +-
 site/docs/2.1.0/mllib-ensembles.html            | 182 ++--
 site/docs/2.1.0/mllib-evaluation-metrics.html   | 302 +++
 site/docs/2.1.0/mllib-feature-extraction.html   | 122 +--
 .../2.1.0/mllib-frequent-pattern-mining.html    |  28 +-
 site/docs/2.1.0/mllib-isotonic-regression.html  |  38 +-
 site/docs/2.1.0/mllib-linear-methods.html       | 174 ++--
 site/docs/2.1.0/mllib-naive-bayes.html          |  24 +-
 site/docs/2.1.0/mllib-optimization.html         |  50 +-
 site/docs/2.1.0/mllib-pmml-model-export.html    |  35 +-
 site/docs/2.1.0/mllib-statistics.html           | 180 ++--
 site/docs/2.1.0/programming-guide.html          | 302 +++
 site/docs/2.1.0/quick-start.html                | 166 ++--
 site/docs/2.1.0/running-on-mesos.html           |  52 +-
 site/docs/2.1.0/running-on-yarn.html            |  27 +-
 site/docs/2.1.0/spark-standalone.html           |  30 +-
 site/docs/2.1.0/sparkr.html                     | 145 ++--
 site/docs/2.1.0/sql-programming-guide.html      | 819 +-
 site/docs/2.1.0/storage-openstack-swift.html    |  12 +-
 site/docs/2.1.0/streaming-custom-receivers.html |  26 +-
 .../2.1.0/streaming-kafka-0-10-integration.html |  52 +-
 .../docs/2.1.0/streaming-programming-guide.html | 416 -
 .../structured-streaming-kafka-integration.html |  44 +-
 .../structured-streaming-programming-guide.html | 864 ---
 site/docs/2.1.0/submitting-applications.html    |  36 +-
 site/docs/2.1.0/tuning.html                     |  30 +-
 47 files changed, 3926 insertions(+), 3490 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/building-spark.html
--
diff --git a/site/docs/2.1.0/building-spark.html b/site/docs/2.1.0/building-spark.html
index b3a720c..5c20245 100644
--- a/site/docs/2.1.0/building-spark.html
+++ b/site/docs/2.1.0/building-spark.html
@@ -127,33 +127,33 @@

-  Building Apache Spark
-    Apache Maven
-    Setting up Maven’s Memory Usage
-    build/mvn
+  Building Apache Spark
+    Apache Maven
+    Setting up Maven’s Memory Usage
+    build/mvn

-  Building a Runnable Distribution
-  Specifying the Hadoop Version
-  Building With Hive and JDBC Support
-  Packaging without Hadoop Dependencies for YARN
-  Building with Mesos support
-  Building for Scala 2.10
-  Building submodules individually
-  Continuous Compilation
-  Speeding up Compilation with Zinc
-  Building with SBT
-  Encrypted Filesystems
-  IntelliJ IDEA or Eclipse
+  Bu
[11/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-linear-methods.html
--
diff --git a/site/docs/2.1.0/mllib-linear-methods.html b/site/docs/2.1.0/mllib-linear-methods.html
index 46a1a25..428d778 100644
--- a/site/docs/2.1.0/mllib-linear-methods.html
+++ b/site/docs/2.1.0/mllib-linear-methods.html
@@ -307,23 +307,23 @@

-  Mathematical formulation
-    Loss functions
-    Regularizers
-    Optimization
+  Mathematical formulation
+    Loss functions
+    Regularizers
+    Optimization

-  Classification
-    Linear Support Vector Machines (SVMs)
-    Logistic regression
+  Classification
+    Linear Support Vector Machines (SVMs)
+    Logistic regression

-  Regression
-    Linear least squares, Lasso, and ridge regression
-    Streaming linear regression
+  Regression
+    Linear least squares, Lasso, and ridge regression
+    Streaming linear regression

-  Implementation (developer)
+  Implementation (developer)

 \[
@@ -489,7 +489,7 @@ error.
 Refer to the SVMWithSGD Scala docs and SVMModel Scala docs for details on the API.

-import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
+import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
 import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
 import org.apache.spark.mllib.util.MLUtils

@@ -534,14 +534,14 @@ this way as well. For example, the following code produces an L1 regularized
 variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations.

-import org.apache.spark.mllib.optimization.L1Updater
+import org.apache.spark.mllib.optimization.L1Updater

 val svmAlg = new SVMWithSGD()
 svmAlg.optimizer
   .setNumIterations(200)
   .setRegParam(0.1)
   .setUpdater(new L1Updater)
-val modelL1 = svmAlg.run(training)
+val modelL1 = svmAlg.run(training)

@@ -554,7 +554,7 @@ that is equivalent to the provided example in Scala is given below:
 Refer to the SVMWithSGD Java docs and SVMModel Java docs for details on the API.

-import scala.Tuple2;
+import scala.Tuple2;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.Function;

@@ -591,7 +591,7 @@ that is equivalent to the provided example in Scala is given below:
 // Get evaluation metrics.
 BinaryClassificationMetrics metrics =
-  new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels));
+  new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels));
 double auROC = metrics.areaUnderROC();

 System.out.println("Area under ROC = " + auROC);
@@ -610,14 +610,14 @@ this way as well. For example, the following code produces an L1 regularized
 variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations.

-import org.apache.spark.mllib.optimization.L1Updater;
+import org.apache.spark.mllib.optimization.L1Updater;

-SVMWithSGD svmAlg = new SVMWithSGD();
+SVMWithSGD svmAlg = new SVMWithSGD();
 svmAlg.optimizer()
   .setNumIterations(200)
   .setRegParam(0.1)
-  .setUpdater(new L1Updater());
-final SVMModel modelL1 = svmAlg.run(training.rdd());
+  .setUpdater(new L1Updater());
+final SVMModel modelL1 = svmAlg.run(training.rdd());

 In order to run the above application, follow the instructions provided in the Self-Contained
@@ -632,28 +632,28 @@ and make predictions with the resulting model to compute the training error.
 Refer to the SVMWithSGD Python docs and SVMModel Python docs for more details on the API.

-from pyspark.mllib.classification import SVMWithSGD, SVMModel
+from pyspark.mllib.classification import SVMWithSGD, SVMModel
 from pyspark.mllib.regression import LabeledPoint

-# Load and parse the data
+# Load and parse the data
 def parsePoint(line):
-    values = [float(x) for x in line.split(' ')]
+    values = [float(x) for x in line.split(' ')]
     return LabeledPoint(values[0], values[1:])

-data = sc.textFile("data/mllib/sample_svm_data.txt")
+data = sc.textFile("data/mllib/sample_svm_data.txt")
 parsedData = data.map(parsePoint)

-# Build the model
+# Build the model
 model = SVMWithSGD.train(parsedData, iterations=100)

-# Evaluating the model on training data
+# Evaluating the model on training data
 labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
 trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
-print("Training Error = " + str(trainErr))
+print("Training Error = " + str(trainErr))

-# Save and load model
-model.save(sc, "target/tmp/pythonSVMWithSGDModel")
-sameModel = SVMModel.load(sc, "target/tmp/pythonSVMWithSGDModel")
+# Save and load model
+model.save(sc, "target/tmp/pythonSVMWithSGDModel")
+sameMo
[06/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/sql-programming-guide.html
--
diff --git a/site/docs/2.1.0/sql-programming-guide.html b/site/docs/2.1.0/sql-programming-guide.html
index 17f5981..4534a98 100644
--- a/site/docs/2.1.0/sql-programming-guide.html
+++ b/site/docs/2.1.0/sql-programming-guide.html
@@ -127,95 +127,95 @@

-  Overview
-    SQL
-    Datasets and DataFrames
+  Overview
+    SQL
+    Datasets and DataFrames

-  Getting Started
-    Starting Point: SparkSession
-    Creating DataFrames
-    Untyped Dataset Operations (aka DataFrame Operations)
-    Running SQL Queries Programmatically
-    Global Temporary View
-    Creating Datasets
-    Interoperating with RDDs
-    Inferring the Schema Using Reflection
-    Programmatically Specifying the Schema
+  Getting Started
+    Starting Point: SparkSession
+    Creating DataFrames
+    Untyped Dataset Operations (aka DataFrame Operations)
+    Running SQL Queries Programmatically
+    Global Temporary View
+    Creating Datasets
+    Interoperating with RDDs
+    Inferring the Schema Using Reflection
+    Programmatically Specifying the Schema

-  Data Sources
-    Generic Load/Save Functions
-    Manually Specifying Options
-    Run SQL on files directly
-    Save Modes
-    Saving to Persistent Tables
+  Data Sources
+    Generic Load/Save Functions
+    Manually Specifying Options
+    Run SQL on files directly
+    Save Modes
+    Saving to Persistent Tables

-  Parquet Files
-    Loading Data Programmatically
-    Partition Discovery
-    Schema Merging
-    Hive metastore Parquet table conversion
-    Hive/Parquet Schema Reconciliation
-    Metadata Refreshing
+  Parquet Files
+    Loading Data Programmatically
+    Partition Discovery
+    Schema Merging
+    Hive metastore Parquet table conversion
+    Hive/Parquet Schema Reconciliation
+    Metadata Refreshing

-  Configuration
+  Configuration

-  JSON Datasets
-  Hive Tables
-    Interacting with Different Versions of Hive Metastore
+  JSON Datasets
+  Hive Tables
+    Interacting with Different Versions of Hive Metastore

-  JDBC To Other Databases
-  Troubleshooting
+  JDBC To Other Databases
+  Troubleshooting

-  Performance Tuning
-    Caching Data In Memory
-    Other Configuration Options
+  Performance Tuning
+    Caching Data In Memory
+    Other Configuration Options

-  Distributed SQL Engine
-    Running the Thrift JDBC/ODBC server
-    Running the Spark SQL CLI
+  Distributed SQL Engine
+    Running the Thrift JDBC/ODBC server
+    Running the Spark SQL CLI

-  Migration Guide
-    Upgrading From Spark SQL 2.0 to 2.1
-    Upgrading From Spark SQL 1.6 to 2.0
-    Upgrading From Spark SQL 1.5 to 1.6
-    Upgrading From Spark SQL 1.4 to 1.5
-    Upgrading from Spark SQL 1.3 to 1.4
-      DataFrame data reader/writer interface
-      DataFrame.groupBy retains grouping columns
-      Behavior change on DataFrame.withColumn
+  Migration Guide
+    Upgrading From Spark SQL 2.0 to 2.1
+    Upgrading From Spark SQL 1.6 to 2.0
+    Upgrading From Spark SQL 1.5 to 1.6
+    Upgrading From Spark SQL 1.4 to 1.5
+    Upgrading from Spark SQL 1.3 to 1.4
+      DataFrame data reader/writer interface
+      DataFrame.groupBy retains grouping columns
+      Behavior change on DataFrame.withColumn

-  Upgrading from Spark SQL 1.0-1.2 to 1.3
-    Rename of SchemaRDD to DataFrame
-    Unification of the Java and Scala APIs
-    Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)
-    Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)
-    UDF Registration Moved to sqlContext.udf (Java & Scala)
-    Python DataTypes No Longer Singletons
+  Upgrading from Spark SQL 1.0-1.2 to 1.3
+    Rename of SchemaRDD to DataFrame
+    Unification of the Java and Scala APIs
+    Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)
+    Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)
+    UDF Registration Moved to sqlContext.udf (Java & Scala)
+    Python DataTypes No Longer Sing
[19/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-migration-guides.html
--
diff --git a/site/docs/2.1.0/ml-migration-guides.html b/site/docs/2.1.0/ml-migration-guides.html
index 5e8a913..24dfc31 100644
--- a/site/docs/2.1.0/ml-migration-guides.html
+++ b/site/docs/2.1.0/ml-migration-guides.html
@@ -344,21 +344,21 @@ for converting to mllib.linalg types.

-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.util.MLUtils

 // convert DataFrame columns
 val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
 val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
 // convert a single vector or matrix
 val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
-val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
+val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML

 Refer to the MLUtils Scala docs for further detail.

-import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.mllib.util.MLUtils;
 import org.apache.spark.sql.Dataset;

 // convert DataFrame columns
@@ -366,21 +366,21 @@ for converting to mllib.linalg types.
 Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
 // convert a single vector or matrix
 org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
-org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
+org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();

 Refer to the MLUtils Java docs for further detail.

-from pyspark.mllib.util import MLUtils
+from pyspark.mllib.util import MLUtils

-# convert DataFrame columns
+# convert DataFrame columns
 convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
 convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-# convert a single vector or matrix
+# convert a single vector or matrix
 mlVec = mllibVec.asML()
-mlMat = mllibMat.asML()
+mlMat = mllibMat.asML()

 Refer to the MLUtils Python docs for further detail.

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-pipeline.html
--
diff --git a/site/docs/2.1.0/ml-pipeline.html b/site/docs/2.1.0/ml-pipeline.html
index fe17564..b57afde 100644
--- a/site/docs/2.1.0/ml-pipeline.html
+++ b/site/docs/2.1.0/ml-pipeline.html
@@ -331,27 +331,27 @@ machine learning pipelines.

 Table of Contents

-  Main concepts in Pipelines
-    DataFrame
-    Pipeline components
-      Transformers
-      Estimators
-      Properties of pipeline components
+  Main concepts in Pipelines
+    DataFrame
+    Pipeline components
+      Transformers
+      Estimators
+      Properties of pipeline components

-  Pipeline
-    How it works
-    Details
+  Pipeline
+    How it works
+    Details

-  Parameters
-  Saving and Loading Pipelines
+  Parameters
+  Saving and Loading Pipelines

-  Code examples
-    Example: Estimator, Transformer, and Param
-    Example: Pipeline
-    Model selection (hyperparameter tuning)
+  Code examples
+    Example: Estimator, Transformer, and Param
+    Example: Pipeline
+    Model selection (hyperparameter tuning)

@@ -541,7 +541,7 @@ Refer to the [`Estimator` Scala docs](api/scala/index.html#org.apache.spark.ml.E
 the [`Transformer` Scala docs](api/scala/index.html#org.apache.spark.ml.Transformer) and
 the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) for details on the API.

-import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.classification.LogisticRegression
 import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.ParamMap
 import org.apache.spark.sql.Row

@@ -601,7 +601,7 @@ the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params)
   .select("features", "label", "myProbability", "prediction")
   .collect()
   .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
-    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
+    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
   }

 Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala" in the Spark repo.

@@ -612,7 +612,7 @@ Refer to the [`Estimator` Java docs](api/java/org/apache/spark/ml/Estimator.html
 the [`Transformer` Java docs](api/java/org/apache/spark/ml/Transformer.html) and
 the [`Params` Java docs](api/java/org/apache/spark/ml/param/Params.html) for details on the API.

-import java.util.Arrays;
+import java.util.Arrays;
 import java.util.List;

 import org.apach
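The Estimator/Transformer/Param fragments above come from a larger listing; a minimal self-contained Pipeline sketch in the same spirit (toy data; assumes a `spark` session is in scope):

``` scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Tokenizer and HashingTF are Transformers; LogisticRegression is an Estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

// Fitting the Pipeline runs the stages in order and yields a PipelineModel.
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
```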
[24/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/graphx-programming-guide.html
--
diff --git a/site/docs/2.1.0/graphx-programming-guide.html b/site/docs/2.1.0/graphx-programming-guide.html
index 780d1ab..08b3380 100644
--- a/site/docs/2.1.0/graphx-programming-guide.html
+++ b/site/docs/2.1.0/graphx-programming-guide.html
@@ -129,42 +129,42 @@

-  Overview
-  Getting Started
-  The Property Graph
-    Example Property Graph
+  Overview
+  Getting Started
+  The Property Graph
+    Example Property Graph

-  Graph Operators
-    Summary List of Operators
-    Property Operators
-    Structural Operators
-    Join Operators
-    Neighborhood Aggregation
-      Aggregate Messages (aggregateMessages)
-      Map Reduce Triplets Transition Guide (Legacy)
-      Computing Degree Information
-      Collecting Neighbors
+  Graph Operators
+    Summary List of Operators
+    Property Operators
+    Structural Operators
+    Join Operators
+    Neighborhood Aggregation
+      Aggregate Messages (aggregateMessages)
+      Map Reduce Triplets Transition Guide (Legacy)
+      Computing Degree Information
+      Collecting Neighbors

-    Caching and Uncaching
+    Caching and Uncaching

-  Pregel API
-  Graph Builders
-  Vertex and Edge RDDs
-    VertexRDDs
-    EdgeRDDs
+  Pregel API
+  Graph Builders
+  Vertex and Edge RDDs
+    VertexRDDs
+    EdgeRDDs

-  Optimized Representation
-  Graph Algorithms
-    PageRank
-    Connected Components
-    Triangle Counting
+  Optimized Representation
+  Graph Algorithms
+    PageRank
+    Connected Components
+    Triangle Counting

-  Examples
+  Examples

@@ -188,10 +188,10 @@ operators (e.g., subgraph,
 import org.apache.spark._
+import org.apache.spark._
 import org.apache.spark.graphx._
 // To make some of the examples work we will also need RDD
-import org.apache.spark.rdd.RDD
+import org.apache.spark.rdd.RDD

 If you are not using the Spark shell you will also need a SparkContext. To learn more about
 getting started with Spark refer to the Spark Quick Start Guide.

@@ -222,11 +222,11 @@ arrays. This can be accomplished through inheritance. For example to model users and
 products as a bipartite graph we might do the following:

-class VertexProperty()
+class VertexProperty()
 case class UserProperty(val name: String) extends VertexProperty
 case class ProductProperty(val name: String, val price: Double) extends VertexProperty
 // The graph might then have the type:
-var graph: Graph[VertexProperty, String] = null
+var graph: Graph[VertexProperty, String] = null

 Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Changes to the values or
 structure of the graph are accomplished by producing a new graph with the desired changes. Note
@@ -239,10 +239,10 @@ RDDs, each partition of the graph can be recreated on a different machine in the
 properties for each vertex and edge. As a consequence, the graph class contains members to access
 the vertices and edges of the graph:

-class Graph[VD, ED] {
+class Graph[VD, ED] {
   val vertices: VertexRDD[VD]
   val edges: EdgeRDD[ED]
-}
+}

 The classes VertexRDD[VD] and EdgeRDD[ED] extend and are optimized versions of
 RDD[(VertexId, VD)] and RDD[Edge[ED]] respectively.
 Both VertexRDD[VD] and EdgeRDD[ED] provide additional
@@ -264,7 +264,7 @@ with a string describing the relationships between collaborators:

 The resulting graph would have the type signature:

-val userGraph: Graph[(String, String), String]
+val userGraph: Graph[(String, String), String]

 There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
 generators and these are discussed in more detail in the section on
@@ -272,7 +272,7 @@ generators and these are discussed in more detail in the section on
 Graph object. For example the following code constructs a graph from a collection of RDDs:

-// Assume the SparkContext has already been constructed
+// Assume the SparkContext has already been constructed
 val sc: SparkContext
 // Create an RDD for the vertices
 val users: RDD[(VertexId, (String, String))] =
 // Define a default user in case there are relationship with missing user
 val defaultUser = ("John Doe", "Missing")
 // Build the initial Graph
-val graph = Graph(users, relationships, defaultUser)
+val graph = Graph(users, relationships, defaultUser)

 In the above example we make use of the Edge case class. Edges have a srcId and a
 dstId corresponding to the source and destination vertex identifiers. In addition, the Edge
@@ -294,11 +294,11 @@
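Since the property-graph construction code above lost some lines to the markup stripping, here is a complete sketch of the same example graph (assuming a GraphX-enabled `sc`; the vertex data matches the guide's collaborator example):

``` scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Create an RDD for the vertices: (id, (name, occupation))
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for the edges, labeled with the relationship
val relationships: RDD[Edge[String]] = sc.parallelize(Array(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Default vertex used when an edge references a missing user
val defaultUser = ("John Doe", "Missing")
val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)

// Count all users who are postdocs
println(graph.vertices.filter { case (_, (_, pos)) => pos == "postdoc" }.count())
```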
[08/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/quick-start.html
--
diff --git a/site/docs/2.1.0/quick-start.html b/site/docs/2.1.0/quick-start.html
index 76e67e1..9d5fad7 100644
--- a/site/docs/2.1.0/quick-start.html
+++ b/site/docs/2.1.0/quick-start.html
@@ -129,14 +129,14 @@

-  Interactive Analysis with the Spark Shell
-    Basics
-    More on RDD Operations
-    Caching
+  Interactive Analysis with the Spark Shell
+    Basics
+    More on RDD Operations
+    Caching

-  Self-Contained Applications
-  Where to Go from Here
+  Self-Contained Applications
+  Where to Go from Here

 This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s
@@ -164,26 +164,26 @@ or Python. Start it by running the following in the Spark directory:

 Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD).
 RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
 Let’s make a new RDD from the text of the README file in the Spark source directory:

-scala> val textFile = sc.textFile("README.md")
-textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
+scala> val textFile = sc.textFile("README.md")
+textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25

 RDDs have actions, which return values, and transformations, which return pointers to new RDDs.
 Let’s start with a few actions:

-scala> textFile.count() // Number of items in this RDD
+scala> textFile.count() // Number of items in this RDD
 res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs

 scala> textFile.first() // First item in this RDD
-res1: String = # Apache Spark
+res1: String = # Apache Spark

 Now let’s use a transformation. We will use the filter transformation to return a new RDD with a
 subset of the items in the file.

-scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
-linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
+scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
+linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27

 We can chain together transformations and actions:

-scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
-res3: Long = 15
+scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
+res3: Long = 15

@@ -193,24 +193,24 @@ or Python. Start it by running the following in the Spark directory:

 Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD).
 RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
 Let’s make a new RDD from the text of the README file in the Spark source directory:

->>> textFile = sc.textFile("README.md")
+>>> textFile = sc.textFile("README.md")

 RDDs have actions, which return values, and transformations, which return pointers to new RDDs.
 Let’s start with a few actions:

->>> textFile.count()  # Number of items in this RDD
+>>> textFile.count()  # Number of items in this RDD
 126

->>> textFile.first()  # First item in this RDD
-u'# Apache Spark'
+>>> textFile.first()  # First item in this RDD
+u'# Apache Spark'

 Now let’s use a transformation. We will use the filter transformation to return a new RDD with a
 subset of the items in the file.

->>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
+>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

 We can chain together transformations and actions:

->>> textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?
-15
+>>> textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?
+15

@@ -221,38 +221,38 @@ or Python. Start it by running the following in the Spark directory:

-scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
-res4: Long = 15
+scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
+res4: Long = 15

 This first maps a line to an integer value, creating a new RDD. reduce is called on that
 RDD to find the largest line count. The arguments to map and reduce are Scala function literals
 (closures), and can use any language feature or Scala/Java library. For example, we can easily
 call functions declared elsewhere. We’ll use
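The shell session above continues into the quick start's "Self-Contained Applications" section; a sketch of such an application against the 2.x API (file path and app name are placeholders):

``` scala
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "README.md" // any reachable text file
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
```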
[23/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/hadoop-provided.html
--
diff --git a/site/docs/2.1.0/hadoop-provided.html b/site/docs/2.1.0/hadoop-provided.html
index ff7afb7..9d77cf0 100644
--- a/site/docs/2.1.0/hadoop-provided.html
+++ b/site/docs/2.1.0/hadoop-provided.html
@@ -133,16 +133,16 @@ Apache Hadoop
 For Apache distributions, you can use Hadoop’s ‘classpath’ command. For instance:

-### in conf/spark-env.sh ###
+### in conf/spark-env.sh ###

-# If 'hadoop' binary is on your PATH
-export SPARK_DIST_CLASSPATH=$(hadoop classpath)
+# If 'hadoop' binary is on your PATH
+export SPARK_DIST_CLASSPATH=$(hadoop classpath)

-# With explicit path to 'hadoop' binary
-export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
+# With explicit path to 'hadoop' binary
+export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

-# Passing a Hadoop configuration directory
-export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
+# Passing a Hadoop configuration directory
+export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/img/structured-streaming-watermark.png
--
diff --git a/site/docs/2.1.0/img/structured-streaming-watermark.png b/site/docs/2.1.0/img/structured-streaming-watermark.png
new file mode 100644
index 000..f21fbda
Binary files /dev/null and b/site/docs/2.1.0/img/structured-streaming-watermark.png differ

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/img/structured-streaming.pptx
--
diff --git a/site/docs/2.1.0/img/structured-streaming.pptx b/site/docs/2.1.0/img/structured-streaming.pptx
index 6aad2ed..f5bdfc0 100644
Binary files a/site/docs/2.1.0/img/structured-streaming.pptx and b/site/docs/2.1.0/img/structured-streaming.pptx differ

http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/job-scheduling.html
--
diff --git a/site/docs/2.1.0/job-scheduling.html b/site/docs/2.1.0/job-scheduling.html
index 53161c2..9651607 100644
--- a/site/docs/2.1.0/job-scheduling.html
+++ b/site/docs/2.1.0/job-scheduling.html
@@ -127,24 +127,24 @@

-  Overview
-  Scheduling Across Applications
-    Dynamic Resource Allocation
-      Configuration and Setup
-      Resource Allocation Policy
-        Request Policy
-        Remove Policy
+  Overview
+  Scheduling Across Applications
+    Dynamic Resource Allocation
+      Configuration and Setup
+      Resource Allocation Policy
+        Request Policy
+        Remove Policy

-      Graceful Decommission of Executors
+      Graceful Decommission of Executors

-  Scheduling Within an Application
-    Fair Scheduler Pools
-    Default Behavior of Pools
-    Configuring Pool Properties
+  Scheduling Within an Application
+    Fair Scheduler Pools
+    Default Behavior of Pools
+    Configuring Pool Properties

@@ -321,9 +321,9 @@ mode is best for multi-user settings.
 To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext:

-val conf = new SparkConf().setMaster(...).setAppName(...)
+val conf = new SparkConf().setMaster(...).setAppName(...)
 conf.set("spark.scheduler.mode", "FAIR")
-val sc = new SparkContext(conf)
+val sc = new SparkContext(conf)

 Fair Scheduler Pools

@@ -337,15 +337,15 @@ many concurrent jobs they have instead of giving jobs equal shares. Thi
 adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them.
 This is done as follows:

-// Assuming sc is your SparkContext variable
-sc.setLocalProperty("spark.scheduler.pool", "pool1")
+// Assuming sc is your SparkContext variable
+sc.setLocalProperty("spark.scheduler.pool", "pool1")

 After setting this local property, all jobs submitted within this thread (by calls in this thread
 to RDD.save, count, collect, etc) will use this pool name. The setting is per-thread to make
 it easy to have a thread run multiple jobs on behalf of the same user. If you’d like to clear the
 pool that a thread is associated with, simply call:

-sc.setLocalProperty("spark.scheduler.pool", null)
+sc.setLocalProperty("spark.scheduler.pool", null)

 Default Behavior of Pools

@@ -379,12 +379,12 @@ of the cluster. By default, each pool’s minShare is 0.
 and setting a spark.scheduler.allocation.file property in your SparkConf.

-conf.set("spark.sc
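Putting the fair-scheduler fragments together, a short sketch of enabling FAIR scheduling and routing jobs to a pool (master URL and pool name are illustrative):

``` scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]") // placeholder master
  .setAppName("FairSchedulerExample")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Jobs submitted from this thread now go to pool1...
sc.setLocalProperty("spark.scheduler.pool", "pool1")
sc.parallelize(1 to 100000).count()

// ...until the pool association is cleared.
sc.setLocalProperty("spark.scheduler.pool", null)
```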
[12/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-feature-extraction.html
--
diff --git a/site/docs/2.1.0/mllib-feature-extraction.html b/site/docs/2.1.0/mllib-feature-extraction.html
index 4726b37..f8cd98e 100644
--- a/site/docs/2.1.0/mllib-feature-extraction.html
+++ b/site/docs/2.1.0/mllib-feature-extraction.html
@@ -307,32 +307,32 @@

-  TF-IDF
-  Word2Vec
-    Model
-    Example
+  TF-IDF
+  Word2Vec
+    Model
+    Example

-  StandardScaler
-    Model Fitting
-    Example
+  StandardScaler
+    Model Fitting
+    Example

-  Normalizer
-    Example
+  Normalizer
+    Example

-  ChiSqSelector
-    Model Fitting
-    Example
+  ChiSqSelector
+    Model Fitting
+    Example

-  ElementwiseProduct
-    Example
+  ElementwiseProduct
+    Example

-  PCA
-    Example
+  PCA
+    Example

@@ -390,7 +390,7 @@ Each record could be an iterable of strings or other types.
 Refer to the HashingTF Scala docs for details on the API.

-import org.apache.spark.mllib.feature.{HashingTF, IDF}
+import org.apache.spark.mllib.feature.{HashingTF, IDF}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD

@@ -424,24 +424,24 @@ Each record could be an iterable of strings or other types.
 Refer to the HashingTF Python docs for details on the API.

-from pyspark.mllib.feature import HashingTF, IDF
+from pyspark.mllib.feature import HashingTF, IDF

-# Load documents (one per line).
-documents = sc.textFile("data/mllib/kmeans_data.txt").map(lambda line: line.split(" "))
+# Load documents (one per line).
+documents = sc.textFile("data/mllib/kmeans_data.txt").map(lambda line: line.split(" "))

 hashingTF = HashingTF()
 tf = hashingTF.transform(documents)

-# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:
-# First to compute the IDF vector and second to scale the term frequencies by IDF.
+# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:
+# First to compute the IDF vector and second to scale the term frequencies by IDF.
 tf.cache()
 idf = IDF().fit(tf)
 tfidf = idf.transform(tf)

-# spark.mllib's IDF implementation provides an option for ignoring terms
-# which occur in less than a minimum number of documents.
-# In such cases, the IDF for these terms is set to 0.
-# This feature can be used by passing the minDocFreq value to the IDF constructor.
+# spark.mllib's IDF implementation provides an option for ignoring terms
+# which occur in less than a minimum number of documents.
+# In such cases, the IDF for these terms is set to 0.
+# This feature can be used by passing the minDocFreq value to the IDF constructor.
 idfIgnore = IDF(minDocFreq=2).fit(tf)
 tfidfIgnore = idfIgnore.transform(tf)

@@ -467,7 +467,7 @@ skip-gram model is to maximize the average log-likelihood
 \[ \frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t) \]
-where $k$ is the size of the training window.
+where $k$ is the size of the training window.

 In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are
 vector representations of $w$ as word and context respectively. The probability of correctly
@@ -475,7 +475,7 @@ predicting word $w_i$ given word $w_j$ is determined by the softmax model, which
 \[ p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})} \]
-where $V$ is the vocabulary size.
+where $V$ is the vocabulary size.

 The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
 is proportional to $V$, which can be easily in order of millions.
 To speed up training of Word2Vec,
@@ -488,13 +488,13 @@ $O(\log(V))$
 construct a Word2Vec instance and then fit a Word2VecModel with the input data. Finally,
 we display the top 40 synonyms of the specified word. To run the example, first download
 the http://mattmahoney.net/dc/text8.zip text8 data and extract it to your preferred directory.
-Here we assume the extracted file is text8 and in same directory as you run the spark shell.
+Here we assume the extracted file is text8 and in same directory as you run the spark shell.

 Refer to the Word2Vec Scala docs for details on the API.

-import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
+import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

 val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

@@ -505,7 +505,7 @@ Here we assume the extracted file is text8 and in same directory as
 val synonyms = model.findSynonyms("1", 5)

 for((synonym, cosineSimilarity) <- synon
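A consolidated Scala sketch of the TF-IDF flow described above (two passes over the term frequencies: one to fit the IDF, one to scale), assuming `sc` and the stock `kmeans_data.txt` file:

``` scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Load documents (one per line), each a sequence of terms.
val documents: RDD[Seq[String]] =
  sc.textFile("data/mllib/kmeans_data.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache() // IDF needs a first pass to compute document frequencies...

// ...and a second pass to rescale; minDocFreq = 2 zeroes out rare terms.
val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
```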
[17/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-clustering.html
--
diff --git a/site/docs/2.1.0/mllib-clustering.html b/site/docs/2.1.0/mllib-clustering.html
index 9667606..1b50dab 100644
--- a/site/docs/2.1.0/mllib-clustering.html
+++ b/site/docs/2.1.0/mllib-clustering.html
@@ -366,12 +366,12 @@ models are trained for each cluster).
 The spark.mllib package supports the following models:

-  K-means
-  Gaussian mixture
-  Power iteration clustering (PIC)
-  Latent Dirichlet allocation (LDA)
-  Bisecting k-means
-  Streaming k-means
+  K-means
+  Gaussian mixture
+  Power iteration clustering (PIC)
+  Latent Dirichlet allocation (LDA)
+  Bisecting k-means
+  Streaming k-means

 K-means

@@ -408,7 +408,7 @@ optimal k is usually one where there is an “elbow” in the W
 Refer to the KMeans Scala docs and KMeansModel Scala docs for details on the API.

-import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
+import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
 import org.apache.spark.mllib.linalg.Vectors

 // Load and parse the data
@@ -440,7 +440,7 @@ that is equivalent to the provided example in Scala is given below:
 Refer to the KMeans Java docs and KMeansModel Java docs for details on the API.

-import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.Function;
 import org.apache.spark.mllib.clustering.KMeans;
 import org.apache.spark.mllib.clustering.KMeansModel;

@@ -470,7 +470,7 @@ that is equivalent to the provided example in Scala is given below:
 KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);

 System.out.println("Cluster centers:");
-for (Vector center: clusters.clusterCenters()) {
+for (Vector center: clusters.clusterCenters()) {
   System.out.println(" " + center);
 }
 double cost = clusters.computeCost(parsedData.rdd());

@@ -498,29 +498,29 @@ fact the optimal k is usually one where there is an “elbow”
 Refer to the KMeans Python docs and KMeansModel Python docs for more details on the API.

-from numpy import array
+from numpy import array
 from math import sqrt

 from pyspark.mllib.clustering import KMeans, KMeansModel

-# Load and parse the data
-data = sc.textFile("data/mllib/kmeans_data.txt")
-parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
+# Load and parse the data
+data = sc.textFile("data/mllib/kmeans_data.txt")
+parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

-# Build the model (cluster the data)
-clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
+# Build the model (cluster the data)
+clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

-# Evaluate clustering by computing Within Set Sum of Squared Errors
+# Evaluate clustering by computing Within Set Sum of Squared Errors
 def error(point):
     center = clusters.centers[clusters.predict(point)]
     return sqrt(sum([x**2 for x in (point - center)]))

 WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
-print("Within Set Sum of Squared Error = " + str(WSSSE))
+print("Within Set Sum of Squared Error = " + str(WSSSE))

-# Save and load model
-clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
-sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
+# Save and load model
+clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
+sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")

 Find full example code at "examples/src/main/python/mllib/k_means_example.py" in the Spark repo.

@@ -554,7 +554,7 @@ to the algorithm. We then output the parameters of the mixture model.
 Refer to the GaussianMixture Scala docs and GaussianMixtureModel Scala docs for details on the API.

-import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
+import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
 import org.apache.spark.mllib.linalg.Vectors

 // Load and parse the data
@@ -587,7 +587,7 @@ that is equivalent to the provided example in Scala is given below:
 Refer to the GaussianMixture Java docs and GaussianMixtureModel Java docs for details on the API.

-import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.Function;
 import org.apache.spark.mllib.clustering.GaussianMixture;
 import org.apache.spark.mllib.clustering.GaussianMixtureModel;

@@ -612,7 +612,7 @@ that is equivalent to the provided example in Scala is given below:
 parsedData.cache();

 // Cluster the data into two classes using Gaussian
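The k-means fragments above, gathered into one runnable Scala sketch (cluster count, iteration count, and save path are illustrative):

``` scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse whitespace-separated numeric rows.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes.
val clusters = KMeans.train(parsedData, 2, 20)

// Within Set Sum of Squared Errors, the elbow-criterion quantity from the text.
println(s"WSSSE = ${clusters.computeCost(parsedData)}")

clusters.save(sc, "target/KMeansModel") // placeholder path
val sameModel = KMeansModel.load(sc, "target/KMeansModel")
```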
[13/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-evaluation-metrics.html -- diff --git a/site/docs/2.1.0/mllib-evaluation-metrics.html b/site/docs/2.1.0/mllib-evaluation-metrics.html index 4bc636d..0d5bb3b 100644 --- a/site/docs/2.1.0/mllib-evaluation-metrics.html +++ b/site/docs/2.1.0/mllib-evaluation-metrics.html @@ -307,20 +307,20 @@ - Classification model evaluation - Binary classification - Threshold tuning + Classification model evaluation + Binary classification + Threshold tuning - Multiclass classification - Label based metrics + Multiclass classification + Label based metrics - Multilabel classification - Ranking systems + Multilabel classification + Ranking systems - Regression model evaluation + Regression model evaluation spark.mllib comes with a number of machine learning algorithms that can be used to learn from and make predictions @@ -421,7 +421,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the LogisticRegressionWithLBFGS Scala docs and BinaryClassificationMetrics Scala docs for details on the API. -import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils @@ -453,13 +453,13 @@ data, and evaluate the performance of the algorithm by several binary evaluation // Precision by threshold val precision = metrics.precisionByThreshold precision.foreach { case (t, p) => - println(s"Threshold: $t, Precision: $p") + println(s"Threshold: $t, Precision: $p") } // Recall by threshold val recall = metrics.recallByThreshold recall.foreach { case (t, r) => - println(s"Threshold: $t, Recall: $r") + println(s"Threshold: $t, Recall: $r") } // Precision-Recall Curve @@ -468,13 +468,13 @@ data, and evaluate the performance of the algorithm by several binary evaluation // F-measure val f1Score = metrics.fMeasureByThreshold f1Score.foreach { case (t, f) => - println(s"Threshold: $t, F-score: $f, Beta = 1") + println(s"Threshold: $t, F-score: $f, Beta = 1") } val beta = 0.5 val fScore = metrics.fMeasureByThreshold(beta) f1Score.foreach { case (t, f) => - println(s"Threshold: $t, F-score: $f, Beta = 0.5") + println(s"Threshold: $t, F-score: $f, Beta = 0.5") } // AUPRC @@ -498,7 +498,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the LogisticRegressionModel Java docs and LogisticRegressionWithLBFGS Java docs for details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -518,7 +518,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation JavaRDDtest = splits[1]; // Run training algorithm to build the model. -final LogisticRegressionModel model = new LogisticRegressionWithLBFGS() +final LogisticRegressionModel model = new LogisticRegressionWithLBFGS() .setNumClasses(2) .run(training.rdd()); @@ -538,7 +538,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation // Get evaluation metrics. 
BinaryClassificationMetrics metrics = - new BinaryClassificationMetrics(predictionAndLabels.rdd()); + new BinaryClassificationMetrics(predictionAndLabels.rdd()); // Precision by threshold JavaRDD > precision = metrics.precisionByThreshold().toJavaRDD(); @@ -564,7 +564,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation new Function , Double>() { @Override public Double call(Tuple2
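The listings above stop at printing metrics per threshold; threshold tuning usually ends by picking the cutoff that maximizes the chosen measure. A sketch of that last step in Scala, reusing the metrics and model values from the Scala example above (the selection step itself is an addition, not part of the original listing):

// Choose the threshold that maximizes F-measure (beta = 1) and fix it on the model.
val bestThreshold = metrics.fMeasureByThreshold.collect().maxBy(_._2)._1
model.setThreshold(bestThreshold) // model.predict now returns 0/1 labels at this cutoff
println(s"Selected threshold: $bestThreshold")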
[14/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-decision-tree.html -- diff --git a/site/docs/2.1.0/mllib-decision-tree.html b/site/docs/2.1.0/mllib-decision-tree.html index 1a3d865..991610e 100644 --- a/site/docs/2.1.0/mllib-decision-tree.html +++ b/site/docs/2.1.0/mllib-decision-tree.html @@ -307,23 +307,23 @@ - Basic algorithm - Node impurity and information gain - Split candidates - Stopping rule + Basic algorithm + Node impurity and information gain + Split candidates + Stopping rule - Usage tips - Problem specification parameters - Stopping criteria - Tunable parameters - Caching and checkpointing + Usage tips + Problem specification parameters + Stopping criteria + Tunable parameters + Caching and checkpointing - Scaling - Examples - Classification - Regression + Scaling + Examples + Classification + Regression @@ -548,7 +548,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Scala docs and DecisionTreeModel Scala docs for details on the API. -import org.apache.spark.mllib.tree.DecisionTree +import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel import org.apache.spark.mllib.util.MLUtils @@ -588,7 +588,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Java docs and DecisionTreeModel Java docs for details on the API. -import java.util.HashMap; +import java.util.HashMap; import java.util.Map; import scala.Tuple2; @@ -604,8 +604,8 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.util.MLUtils; -SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTreeClassificationExample"); -JavaSparkContext jsc = new JavaSparkContext(sparkConf); +SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTreeClassificationExample"); +JavaSparkContext jsc = new JavaSparkContext(sparkConf); // Load and parse the data file. String datapath = "data/mllib/sample_libsvm_data.txt"; @@ -657,30 +657,30 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Python docs and DecisionTreeModel Python docs for more details on the API. -from pyspark.mllib.tree import DecisionTree, DecisionTreeModel +from pyspark.mllib.tree import DecisionTree, DecisionTreeModel from pyspark.mllib.util import MLUtils -# Load and parse the data file into an RDD of LabeledPoint. -data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt') -# Split the data into training and test sets (30% held out for testing) +# Load and parse the data file into an RDD of LabeledPoint. +data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt') +# Split the data into training and test sets (30% held out for testing) (trainingData, testData) = data.randomSplit([0.7, 0.3]) -# Train a DecisionTree model. -# Empty categoricalFeaturesInfo indicates all features are continuous. +# Train a DecisionTree model. +# Empty categoricalFeaturesInfo indicates all features are continuous. 
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, - impurity='gini', maxDepth=5, maxBins=32) + impurity='gini', maxDepth=5, maxBins=32) -# Evaluate model on test instances and compute test error +# Evaluate model on test instances and compute test error predictions = model.predict(testData.map(lambda x: x.features)) labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions) testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count()) -print('Test Error = ' + str(testErr)) -print('Learned classification tree model:') +print('Test Error = ' + str(testErr)) +print('Learned classification tree model:') print(model.toDebugString()) -# Save and load model -model.save(sc, "target/tmp/myDecisionTreeClassificationModel") -sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel") +# Save and load model +model.save(sc, "target/tmp/myDecisionTreeClassificationModel") +sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel") Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo. @@ -701,7 +701,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate Refer to the DecisionTree Scala docs
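The regression example referenced at the end of this chunk is cut off. A minimal Scala sketch of the idea, training a regression tree and computing the MSE on a held-out set, assuming the standard spark.mllib DecisionTree API and the repo's sample LIBSVM file:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Empty categoricalFeaturesInfo means all features are continuous; "variance" is the regression impurity.
val model = DecisionTree.trainRegressor(trainingData, Map[Int, Int](), "variance", 5, 32)

// Mean Squared Error over the held-out 30%.
val labelsAndPredictions = testData.map(lp => (lp.label, model.predict(lp.features)))
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"Test Mean Squared Error = $testMSE")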
[20/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-features.html -- diff --git a/site/docs/2.1.0/ml-features.html b/site/docs/2.1.0/ml-features.html index 64463de..a2f102b 100644 --- a/site/docs/2.1.0/ml-features.html +++ b/site/docs/2.1.0/ml-features.html @@ -318,52 +318,52 @@ Table of Contents - Feature Extractors - TF-IDF - Word2Vec - CountVectorizer + Feature Extractors + TF-IDF + Word2Vec + CountVectorizer - Feature Transformers - Tokenizer - StopWordsRemover - $n$-gram - Binarizer - PCA - PolynomialExpansion - Discrete Cosine Transform (DCT) - StringIndexer - IndexToString - OneHotEncoder - VectorIndexer - Interaction - Normalizer - StandardScaler - MinMaxScaler - MaxAbsScaler - Bucketizer - ElementwiseProduct - SQLTransformer - VectorAssembler - QuantileDiscretizer + Feature Transformers + Tokenizer + StopWordsRemover + $n$-gram + Binarizer + PCA + PolynomialExpansion + Discrete Cosine Transform (DCT) + StringIndexer + IndexToString + OneHotEncoder + VectorIndexer + Interaction + Normalizer + StandardScaler + MinMaxScaler + MaxAbsScaler + Bucketizer + ElementwiseProduct + SQLTransformer + VectorAssembler + QuantileDiscretizer - Feature Selectors - VectorSlicer - RFormula - ChiSqSelector + Feature Selectors + VectorSlicer + RFormula + ChiSqSelector - Locality Sensitive Hashing - LSH Operations - Feature Transformation - Approximate Similarity Join - Approximate Nearest Neighbor Search + Locality Sensitive Hashing + LSH Operations + Feature Transformation + Approximate Similarity Join + Approximate Nearest Neighbor Search - LSH Algorithms - Bucketed Random Projection for Euclidean Distance - MinHash for Jaccard Distance + LSH Algorithms + Bucketed Random Projection for Euclidean Distance + MinHash for Jaccard Distance @@ -395,7 +395,7 @@ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible. -TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors. +TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors. HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. @@ -437,7 +437,7 @@ when using text as features. Our feature vectors could then be passed to a lear Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API. -import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} +import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} val sentenceData = spark.createDataFrame(Seq( (0.0, "Hi I heard about Spark"), @@ -468,7 +468,7 @@ the IDF Scala doc Refer to the HashingTF Java docs and the IDF Java docs for more details on the API. 
-import java.util.Arrays; +import java.util.Arrays; import java.util.List; import org.apache.spark.ml.feature.HashingTF; @@ -489,17 +489,17 @@ the IDF Scala doc RowFactory.create(0.0, "I wish Java could use case classes"), RowFactory.create(1.0, "Logistic regression models are neat") ); -StructType schema = new StructType(new StructField[]{ - new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), - new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) +StructType schema = new StructType(new StructField[]{ + new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) }); DatasetsentenceData = spark.createDataFrame(data, schema); -Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words"); +Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words"); Dataset
wordsData = tokenizer.transform(sentenceData); int numFeatures = 20; -HashingTF hashingTF = new HashingTF() +HashingTF hashingTF = new HashingTF() .setInputCol("words") .setOutputCol("rawFeatures") .setNumFeatures(numFeatures); @@ -507,7 +507,7 @@ the IDF Scala doc Dataset
featurizedData = hashingTF.transform(wordsData); // alternatively, CountVectorizer can also be used to get term frequency vectors -IDF idf = new IDF().setInputCol("rawFeatur
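The Java TF-IDF listing above is truncated at the IDF estimator. The remaining step, sketched in Scala under the same column names (featurizedData stands for the HashingTF output from the listing):

import org.apache.spark.ml.feature.IDF

// Fit the inverse-document-frequency model and rescale the raw term frequencies.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()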
[05/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/storage-openstack-swift.html -- diff --git a/site/docs/2.1.0/storage-openstack-swift.html b/site/docs/2.1.0/storage-openstack-swift.html index bbb3446..a20c67f 100644 --- a/site/docs/2.1.0/storage-openstack-swift.html +++ b/site/docs/2.1.0/storage-openstack-swift.html @@ -144,7 +144,7 @@ The current Swift driver requires Swift to use the Keystone authentication method. The Spark application should include the hadoop-openstack dependency. For example, for Maven support, add the following to the pom.xml file: -+ Configuration Parameters Create core-site.xml and place it inside Spark’s conf directory. There are two main categories of parameters that should be configured: declaration of the -Swift driver and the parameters that are required by Keystone. +Swift driver and the parameters that are required by Keystone. -Configuration of Hadoop to use the Swift file system is achieved via +Configuration of Hadoop to use the Swift file system is achieved via Property NameValue @@ -221,7 +221,7 @@ contains a list of Keystone mandatory parameters. PROVIDER can be a For example, assume PROVIDER=SparkTest and Keystone contains user tester with password testing defined for tenant test. Then core-site.xml should include: -... +... - org.apache.hadoop @@ -152,15 +152,15 @@ For example, for Maven support, add the following to the pom.xml fi2.3.0 + Notice that fs.swift.service.PROVIDER.tenant, http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/streaming-custom-receivers.html -- diff --git a/site/docs/2.1.0/streaming-custom-receivers.html b/site/docs/2.1.0/streaming-custom-receivers.html index d31647d..846c797 100644 --- a/site/docs/2.1.0/streaming-custom-receivers.html +++ b/site/docs/2.1.0/streaming-custom-receivers.html @@ -171,7 +171,7 @@ has any error connecting or receiving, the receiver is restarted to make another -class CustomReceiver(host: String, port: Int) +class CustomReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging { def onStart() { @@ -216,12 +216,12 @@ has any error connecting or receiving, the receiver is restarted to make another restart("Error receiving data", t) } } -} +} -public class JavaCustomReceiver extends Receiver+ - fs.swift.impl org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem @@ -257,7 +257,7 @@ defined for tenant test. 
Then core-site.xml should incfs.swift.service.SparkTest.password testing { +public class JavaCustomReceiver extends Receiver { String host = null; int port = -1; @@ -234,7 +234,7 @@ has any error connecting or receiving, the receiver is restarted to make another public void onStart() { // Start the thread that receives data over a connection -new Thread() { +new Thread() { @Override public void run() { receive(); } @@ -253,10 +253,10 @@ has any error connecting or receiving, the receiver is restarted to make another try { // connect to the server - socket = new Socket(host, port); + socket = new Socket(host, port); - BufferedReader reader = new BufferedReader( -new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8)); + BufferedReader reader = new BufferedReader( +new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8)); // Until stopped or connection broken continue reading while (!isStopped() && (userInput = reader.readLine()) != null) { @@ -276,7 +276,7 @@ has any error connecting or receiving, the receiver is restarted to make another restart("Error receiving data", t); } } -} +} @@ -290,20 +290,20 @@ an input DStream using data received by the instance of custom receiver, as show -// Assuming ssc is the StreamingContext +// Assuming ssc is the StreamingContext val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port)) val words = lines.flatMap(_.split(" ")) -... +... The full source code is in the example https://github.com/apache/spark/blob/v2.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala";>CustomReceiver.scala. -// Assuming ssc is the JavaStreamingContext -JavaDStream customReceiverStream = ssc.receiverStream(new JavaCustomReceiver(host, port));
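The core-site.xml fragments above use the standard fs.swift.service.PROVIDER.* property names. The same configuration can also be applied programmatically on the SparkContext's Hadoop configuration; a sketch using the PROVIDER=SparkTest example values from the text (the Keystone auth URL is a placeholder):

// Equivalent to placing these properties in core-site.xml.
val hc = sc.hadoopConfiguration
hc.set("fs.swift.impl", "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
hc.set("fs.swift.service.SparkTest.auth.url", "http://keystone-host:5000/v2.0/tokens") // placeholder
hc.set("fs.swift.service.SparkTest.tenant", "test")
hc.set("fs.swift.service.SparkTest.username", "tester")
hc.set("fs.swift.service.SparkTest.password", "testing")

// Swift paths then take the form swift://CONTAINER.PROVIDER/path.
val rdd = sc.textFile("swift://mycontainer.SparkTest/data.txt")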
[21/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-clustering.html -- diff --git a/site/docs/2.1.0/ml-clustering.html b/site/docs/2.1.0/ml-clustering.html index e225281..df38605 100644 --- a/site/docs/2.1.0/ml-clustering.html +++ b/site/docs/2.1.0/ml-clustering.html @@ -313,21 +313,21 @@ about these algorithms. Table of Contents - K-means - Input Columns - Output Columns - Example + K-means + Input Columns + Output Columns + Example - Latent Dirichlet allocation (LDA) - Bisecting k-means - Example + Latent Dirichlet allocation (LDA) + Bisecting k-means + Example - Gaussian Mixture Model (GMM) - Input Columns - Output Columns - Example + Gaussian Mixture Model (GMM) + Input Columns + Output Columns + Example @@ -391,7 +391,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf";>kmea Refer to the Scala API docs for more details. -import org.apache.spark.ml.clustering.KMeans +import org.apache.spark.ml.clustering.KMeans // Loads data. val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt") @@ -402,7 +402,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf";>kmea // Evaluate clustering by computing Within Set Sum of Squared Errors. val WSSSE = model.computeCost(dataset) -println(s"Within Set Sum of Squared Errors = $WSSSE") +println(s"Within Set Sum of Squared Errors = $WSSSE") // Shows the result. println("Cluster Centers: ") @@ -414,7 +414,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf";>kmea Refer to the Java API docs for more details. -import org.apache.spark.ml.clustering.KMeansModel; +import org.apache.spark.ml.clustering.KMeansModel; import org.apache.spark.ml.clustering.KMeans; import org.apache.spark.ml.linalg.Vector; import org.apache.spark.sql.Dataset; @@ -424,7 +424,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf";>kmea Datasetdataset = spark.read().format("libsvm").load("data/mllib/sample_kmeans_data.txt"); // Trains a k-means model. -KMeans kmeans = new KMeans().setK(2).setSeed(1L); +KMeans kmeans = new KMeans().setK(2).setSeed(1L); KMeansModel model = kmeans.fit(dataset); // Evaluate clustering by computing Within Set Sum of Squared Errors. @@ -434,7 +434,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf";>kmea // Shows the result. Vector[] centers = model.clusterCenters(); System.out.println("Cluster Centers: "); -for (Vector center: centers) { +for (Vector center: centers) { System.out.println(center); } @@ -444,22 +444,22 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf";>kmea Refer to the Python API docs for more details. -from pyspark.ml.clustering import KMeans +from pyspark.ml.clustering import KMeans -# Loads data. -dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt") +# Loads data. +dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt") -# Trains a k-means model. +# Trains a k-means model. kmeans = KMeans().setK(2).setSeed(1) model = kmeans.fit(dataset) -# Evaluate clustering by computing Within Set Sum of Squared Errors. +# Evaluate clustering by computing Within Set Sum of Squared Errors. wssse = model.computeCost(dataset) -print("Within Set Sum of Squared Errors = " + str(wssse)) +print("Within Set Sum of Squared Errors = " + str(wssse)) -# Shows the result. +# Shows the result. 
centers = model.clusterCenters() -print("Cluster Centers: ") +print("Cluster Centers: ") for center in centers: print(center) @@ -470,7 +470,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf";>kmea Refer to the R API docs for more details. -# Fit a k-means model with spark.kmeans +# Fit a k-means model with spark.kmeans irisDF <- suppressWarnings(createDataFrame(iris)) kmeansDF <- irisDF kmeansTestDF <- irisDF @@ -504,7 +504,7 @@ and generates an LDAModel as the base model. Expert users may cast a Refer to the Scala API docs for more details. -import org.apache.spark.ml.clustering.LDA +import org.apache.spark.ml.clustering.LDA // Loads data. val dataset = spark.read.format("libsvm") @@ -516,8 +516,8 @@ and generates an LDAModel as the base model. Expert users may cast a val ll = model.logLikelihood(dataset) val lp = model.logPerplexity(dataset) -println(s"The lower bound on the log likelihood of the entire corpus: $ll") -println(s"The upper bound on perplexity: $lp") +println(s"The lower bound on the log likelihood of the entire corpus: $ll") +println(s"The upper bound on perplexity: $lp") // Describe topics. val topics = model.describeTopics(3) @@ -53
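The LDA example breaks off after describeTopics. The usual closing steps, as a Scala sketch continuing from the model and dataset values above:

// Show the topics by their top-weighted terms, then the per-document topic mixtures.
val topics = model.describeTopics(3)
println("The topics described by their top-weighted terms:")
topics.show(false)

val transformed = model.transform(dataset)
transformed.show(false)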
[02/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/structured-streaming-programming-guide.html -- diff --git a/site/docs/2.1.0/structured-streaming-programming-guide.html b/site/docs/2.1.0/structured-streaming-programming-guide.html index e54c101..3a1ac5f 100644 --- a/site/docs/2.1.0/structured-streaming-programming-guide.html +++ b/site/docs/2.1.0/structured-streaming-programming-guide.html @@ -127,45 +127,50 @@ - Overview - Quick Example - Programming Model - Basic Concepts - Handling Event-time and Late Data - Fault Tolerance Semantics + Overview + Quick Example + Programming Model + Basic Concepts + Handling Event-time and Late Data + Fault Tolerance Semantics - API using Datasets and DataFrames - Creating streaming DataFrames and streaming Datasets - Data Sources - Schema inference and partition of streaming DataFrames/Datasets + API using Datasets and DataFrames + Creating streaming DataFrames and streaming Datasets + Data Sources + Schema inference and partition of streaming DataFrames/Datasets - Operations on streaming DataFrames/Datasets - Basic Operations - Selection, Projection, Aggregation - Window Operations on Event Time - Join Operations - Unsupported Operations + Operations on streaming DataFrames/Datasets + Basic Operations - Selection, Projection, Aggregation + Window Operations on Event Time + Handling Late Data and Watermarking + Join Operations + Unsupported Operations - Starting Streaming Queries - Output Modes - Output Sinks - Using Foreach + Starting Streaming Queries + Output Modes + Output Sinks + Using Foreach - Managing Streaming Queries - Monitoring Streaming Queries - Recovering from Failures with Checkpointing + Managing Streaming Queries + Monitoring Streaming Queries + Interactive APIs + Asynchronous API + + + Recovering from Failures with Checkpointing - Where to go from here + Where to go from here Overview Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming. -Spark 2.0 is the ALPHA RELEASE of Structured Streaming and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, let’s start with a simple example - a streaming word count. +Structured Streaming is still ALPHA in Spark 2.1 and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, let’s start with a simple example - a streaming word count. Quick Example Let’s say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Let’s see how you can express this using Structured Streaming. 
You can see the full code in @@ -175,7 +180,7 @@ And if you http://spark.apache.org/downloads.html";>download Spark, -import org.apache.spark.sql.functions._ +import org.apache.spark.sql.functions._ import org.apache.spark.sql.SparkSession val spark = SparkSession @@ -183,12 +188,12 @@ And if you http://spark.apache.org/downloads.html";>download Spark, .appName("StructuredNetworkWordCount") .getOrCreate() -import spark.implicits._ +import spark.implicits._ -import org.apache.spark.api.java.function.FlatMapFunction; +import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.sql.*; import org.apache.spark.sql.streaming.StreamingQuery; @@ -198,19 +203,19 @@ And if you http://spark.apache.org/downloads.html";>download Spark, SparkSession spark = SparkSession .builder() .appName("JavaStructuredNetworkWordCount") - .getOrCr
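The quick example is split across language tabs and truncated above. For orientation, a compact Scala sketch of the whole streaming word count; the socket host and port are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

// Lines arriving on the socket become an unbounded DataFrame with a single "value" column.
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()

val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// "complete" output mode reprints the full counts table after every trigger.
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()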
[10/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-pmml-model-export.html -- diff --git a/site/docs/2.1.0/mllib-pmml-model-export.html b/site/docs/2.1.0/mllib-pmml-model-export.html index 30815e0..3f2fd91 100644 --- a/site/docs/2.1.0/mllib-pmml-model-export.html +++ b/site/docs/2.1.0/mllib-pmml-model-export.html @@ -307,8 +307,8 @@ - spark.mllib supported models - Examples + spark.mllib supported models + Examples spark.mllib supported models @@ -353,32 +353,31 @@ Refer to the KMeans Scala docs and Vectors Scala docs for details on the API. -Here is a complete example of building a KMeansModel and printing it out in PMML format: -import org.apache.spark.mllib.clustering.KMeans -import org.apache.spark.mllib.linalg.Vectors +Here is a complete example of building a KMeansModel and printing it out in PMML format: +import org.apache.spark.mllib.clustering.KMeans +import org.apache.spark.mllib.linalg.Vectors -// Load and parse the data +// Load and parse the data val data = sc.textFile("data/mllib/kmeans_data.txt") -val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache() +val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache() -// Cluster the data into two classes using KMeans +// Cluster the data into two classes using KMeans val numClusters = 2 val numIterations = 20 -val clusters = KMeans.train(parsedData, numClusters, numIterations) +val clusters = KMeans.train(parsedData, numClusters, numIterations) -// Export to PMML to a String in PMML format -println("PMML Model:\n" + clusters.toPMML) +// Export to PMML to a String in PMML format +println("PMML Model:\n" + clusters.toPMML) -// Export the model to a local file in PMML format -clusters.toPMML("/tmp/kmeans.xml") +// Export the model to a local file in PMML format +clusters.toPMML("/tmp/kmeans.xml") -// Export the model to a directory on a distributed file system in PMML format -clusters.toPMML(sc, "/tmp/kmeans") +// Export the model to a directory on a distributed file system in PMML format +clusters.toPMML(sc, "/tmp/kmeans") -// Export the model to the OutputStream in PMML format +// Export the model to the OutputStream in PMML format clusters.toPMML(System.out) - -Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala" in the Spark repo. +Find full example code at “examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala” in the Spark repo. For unsupported models, either you will not find a .toPMML method or an IllegalArgumentException will be thrown. http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-statistics.html -- diff --git a/site/docs/2.1.0/mllib-statistics.html b/site/docs/2.1.0/mllib-statistics.html index 4485ecf..f04924c 100644 --- a/site/docs/2.1.0/mllib-statistics.html +++ b/site/docs/2.1.0/mllib-statistics.html @@ -358,15 +358,15 @@ - Summary statistics - Correlations - Stratified sampling - Hypothesis testing - Streaming Significance Testing + Summary statistics + Correlations + Stratified sampling + Hypothesis testing + Streaming Significance Testing - Random data generation - Kernel density estimation + Random data generation + Kernel density estimation \[ @@ -401,7 +401,7 @@ total count. Refer to the MultivariateStatisticalSummary Scala docs for details on the API. 
-import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics} val observations = sc.parallelize( @@ -430,7 +430,7 @@ total count. Refer to the MultivariateStatisticalSummary Java docs for details on the API. -import java.util.Arrays; +import java.util.Arrays; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.mllib.linalg.Vector; @@ -463,19 +463,19 @@ total count. Refer to the MultivariateStatisticalSummary Python docs for more details on the API. -import numpy as np +import numpy as np from pyspark.mllib.stat import Statistics mat = sc.parallelize( [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([3.0, 30.0, 300.0])] -) # an RDD of Vectors +) # an RDD of Vectors -# Compute column summary statistics. +# Compute column summary statistics. summary = Statistics.colStats(mat) -print(summary.mean()) # a dense vector containing the mean value for each column -print(summary.variance()) # column-wise variance -print(summary.numNonzero
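The Scala summary-statistics listing above is clipped right after sc.parallelize. A self-contained sketch of the same computation, with the three observation vectors mirroring the Python example:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))

// Column-wise summary statistics over the RDD of vectors.
val summary = Statistics.colStats(observations)
println(summary.mean)        // mean value of each column
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column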
[09/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/programming-guide.html -- diff --git a/site/docs/2.1.0/programming-guide.html b/site/docs/2.1.0/programming-guide.html index 12458af..0e06e86 100644 --- a/site/docs/2.1.0/programming-guide.html +++ b/site/docs/2.1.0/programming-guide.html @@ -129,50 +129,50 @@ - Overview - Linking with Spark - Initializing Spark - Using the Shell + Overview + Linking with Spark + Initializing Spark + Using the Shell - Resilient Distributed Datasets (RDDs) - Parallelized Collections - External Datasets - RDD Operations - Basics - Passing Functions to Spark - Understanding closures - Example - Local vs. cluster modes - Printing elements of an RDD + Resilient Distributed Datasets (RDDs) + Parallelized Collections + External Datasets + RDD Operations + Basics + Passing Functions to Spark + Understanding closures + Example + Local vs. cluster modes + Printing elements of an RDD - Working with Key-Value Pairs - Transformations - Actions - Shuffle operations - Background - Performance Impact + Working with Key-Value Pairs + Transformations + Actions + Shuffle operations + Background + Performance Impact - RDD Persistence - Which Storage Level to Choose? - Removing Data + RDD Persistence + Which Storage Level to Choose? + Removing Data - Shared Variables - Broadcast Variables - Accumulators + Shared Variables + Broadcast Variables + Accumulators - Deploying to a Cluster - Launching Spark jobs from Java / Scala - Unit Testing - Where to Go from Here + Deploying to a Cluster + Launching Spark jobs from Java / Scala + Unit Testing + Where to Go from Here Overview @@ -212,8 +212,8 @@ version =Finally, you need to import some Spark classes into your program. Add the following lines: -import org.apache.spark.SparkContext -import org.apache.spark.SparkConf +import org.apache.spark.SparkContext +import org.apache.spark.SparkConf (Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable essential implicit conversions.) @@ -245,9 +245,9 @@ version = Finally, you need to import some Spark classes into your program. Add the following lines: -import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.api.java.JavaSparkContext import org.apache.spark.api.java.JavaRDD -import org.apache.spark.SparkConf +import org.apache.spark.SparkConf @@ -269,13 +269,13 @@ for common HDFS versions. Finally, you need to import some Spark classes into your program. Add the following line: -from pyspark import SparkContext, SparkConf +from pyspark import SparkContext, SparkConf PySpark requires the same minor version of Python in both driver and workers. It uses the default python version in PATH, you can specify which version of Python you want to use by PYSPARK_PYTHON, for example: -$ PYSPARK_PYTHON=python3.4 bin/pyspark -$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py +$ PYSPARK_PYTHON=python3.4 bin/pyspark +$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py @@ -293,8 +293,8 @@ that contains information about your application. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. -val conf = new SparkConf().setAppName(appName).setMaster(master) -new SparkContext(conf) +val conf = new SparkConf().setAppName(appName).setMaster(master) +new SparkContext(conf) @@ -304,8 +304,8 @@ that contains information about your application. how to access a cluster. 
To create a SparkContext you first need to build a SparkConf object that contains information about your application. -SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); -JavaSparkContext sc = new JavaSparkContext(conf); +SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); +JavaSparkContext sc = new JavaSparkContext(conf); @@ -315,8 +315,8 @@ that contains information about your application. how to access a cluster. To create a SparkContext you first need to build a SparkConf object that
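With the SparkContext constructed as above, the guide moves on to parallelized collections; a short Scala sketch of that first RDD (app name, master and the summing action are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Example").setMaster("local[2]")
val sc = new SparkContext(conf)

// Distribute a local collection as an RDD and run a simple action on it.
val distData = sc.parallelize(Array(1, 2, 3, 4, 5))
println(distData.reduce(_ + _)) // 15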
[22/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-classification-regression.html -- diff --git a/site/docs/2.1.0/ml-classification-regression.html b/site/docs/2.1.0/ml-classification-regression.html index 1e0665b..0b264bb 100644 --- a/site/docs/2.1.0/ml-classification-regression.html +++ b/site/docs/2.1.0/ml-classification-regression.html @@ -329,58 +329,58 @@ discussing specific classes of algorithms, such as linear methods, trees, and en Table of Contents - Classification - Logistic regression - Binomial logistic regression - Multinomial logistic regression + Classification + Logistic regression + Binomial logistic regression + Multinomial logistic regression - Decision tree classifier - Random forest classifier - Gradient-boosted tree classifier - Multilayer perceptron classifier - One-vs-Rest classifier (a.k.a. One-vs-All) - Naive Bayes + Decision tree classifier + Random forest classifier + Gradient-boosted tree classifier + Multilayer perceptron classifier + One-vs-Rest classifier (a.k.a. One-vs-All) + Naive Bayes - Regression - Linear regression - Generalized linear regression - Available families + Regression + Linear regression + Generalized linear regression + Available families - Decision tree regression - Random forest regression - Gradient-boosted tree regression - Survival regression - Isotonic regression - Examples + Decision tree regression + Random forest regression + Gradient-boosted tree regression + Survival regression + Isotonic regression + Examples - Linear methods - Decision trees - Inputs and Outputs - Input Columns - Output Columns + Linear methods + Decision trees + Inputs and Outputs + Input Columns + Output Columns - Tree Ensembles - Random Forests - Inputs and Outputs - Input Columns - Output Columns (Predictions) + Tree Ensembles + Random Forests + Inputs and Outputs + Input Columns + Output Columns (Predictions) - Gradient-Boosted Trees (GBTs) - Inputs and Outputs - Input Columns - Output Columns (Predictions) + Gradient-Boosted Trees (GBTs) + Inputs and Outputs + Input Columns + Output Columns (Predictions) @@ -407,7 +407,7 @@ parameter to select between these two algorithms, or leave it unset and Spark wi Binomial logistic regression -For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib. +For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib. Example @@ -421,7 +421,7 @@ $\alpha$ and regParam corresponds to $\lambda$. More details on parameters can be found in the Scala API documentation. -import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.classification.LogisticRegression // Load training data val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") @@ -435,7 +435,7 @@ $\alpha$ and regParam corresponds to $\lambda$. val lrModel = lr.fit(training) // Print the coefficients and intercept for logistic regression -println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}") +println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}") // We can also use the multinomial family for binary classification val mlr = new LogisticRegression() @@ -447,8 +447,8 @@ $\alpha$ and regParam corresponds to $\lambda$. 
val mlrModel = mlr.fit(training) // Print the coefficients and intercepts for logistic regression with multinomial family -println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}") -println(s"Multinomial intercepts: ${mlrModel.interceptVector}") +println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}") +println(s"Multinomial intercepts: ${mlrModel.interceptVector}") Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala" in the Spark repo. @@ -457,7 +457,7 @@ $\alpha$ and regParam corresponds
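The excerpt fits lrModel and mlrModel but cuts off before any scoring; a one-step Scala sketch using the fitted binomial model from the listing:

// Score the training data; transform adds probability and prediction columns.
val predictions = lrModel.transform(training)
predictions.select("label", "probability", "prediction").show(5)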
[07/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/sparkr.html -- diff --git a/site/docs/2.1.0/sparkr.html b/site/docs/2.1.0/sparkr.html index 0a1a347..e861a01 100644 --- a/site/docs/2.1.0/sparkr.html +++ b/site/docs/2.1.0/sparkr.html @@ -127,53 +127,53 @@ - Overview - SparkDataFrame - Starting Up: SparkSession - Starting Up from RStudio - Creating SparkDataFrames - From local data frames - From Data Sources - From Hive tables + Overview + SparkDataFrame + Starting Up: SparkSession + Starting Up from RStudio + Creating SparkDataFrames + From local data frames + From Data Sources + From Hive tables - SparkDataFrame Operations - Selecting rows, columns - Grouping, Aggregation - Operating on Columns - Applying User-Defined Function - Run a given function on a large dataset using dapply or dapplyCollect - dapply - dapplyCollect + SparkDataFrame Operations + Selecting rows, columns + Grouping, Aggregation + Operating on Columns + Applying User-Defined Function + Run a given function on a large dataset using dapply or dapplyCollect + dapply + dapplyCollect - Run a given function on a large dataset grouping by input column(s) and using gapply or gapplyCollect - gapply - gapplyCollect + Run a given function on a large dataset grouping by input column(s) and using gapply or gapplyCollect + gapply + gapplyCollect - Data type mapping between R and Spark - Run local R functions distributed using spark.lapply - spark.lapply + Data type mapping between R and Spark + Run local R functions distributed using spark.lapply + spark.lapply - Running SQL Queries from SparkR + Running SQL Queries from SparkR - Machine Learning - Algorithms - Model persistence + Machine Learning + Algorithms + Model persistence - R Function Name Conflicts - Migration Guide - Upgrading From SparkR 1.5.x to 1.6.x - Upgrading From SparkR 1.6.x to 2.0 - Upgrading to SparkR 2.1.0 + R Function Name Conflicts + Migration Guide + Upgrading From SparkR 1.5.x to 1.6.x + Upgrading From SparkR 1.6.x to 2.0 + Upgrading to SparkR 2.1.0 @@ -202,7 +202,7 @@ You can create a SparkSession using sparkR.session and -sparkR.session() +sparkR.session() @@ -223,11 +223,11 @@ them, pass them as you would other configuration properties in the sparkCo -if (nchar(Sys.getenv("SPARK_HOME")) < 1) { +if (nchar(Sys.getenv("SPARK_HOME")) < 1) { Sys.setenv(SPARK_HOME = "/home/spark") } library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) -sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g")) +sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g")) @@ -282,14 +282,14 @@ sparkR.session(master = - df <- as.DataFrame(faithful) + df <- as.DataFrame(faithful) # Displays the first part of the SparkDataFrame head(df) ## eruptions waiting ##1 3.600 79 ##2 1.800 54 -##3 3.333 74 +##3 3.333 74 @@ -303,7 +303,7 @@ specifying --packages with spark-submit or spark - sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") + sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0") @@ -311,7 +311,7 @@ specifying --packages with spark-submit or spark - people <- read.df("./examples/src/main/resources/people.json", "json") + people <- read.df("./examples/src/main/resources/people.json", "json") head(people) ## agename ##1 NA Michael @@ -325,7 +325,7 @@ printSchema(people) # |-- name: string (nullable = true) # Similarly, multiple files can be read with read.json -people <- read.json(c("./examples/src/main/resources/people.json", 
"./examples/src/main/resources/people2.json")) +people <- read.json(c("./examples/src/main/resources/people.json", "./examples/src/main/resources/people2.json")) @@ -333,7 +333,7 @@ people <- read.json( - df <- read.df(csvPath, "csv", header = "true", inferSchema = "true",
[18/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-tuning.html -- diff --git a/site/docs/2.1.0/ml-tuning.html b/site/docs/2.1.0/ml-tuning.html index 0c36a98..2246cc2 100644 --- a/site/docs/2.1.0/ml-tuning.html +++ b/site/docs/2.1.0/ml-tuning.html @@ -329,13 +329,13 @@ Built-in Cross-Validation and other tooling allow users to optimize hyperparamet Table of contents - Model selection (a.k.a. hyperparameter tuning) - Cross-Validation - Example: model selection via cross-validation + Model selection (a.k.a. hyperparameter tuning) + Cross-Validation + Example: model selection via cross-validation - Train-Validation Split - Example: model selection via train validation split + Train-Validation Split + Example: model selection via train validation split @@ -396,7 +396,7 @@ However, it is also a well-established method for choosing parameters which is m Refer to the CrossValidator Scala docs (api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) for details on the API. -import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.Pipeline import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator import org.apache.spark.ml.feature.{HashingTF, Tokenizer} @@ -467,7 +467,7 @@ Refer to the CrossValidator Scala docs (api/scala/index.html#org.apache.spark .select("id", "text", "probability", "prediction") .collect() .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) => -println(s"($id, $text) --> prob=$prob, prediction=$prediction") +println(s"($id, $text) --> prob=$prob, prediction=$prediction") } Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala" in the Spark repo. @@ -476,7 +476,7 @@ Refer to the CrossValidator Java docs (api/java/org/apache/spark/ml/tuning/CrossValidator.html) for details on the API. -import java.util.Arrays; +import java.util.Arrays; import org.apache.spark.ml.Pipeline; import org.apache.spark.ml.PipelineStage; @@ -493,38 +493,38 @@ Refer to the CrossValidator Java docs (api/java/org/apache/spark/ml/tuning/Cr // Prepare training documents, which are labeled. 
Datasettraining = spark.createDataFrame(Arrays.asList( - new JavaLabeledDocument(0L, "a b c d e spark", 1.0), - new JavaLabeledDocument(1L, "b d", 0.0), - new JavaLabeledDocument(2L,"spark f g h", 1.0), - new JavaLabeledDocument(3L, "hadoop mapreduce", 0.0), - new JavaLabeledDocument(4L, "b spark who", 1.0), - new JavaLabeledDocument(5L, "g d a y", 0.0), - new JavaLabeledDocument(6L, "spark fly", 1.0), - new JavaLabeledDocument(7L, "was mapreduce", 0.0), - new JavaLabeledDocument(8L, "e spark program", 1.0), - new JavaLabeledDocument(9L, "a e c l", 0.0), - new JavaLabeledDocument(10L, "spark compile", 1.0), - new JavaLabeledDocument(11L, "hadoop software", 0.0) + new JavaLabeledDocument(0L, "a b c d e spark", 1.0), + new JavaLabeledDocument(1L, "b d", 0.0), + new JavaLabeledDocument(2L,"spark f g h", 1.0), + new JavaLabeledDocument(3L, "hadoop mapreduce", 0.0), + new JavaLabeledDocument(4L, "b spark who", 1.0), + new JavaLabeledDocument(5L, "g d a y", 0.0), + new JavaLabeledDocument(6L, "spark fly", 1.0), + new JavaLabeledDocument(7L, "was mapreduce", 0.0), + new JavaLabeledDocument(8L, "e spark program", 1.0), + new JavaLabeledDocument(9L, "a e c l", 0.0), + new JavaLabeledDocument(10L, "spark compile", 1.0), + new JavaLabeledDocument(11L, "hadoop software", 0.0) ), JavaLabeledDocument.class); // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr. -Tokenizer tokenizer = new Tokenizer() +Tokenizer tokenizer = new Tokenizer() .setInputCol("text") .setOutputCol("words"); -HashingTF hashingTF = new HashingTF() +HashingTF hashingTF = new HashingTF() .setNumFeatures(1000) .setInputCol(tokenizer.getOutputCol()) .setOutputCol("features"); -LogisticRegression lr = new LogisticRegression() +LogisticRegression lr = new LogisticRegression() .setMaxIter(10) .setRegParam(0.01); -Pipeline pipeline = new Pipeline() +Pipeline pipeline = new Pipeline() .setStages(new PipelineStage[] {tokenizer, hashingTF, lr}); // We use a ParamGridBuilder to construct a grid of parameters to search over. // With 3 values for hashingTF.numFeatures and 2 values for lr.regParam, // this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from. -ParamMap[] paramGrid = new ParamGridBuilder() +ParamMap[] paramGrid = new ParamGridBuilder() .addGrid(hashingTF.numFeatures(), new int[] {10, 100, 1000}) .addGrid(lr.regParam(), new double[] {0.1, 0.0
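The Java listing stops inside the parameter grid. The remaining wiring, sketched in Scala over the same pipeline, paramGrid and training values from the example (numFolds = 2 matches the example's small scale; 3 or more is typical in practice):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Treat the whole pipeline as an estimator and search the grid by cross-validation.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)

val cvModel = cv.fit(training) // cvModel.transform(...) applies the best pipeline found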
[04/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/streaming-programming-guide.html -- diff --git a/site/docs/2.1.0/streaming-programming-guide.html b/site/docs/2.1.0/streaming-programming-guide.html index 9a87d23..b1ce1e1 100644 --- a/site/docs/2.1.0/streaming-programming-guide.html +++ b/site/docs/2.1.0/streaming-programming-guide.html @@ -129,32 +129,32 @@ - Overview - A Quick Example - Basic Concepts - Linking - Initializing StreamingContext - Discretized Streams (DStreams) - Input DStreams and Receivers - Transformations on DStreams - Output Operations on DStreams - DataFrame and SQL Operations - MLlib Operations - Caching / Persistence - Checkpointing - Accumulators, Broadcast Variables, and Checkpoints - Deploying Applications - Monitoring Applications + Overview + A Quick Example + Basic Concepts + Linking + Initializing StreamingContext + Discretized Streams (DStreams) + Input DStreams and Receivers + Transformations on DStreams + Output Operations on DStreams + DataFrame and SQL Operations + MLlib Operations + Caching / Persistence + Checkpointing + Accumulators, Broadcast Variables, and Checkpoints + Deploying Applications + Monitoring Applications - Performance Tuning - Reducing the Batch Processing Times - Setting the Right Batch Interval - Memory Tuning + Performance Tuning + Reducing the Batch Processing Times + Setting the Right Batch Interval + Memory Tuning - Fault-tolerance Semantics - Where to Go from Here + Fault-tolerance Semantics + Where to Go from Here Overview @@ -209,7 +209,7 @@ conversions from StreamingContext into our environment in order to add useful me other classes we need (like DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second. -import org.apache.spark._ +import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 @@ -217,33 +217,33 @@ main entry point for all streaming functionality. We create a local StreamingCon // The master requires 2 cores to prevent from a starvation scenario. val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") -val ssc = new StreamingContext(conf, Seconds(1)) +val ssc = new StreamingContext(conf, Seconds(1)) Using this context, we can create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. localhost) and port (e.g. ). -// Create a DStream that will connect to hostname:port, like localhost: -val lines = ssc.socketTextStream("localhost", ) +// Create a DStream that will connect to hostname:port, like localhost: +val lines = ssc.socketTextStream("localhost", ) This lines DStream represents the stream of data that will be received from the data server. Each record in this DStream is a line of text. Next, we want to split the lines by space characters into words. -// Split each line into words -val words = lines.flatMap(_.split(" ")) +// Split each line into words +val words = lines.flatMap(_.split(" ")) flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the words DStream. Next, we want to count these words. 
-import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 +import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 // Count each word in each batch val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) // Print the first ten elements of each RDD generated in this DStream to the console -wordCounts.print() +wordCounts.print() The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs, which is then reduced to get the frequency of words in each batch of data. @@ -253,8 +253,8 @@ Finally, wordCounts.print() will print a few of the counts generate will perform when it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call -ssc.start() // Start the computation -ssc.awaitTermination() // Wait for the computation to terminate +ssc.start() // Start the computation +ssc.awaitTermination() // Wait for the computation to terminate
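To try the word count end to end, the guide pairs it with netcat as a simple data server; assuming a local shell, and assuming the port value elided above is 9999:

$ nc -lk 9999                                                  # terminal 1: type lines to stream
$ ./bin/run-example streaming.NetworkWordCount localhost 9999  # terminal 2: run the example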
[03/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/structured-streaming-kafka-integration.html -- diff --git a/site/docs/2.1.0/structured-streaming-kafka-integration.html b/site/docs/2.1.0/structured-streaming-kafka-integration.html index 5ca9259..7d2254f 100644 --- a/site/docs/2.1.0/structured-streaming-kafka-integration.html +++ b/site/docs/2.1.0/structured-streaming-kafka-integration.html @@ -144,7 +144,7 @@ application. See the Deploying subsection below. -// Subscribe to 1 topic +// Subscribe to 1 topic val ds1 = spark .readStream .format("kafka") @@ -172,12 +172,12 @@ application. See the Deploying subsection below. .option("subscribePattern", "topic.*") .load() ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") - .as[(String, String)] + .as[(String, String)] -// Subscribe to 1 topic +// Subscribe to 1 topic Datasetds1 = spark .readStream() .format("kafka") @@ -202,43 +202,43 @@ application. See the Deploying subsection below. .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("subscribePattern", "topic.*") .load() -ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") +ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") -# Subscribe to 1 topic +# Subscribe to 1 topic ds1 = spark .readStream() - .format("kafka") - .option("kafka.bootstrap.servers", "host1:port1,host2:port2") - .option("subscribe", "topic1") + .format("kafka") + .option("kafka.bootstrap.servers", "host1:port1,host2:port2") + .option("subscribe", "topic1") .load() -ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") +ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") -# Subscribe to multiple topics +# Subscribe to multiple topics ds2 = spark .readStream - .format("kafka") - .option("kafka.bootstrap.servers", "host1:port1,host2:port2") - .option("subscribe", "topic1,topic2") + .format("kafka") + .option("kafka.bootstrap.servers", "host1:port1,host2:port2") + .option("subscribe", "topic1,topic2") .load() -ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") +ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") -# Subscribe to a pattern +# Subscribe to a pattern ds3 = spark .readStream() - .format("kafka") - .option("kafka.bootstrap.servers", "host1:port1,host2:port2") - .option("subscribePattern", "topic.*") + .format("kafka") + .option("kafka.bootstrap.servers", "host1:port1,host2:port2") + .option("subscribePattern", "topic.*") .load() -ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") +ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") -Each row in the source has the following schema: - +Each row in the source has the following schema: +
ColumnType key @@ -268,7 +268,7 @@ application. See the Deploying subsection below. timestampType int - +
The following options must be set for the Kafka source.
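Once a streaming Dataset like ds1 above exists, it can be queried like any static one; a short Scala sketch counting records per topic (console sink and complete mode are illustrative choices):

// ds1 carries the Kafka source schema shown above, including a "topic" column.
val counts = ds1.groupBy("topic").count()

val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()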
[16/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-collaborative-filtering.html -- diff --git a/site/docs/2.1.0/mllib-collaborative-filtering.html b/site/docs/2.1.0/mllib-collaborative-filtering.html index e453032..b3f9e08 100644 --- a/site/docs/2.1.0/mllib-collaborative-filtering.html +++ b/site/docs/2.1.0/mllib-collaborative-filtering.html @@ -322,13 +322,13 @@ - Collaborative filtering - Explicit vs. implicit feedback - Scaling of the regularization parameter + Collaborative filtering + Explicit vs. implicit feedback + Scaling of the regularization parameter - Examples - Tutorial + Examples + Tutorial Collaborative filtering @@ -393,7 +393,7 @@ recommendation model by measuring the Mean Squared Error of rating prediction.Refer to the ALS Scala docs for more details on the API. -import org.apache.spark.mllib.recommendation.ALS +import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel import org.apache.spark.mllib.recommendation.Rating @@ -434,9 +434,9 @@ recommendation model by measuring the Mean Squared Error of rating prediction.If the rating matrix is derived from another source of information (i.e. it is inferred from other signals), you can use the trainImplicit method to get better results. -val alpha = 0.01 +val alpha = 0.01 val lambda = 0.01 -val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha) +val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha) @@ -449,7 +449,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the ALS Java docs for more details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -458,8 +458,8 @@ that is equivalent to the provided example in Scala is given below: import org.apache.spark.mllib.recommendation.Rating; import org.apache.spark.SparkConf; -SparkConf conf = new SparkConf().setAppName("Java Collaborative Filtering Example"); -JavaSparkContext jsc = new JavaSparkContext(conf); +SparkConf conf = new SparkConf().setAppName("Java Collaborative Filtering Example"); +JavaSparkContext jsc = new JavaSparkContext(conf); // Load and parse the data String path = "data/mllib/als/test.data"; @@ -468,7 +468,7 @@ that is equivalent to the provided example in Scala is given below: new Function() { public Rating call(String s) { String[] sarray = s.split(","); - return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), + return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), Double.parseDouble(sarray[2])); } } @@ -528,36 +528,36 @@ recommendation by measuring the Mean Squared Error of rating prediction. Refer to the ALS Python docs for more details on the API. 
-from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating +from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating -# Load and parse the data -data = sc.textFile("data/mllib/als/test.data") -ratings = data.map(lambda l: l.split(','))\ +# Load and parse the data +data = sc.textFile("data/mllib/als/test.data") +ratings = data.map(lambda l: l.split(','))\ .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) -# Build the recommendation model using Alternating Least Squares +# Build the recommendation model using Alternating Least Squares rank = 10 numIterations = 10 model = ALS.train(ratings, rank, numIterations) -# Evaluate the model on training data +# Evaluate the model on training data testdata = ratings.map(lambda p: (p[0], p[1])) predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2])) ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions) MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() -print("Mean Squared Error = " + str(MSE)) +print("Mean Squared Error = " + str(MSE)) -# Save and load model -model.save(sc, "target/tmp/myCollaborativeFilter") -sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter") +# Save and load model +model.save(sc, "target/tmp/myCollaborativeFilter") +sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter") Find full example code at "examples/src/main/python/mllib/recommendation_example.py" in the Spark repo. If the rating matrix is derived from another source of information (i.e. it is inferred from other signals), you can use the trainImplicit method to get better results. -# Build the recommendation model using Alternating Least Squares based on implicit ratings -model = ALS.
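Beyond MSE, a fitted factorization model is typically used to produce top-N recommendations; a one-call Scala sketch on the MatrixFactorizationModel from the Scala example above (user id 1 is illustrative):

// Top five product recommendations for one user, returned as Rating(user, product, score).
val topRecs = model.recommendProducts(1, 5)
topRecs.foreach(println)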
[15/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-data-types.html -- diff --git a/site/docs/2.1.0/mllib-data-types.html b/site/docs/2.1.0/mllib-data-types.html index 546d921..f7b5358 100644 --- a/site/docs/2.1.0/mllib-data-types.html +++ b/site/docs/2.1.0/mllib-data-types.html @@ -307,14 +307,14 @@ - Local vector - Labeled point - Local matrix - Distributed matrix - RowMatrix - IndexedRowMatrix - CoordinateMatrix - BlockMatrix + Local vector + Labeled point + Local matrix + Distributed matrix + RowMatrix + IndexedRowMatrix + CoordinateMatrix + BlockMatrix @@ -347,14 +347,14 @@ using the factory methods implemented in Refer to the Vector Scala docs and Vectors Scala docs for details on the API. -import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.linalg.{Vector, Vectors} // Create a dense vector (1.0, 0.0, 3.0). val dv: Vector = Vectors.dense(1.0, 0.0, 3.0) // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries. val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)) // Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries. -val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))) +val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))) Note: Scala imports scala.collection.immutable.Vector by default, so you have to import @@ -373,13 +373,13 @@ using the factory methods implemented in Refer to the Vector Java docs and Vectors Java docs for details on the API. -import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.linalg.Vectors; // Create a dense vector (1.0, 0.0, 3.0). Vector dv = Vectors.dense(1.0, 0.0, 3.0); // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries. -Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); +Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); @@ -405,18 +405,18 @@ in Ve Refer to the Vectors Python docs for more details on the API. -import numpy as np +import numpy as np import scipy.sparse as sps from pyspark.mllib.linalg import Vectors -# Use a NumPy array as a dense vector. +# Use a NumPy array as a dense vector. dv1 = np.array([1.0, 0.0, 3.0]) -# Use a Python list as a dense vector. +# Use a Python list as a dense vector. dv2 = [1.0, 0.0, 3.0] -# Create a SparseVector. +# Create a SparseVector. sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0]) -# Use a single-column SciPy csc_matrix as a sparse vector. -sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)) +# Use a single-column SciPy csc_matrix as a sparse vector. +sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)) @@ -438,14 +438,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Scala docs for details on the API. -import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint // Create a labeled point with a positive label and a dense feature vector. val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)) // Create a labeled point with a negative label and a sparse feature vector. 
-val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) +val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) @@ -456,14 +456,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Java docs for details on the API. -import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.mllib.regression.LabeledPoint; // Create a labeled point with a positive label and a dense feature vector. -LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)); +LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)); // Create a labeled point with a negative label and a sparse feature vector. -LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0})); +LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0})); @@ -474,14 +474,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Python docs for more details on the API. -from pyspark.mllib.linalg import SparseVector +from pyspark.mllib.linalg im
spark git commit: [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status
Repository: spark Updated Branches: refs/heads/branch-2.1 7197a7bc7 -> 80d583bd0 [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status ## What changes were proposed in this pull request? - Extended the Window operation section with code snippet and explanation of watermarking - Extended the Output Mode section with a table showing the compatibility between query type and output mode - Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status - Updated API changes in the StreamingQueryListener example TODO - [x] Figure showing the watermarking ## How was this patch tested? N/A ## Screenshots ### Section: Windowed Aggregation with Event Time https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png ![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png) https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png ### Section: Output Modes ![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png) ### Section: Monitoring ![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png) ![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png) Author: Tathagata Das Closes #16294 from tdas/SPARK-18669. (cherry picked from commit 092c6725bf039bf33299b53791e1958c4ea3f6aa) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80d583bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80d583bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80d583bd Branch: refs/heads/branch-2.1 Commit: 80d583bd09de54890cddfcc0c6fd807d7200ea75 Parents: 7197a7b Author: Tathagata Das Authored: Wed Dec 28 12:11:25 2016 -0800 Committer: Shixiong Zhu Committed: Wed Dec 28 12:11:49 2016 -0800 -- docs/img/structured-streaming-watermark.png| Bin 0 -> 252000 bytes docs/img/structured-streaming.pptx | Bin 1105413 -> 1113902 bytes docs/structured-streaming-programming-guide.md | 460 +++- 3 files changed, 353 insertions(+), 107 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80d583bd/docs/img/structured-streaming-watermark.png -- diff --git a/docs/img/structured-streaming-watermark.png b/docs/img/structured-streaming-watermark.png new file mode 100644 index 000..f21fbda Binary files /dev/null and b/docs/img/structured-streaming-watermark.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/80d583bd/docs/img/structured-streaming.pptx -- diff --git a/docs/img/structured-streaming.pptx b/docs/img/structured-streaming.pptx index 6aad2ed..f5bdfc0 100644 Binary files a/docs/img/structured-streaming.pptx and b/docs/img/structured-streaming.pptx differ http://git-wip-us.apache.org/repos/asf/spark/blob/80d583bd/docs/structured-streaming-programming-guide.md -- diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 77b66b3..3b7d0c4 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -10,7 +10,7 @@ title: Structured Streaming Programming Guide # Overview Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. 
You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the [Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, *Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.* -**Spark 2.0 is the ALPHA RELEASE of Structured Streaming** and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, let's start with a simple example - a streaming word count.
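As a sketch of the monitoring and listener APIs the rewritten section covers (assuming `spark` is a `SparkSession` and `query` an already-started `StreamingQuery`; both names are illustrative):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Pull-based monitoring: inspect the query's current status and the
// metrics of its most recent trigger (both carry the JSON shown in the guide).
println(query.status)        // e.g. whether a trigger is active or the query is waiting for data
println(query.lastProgress)  // input/processing rates, durations, event-time watermark

// Push-based monitoring: a listener receives every progress report asynchronously.
spark.streams.addListener(new StreamingQueryListener {
  def onQueryStarted(event: QueryStartedEvent): Unit = ()
  def onQueryProgress(event: QueryProgressEvent): Unit = println(event.progress.json)
  def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```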
spark git commit: [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status
Repository: spark Updated Branches: refs/heads/master 6a475ae46 -> 092c6725b [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status ## What changes were proposed in this pull request? - Extended the Window operation section with code snippet and explanation of watermarking - Extended the Output Mode section with a table showing the compatibility between query type and output mode - Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status - Updated API changes in the StreamingQueryListener example TODO - [x] Figure showing the watermarking ## How was this patch tested? N/A ## Screenshots ### Section: Windowed Aggregation with Event Time https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png ![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png) https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png ### Section: Output Modes ![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png) ### Section: Monitoring ![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png) ![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png) Author: Tathagata Das Closes #16294 from tdas/SPARK-18669. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/092c6725 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/092c6725 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/092c6725 Branch: refs/heads/master Commit: 092c6725bf039bf33299b53791e1958c4ea3f6aa Parents: 6a475ae Author: Tathagata Das Authored: Wed Dec 28 12:11:25 2016 -0800 Committer: Shixiong Zhu Committed: Wed Dec 28 12:11:25 2016 -0800 -- docs/img/structured-streaming-watermark.png| Bin 0 -> 252000 bytes docs/img/structured-streaming.pptx | Bin 1105413 -> 1113902 bytes docs/structured-streaming-programming-guide.md | 460 +++- 3 files changed, 353 insertions(+), 107 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/092c6725/docs/img/structured-streaming-watermark.png -- diff --git a/docs/img/structured-streaming-watermark.png b/docs/img/structured-streaming-watermark.png new file mode 100644 index 000..f21fbda Binary files /dev/null and b/docs/img/structured-streaming-watermark.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/092c6725/docs/img/structured-streaming.pptx -- diff --git a/docs/img/structured-streaming.pptx b/docs/img/structured-streaming.pptx index 6aad2ed..f5bdfc0 100644 Binary files a/docs/img/structured-streaming.pptx and b/docs/img/structured-streaming.pptx differ http://git-wip-us.apache.org/repos/asf/spark/blob/092c6725/docs/structured-streaming-programming-guide.md -- diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 77b66b3..3b7d0c4 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -10,7 +10,7 @@ title: Structured Streaming Programming Guide # Overview Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. 
You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the [Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, *Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.* -**Spark 2.0 is the ALPHA RELEASE of Structured Streaming** and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, let's start with a simple example - a streaming word count. +**Structured Streaming is still ALPHA in Spark 2.1** and the APIs are still experimental.
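The watermarking addition this commit documents can be sketched as follows (a minimal example in the Spark 2.1 API; the `words` streaming Dataset and its column names are illustrative):

```scala
import spark.implicits._                        // for the $"..." column syntax
import org.apache.spark.sql.functions.window

// Count words per 10-minute event-time window, sliding every 5 minutes.
// The watermark lets the engine drop aggregation state for windows more than
// 10 minutes behind the max event time seen, bounding how much late data it tracks.
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")     // event-time column and lateness threshold
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
```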
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.1.0 [created] cd0a08361
spark git commit: [SPARK-17772][ML][TEST] Add test functions for ML sample weights
Repository: spark Updated Branches: refs/heads/master d7bce3bd3 -> 6a475ae46 [SPARK-17772][ML][TEST] Add test functions for ML sample weights ## What changes were proposed in this pull request? More and more ML algos are accepting sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to `MLTestingUtils` that can be reused by various algorithms accepting sample weights. Up to now, there seem to be a few tests that have been implemented in common: * Check that oversampling is the same as giving the instances sample weights proportional to the number of samples * Check that outliers with tiny sample weights do not affect the algorithm's performance This patch adds an additional test: * Check that algorithms are invariant to constant scaling of the sample weights, i.e. uniform sample weights with `w_i = 1.0` are effectively the same as uniform sample weights with `w_i = 1` or `w_i = 0.0001` The instances of these tests occurred in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed/modified to use the new helper methods. These helper functions will be of use when [SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented. ## How was this patch tested? This patch only involves modifying test suites. ## Other notes Both IsotonicRegression and GeneralizedLinearRegression also extend `HasWeightCol`. I did not modify these test suites because it will make this patch easier to review, and because they did not duplicate the same tests as the three suites that were modified. If we want to change them later, we can create a JIRA for it now, but it's open for debate. Author: sethah Closes #15721 from sethah/SPARK-17772. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a475ae4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a475ae4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a475ae4 Branch: refs/heads/master Commit: 6a475ae466a7ce28d507244bf6db91be06ed81ef Parents: d7bce3b Author: sethah Authored: Wed Dec 28 07:01:14 2016 -0800 Committer: Yanbo Liang Committed: Wed Dec 28 07:01:14 2016 -0800 -- .../LogisticRegressionSuite.scala | 60 +++--- .../ml/classification/NaiveBayesSuite.scala | 81 + .../ml/regression/LinearRegressionSuite.scala | 120 ++- .../apache/spark/ml/util/MLTestingUtils.scala | 111 +++-- 4 files changed, 154 insertions(+), 218 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6a475ae4/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala index f8bcbee..1308210 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala @@ -1836,52 +1836,24 @@ class LogisticRegressionSuite .forall(x => x(0) >= x(1))) } - test("binary logistic regression with weighted data") { -val numClasses = 2 -val numPoints = 40 -val outlierData = MLTestingUtils.genClassificationInstancesWithWeightedOutliers(spark, - numClasses, numPoints) -val testData = Array.tabulate[LabeledPoint](numClasses) { i => - LabeledPoint(i.toDouble, Vectors.dense(i.toDouble)) -}.toSeq.toDF() -val lr = new 
LogisticRegression().setFamily("binomial").setWeightCol("weight") -val model = lr.fit(outlierData) -val results = model.transform(testData).select("label", "prediction").collect() - -// check that the predictions are the one to one mapping -results.foreach { case Row(label: Double, pred: Double) => - assert(label === pred) + test("logistic regression with sample weights") { +def modelEquals(m1: LogisticRegressionModel, m2: LogisticRegressionModel): Unit = { + assert(m1.coefficientMatrix ~== m2.coefficientMatrix absTol 0.05) + assert(m1.interceptVector ~== m2.interceptVector absTol 0.05) } -val (overSampledData, weightedData) = - MLTestingUtils.genEquivalentOversampledAndWeightedInstances(outlierData, "label", "features", -42L) -val weightedModel = lr.fit(weightedData) -val overSampledModel = lr.setWeightCol("").fit(overSampledData) -assert(weightedModel.coefficientMatrix ~== overSampledModel.coefficientMatrix relTol 0.01) - } - - test("multinomial logistic regression with weighted data") { -val numClasses = 5 -val numPoints =
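Condensed, the shared oversampling-vs-weighting check works as sketched below (the helper and operator names follow the diff; `data`, `seed`, and the use of `LogisticRegression` as the estimator are illustrative):

```scala
// An instance with weight w must influence the fit like w copies of itself:
// fit one model on weighted data and one on equivalently oversampled data,
// then require the coefficients to agree within tolerance
// (~== and relTol come from Spark's ML testing utilities).
val (overSampled, weighted) = MLTestingUtils.genEquivalentOversampledAndWeightedInstances(
  data, "label", "features", seed)
val weightedModel = new LogisticRegression().setWeightCol("weight").fit(weighted)
val overSampledModel = new LogisticRegression().fit(overSampled)
assert(weightedModel.coefficientMatrix ~== overSampledModel.coefficientMatrix relTol 0.01)
```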
spark git commit: [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags
Repository: spark Updated Branches: refs/heads/branch-2.0 f124d35e2 -> 5ed2f1c11 [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags ## What changes were proposed in this pull request? This adds back a direct dependency on Scala library classes from spark-tags because its Scala annotations need them. ## How was this patch tested? Existing tests Author: Sean Owen Closes #16418 from srowen/SPARK-18993. (cherry picked from commit d7bce3bd31ec193274718042dc017706989d7563) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ed2f1c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ed2f1c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ed2f1c1 Branch: refs/heads/branch-2.0 Commit: 5ed2f1c1118762191918e08936376113e6324935 Parents: f124d35 Author: Sean Owen Authored: Wed Dec 28 12:17:33 2016 + Committer: Sean Owen Committed: Wed Dec 28 12:17:53 2016 + -- common/tags/pom.xml | 8 1 file changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5ed2f1c1/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index ecdc9e0..ca68ee2 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -35,6 +35,14 @@ tags
+  <dependencies>
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>
+      <version>${scala.version}</version>
+    </dependency>
+  </dependencies>
+
target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes
spark git commit: [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags
Repository: spark Updated Branches: refs/heads/branch-2.1 ac7107fe7 -> 7197a7bc7 [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags ## What changes were proposed in this pull request? This adds back a direct dependency on Scala library classes from spark-tags because its Scala annotations need them. ## How was this patch tested? Existing tests Author: Sean Owen Closes #16418 from srowen/SPARK-18993. (cherry picked from commit d7bce3bd31ec193274718042dc017706989d7563) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7197a7bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7197a7bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7197a7bc Branch: refs/heads/branch-2.1 Commit: 7197a7bc7061e2908b6430f494dba378378d5d02 Parents: ac7107f Author: Sean Owen Authored: Wed Dec 28 12:17:33 2016 + Committer: Sean Owen Committed: Wed Dec 28 12:17:41 2016 + -- common/tags/pom.xml | 8 1 file changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7197a7bc/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index 0778ee3..ad29848 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -34,6 +34,14 @@ tags
+  <dependencies>
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>
+      <version>${scala.version}</version>
+    </dependency>
+  </dependencies>
+
target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes
spark git commit: [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags
Repository: spark Updated Branches: refs/heads/master 2a5f52a71 -> d7bce3bd3 [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags ## What changes were proposed in this pull request? This adds back a direct dependency on Scala library classes from spark-tags because its Scala annotations need them. ## How was this patch tested? Existing tests Author: Sean Owen Closes #16418 from srowen/SPARK-18993. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7bce3bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7bce3bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7bce3bd Branch: refs/heads/master Commit: d7bce3bd31ec193274718042dc017706989d7563 Parents: 2a5f52a Author: Sean Owen Authored: Wed Dec 28 12:17:33 2016 + Committer: Sean Owen Committed: Wed Dec 28 12:17:33 2016 + -- common/tags/pom.xml | 8 1 file changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d7bce3bd/common/tags/pom.xml -- diff --git a/common/tags/pom.xml b/common/tags/pom.xml index 09f6fa1..9345dc8 100644 --- a/common/tags/pom.xml +++ b/common/tags/pom.xml @@ -34,6 +34,14 @@ tags
+  <dependencies>
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>
+      <version>${scala.version}</version>
+    </dependency>
+  </dependencies>
+
target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes
spark git commit: [MINOR][DOC] Fix doc of ForeachWriter to use writeStream
Repository: spark Updated Branches: refs/heads/master 76e9bd748 -> 2a5f52a71 [MINOR][DOC] Fix doc of ForeachWriter to use writeStream ## What changes were proposed in this pull request? Fix the documentation of `ForeachWriter` to use `writeStream` instead of `write` for a streaming dataset. ## How was this patch tested? Docs only. Author: Carson Wang Closes #16419 from carsonwang/FixDoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a5f52a7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a5f52a7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a5f52a7 Branch: refs/heads/master Commit: 2a5f52a7146abc05bf70e65eb2267cd869ac4789 Parents: 76e9bd7 Author: Carson Wang Authored: Wed Dec 28 12:12:44 2016 + Committer: Sean Owen Committed: Wed Dec 28 12:12:44 2016 + -- sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2a5f52a7/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala index b94ad59..372ec26 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala @@ -28,7 +28,7 @@ import org.apache.spark.annotation.{Experimental, InterfaceStability} * * Scala example: * {{{ - * datasetOfString.write.foreach(new ForeachWriter[String] { + * datasetOfString.writeStream.foreach(new ForeachWriter[String] { * * def open(partitionId: Long, version: Long): Boolean = { * // open connection @@ -46,7 +46,7 @@ import org.apache.spark.annotation.{Experimental, InterfaceStability} * * Java example: * {{{ - * datasetOfString.write().foreach(new ForeachWriter() { + * datasetOfString.writeStream().foreach(new ForeachWriter() { * *@Override *public boolean open(long partitionId, long version) {
spark git commit: [MINOR][DOC] Fix doc of ForeachWriter to use writeStream
Repository: spark Updated Branches: refs/heads/branch-2.1 ca25b1e51 -> ac7107fe7 [MINOR][DOC] Fix doc of ForeachWriter to use writeStream ## What changes were proposed in this pull request? Fix the documentation of `ForeachWriter` to use `writeStream` instead of `write` for a streaming dataset. ## How was this patch tested? Docs only. Author: Carson Wang Closes #16419 from carsonwang/FixDoc. (cherry picked from commit 2a5f52a7146abc05bf70e65eb2267cd869ac4789) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac7107fe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac7107fe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac7107fe Branch: refs/heads/branch-2.1 Commit: ac7107fe70fcd0b584001c10dd624a4d8757109c Parents: ca25b1e Author: Carson Wang Authored: Wed Dec 28 12:12:44 2016 + Committer: Sean Owen Committed: Wed Dec 28 12:12:54 2016 + -- sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ac7107fe/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala index b94ad59..372ec26 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala @@ -28,7 +28,7 @@ import org.apache.spark.annotation.{Experimental, InterfaceStability} * * Scala example: * {{{ - * datasetOfString.write.foreach(new ForeachWriter[String] { + * datasetOfString.writeStream.foreach(new ForeachWriter[String] { * * def open(partitionId: Long, version: Long): Boolean = { * // open connection @@ -46,7 +46,7 @@ import org.apache.spark.annotation.{Experimental, InterfaceStability} * * Java example: * {{{ - * datasetOfString.write().foreach(new ForeachWriter() { + * datasetOfString.writeStream().foreach(new ForeachWriter() { * *@Override *public boolean open(long partitionId, long version) {
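Read in full, the corrected pattern looks like the sketch below: the writer attaches to a streaming Dataset through `writeStream`, not `write` (the `datasetOfString` source and the `println` stand-in for a real connection are illustrative):

```scala
import org.apache.spark.sql.ForeachWriter

val query = datasetOfString.writeStream
  .foreach(new ForeachWriter[String] {
    // open a connection; returning true means this partition/version should be processed
    def open(partitionId: Long, version: Long): Boolean = true
    // write each value over the connection (stdout here as a stand-in)
    def process(value: String): Unit = println(value)
    // close the connection, inspecting any error that stopped processing
    def close(errorOrNull: Throwable): Unit = ()
  })
  .start()
```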
spark git commit: [SPARK-18960][SQL][SS] Avoid double reading file which is being copied.
Repository: spark Updated Branches: refs/heads/master 67fb33e7e -> 76e9bd748 [SPARK-18960][SQL][SS] Avoid double reading file which is being copied. ## What changes were proposed in this pull request? In HDFS, when we copy a file into a target directory, there will be a temporary `._COPYING_` file for a period of time; the duration depends on the file size. If we do not skip this file, we may read the same data twice. ## How was this patch tested? Updated unit test. Author: uncleGen Closes #16370 from uncleGen/SPARK-18960. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76e9bd74 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76e9bd74 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76e9bd74 Branch: refs/heads/master Commit: 76e9bd74885a99462ed0957aad37cbead7f14de2 Parents: 67fb33e Author: uncleGen Authored: Wed Dec 28 10:42:47 2016 + Committer: Sean Owen Committed: Wed Dec 28 10:42:47 2016 + -- .../datasources/PartitioningAwareFileIndex.scala | 11 --- .../spark/sql/execution/datasources/FileIndexSuite.scala | 1 + 2 files changed, 9 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76e9bd74/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala index 825a0f7..82c1599 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala @@ -439,10 +439,15 @@ object PartitioningAwareFileIndex extends Logging { /** Checks if we should filter out this path name. */ def shouldFilterOut(pathName: String): Boolean = { -// We filter everything that starts with _ and ., except _common_metadata and _metadata +// We filter the following paths: +// 1. everything that starts with _ and ., except _common_metadata and _metadata // because Parquet needs to find those metadata files from leaf files returned by this method. // We should refactor this logic to not mix metadata files with data files. -((pathName.startsWith("_") && !pathName.contains("=")) || pathName.startsWith(".")) && - !pathName.startsWith("_common_metadata") && !pathName.startsWith("_metadata") +// 2. everything that ends with `._COPYING_`, because this is an intermediate state of a file; we +// should skip such files to avoid double reading. 
+val exclude = (pathName.startsWith("_") && !pathName.contains("=")) || + pathName.startsWith(".") || pathName.endsWith("._COPYING_") +val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata") +exclude && !include } } http://git-wip-us.apache.org/repos/asf/spark/blob/76e9bd74/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala index b7a472b..2b4c9f3 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala @@ -142,6 +142,7 @@ class FileIndexSuite extends SharedSQLContext { assert(!PartitioningAwareFileIndex.shouldFilterOut("_common_metadata")) assert(PartitioningAwareFileIndex.shouldFilterOut("_ab_metadata")) assert(PartitioningAwareFileIndex.shouldFilterOut("_cd_common_metadata")) +assert(PartitioningAwareFileIndex.shouldFilterOut("a._COPYING_")) } test("SPARK-17613 - PartitioningAwareFileIndex: base path w/o '/' at end") {
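Mirroring the assertions above, the predicate's intended behavior after this change can be summarized as follows (a sketch; the example file names are illustrative):

```scala
// Skipped while listing leaf files:
PartitioningAwareFileIndex.shouldFilterOut("a._COPYING_")  // true  - HDFS copy still in flight
PartitioningAwareFileIndex.shouldFilterOut(".hidden")      // true  - dot-prefixed
PartitioningAwareFileIndex.shouldFilterOut("_SUCCESS")     // true  - underscore-prefixed
// Kept:
PartitioningAwareFileIndex.shouldFilterOut("_metadata")    // false - Parquet needs it
PartitioningAwareFileIndex.shouldFilterOut("part-00000")   // false - ordinary data file
```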
spark git commit: [SPARK-19010][CORE] Include Kryo exception in case of overflow
Repository: spark Updated Branches: refs/heads/master 9cff67f34 -> 67fb33e7e [SPARK-19010][CORE] Include Kryo exception in case of overflow ## What changes were proposed in this pull request? This works around an implicit result of #4947, which suppressed the original Kryo exception if the overflow happened during serialization. ## How was this patch tested? `KryoSerializerSuite` was augmented to reflect this change. Author: Sergei Lebedev Closes #16416 from superbobry/patch-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/67fb33e7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/67fb33e7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/67fb33e7 Branch: refs/heads/master Commit: 67fb33e7e078eef3ecd5dcbfc26659b6fe2d054e Parents: 9cff67f Author: Sergei Lebedev Authored: Wed Dec 28 10:30:38 2016 + Committer: Sean Owen Committed: Wed Dec 28 10:30:38 2016 + -- .../main/scala/org/apache/spark/serializer/KryoSerializer.scala | 2 +- .../scala/org/apache/spark/serializer/KryoSerializerSuite.scala | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/67fb33e7/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala -- diff --git a/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala b/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala index 7eb2da1..0381563 100644 --- a/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala +++ b/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala @@ -313,7 +313,7 @@ private[spark] class KryoSerializerInstance(ks: KryoSerializer, useUnsafe: Boole } catch { case e: KryoException if e.getMessage.startsWith("Buffer overflow") => throw new SparkException(s"Kryo serialization failed: ${e.getMessage}. To avoid this, " + - "increase spark.kryoserializer.buffer.max value.") + "increase spark.kryoserializer.buffer.max value.", e) } finally { releaseKryo(kryo) } http://git-wip-us.apache.org/repos/asf/spark/blob/67fb33e7/core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala b/core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala index 5040841..a30653b 100644 --- a/core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala +++ b/core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala @@ -23,7 +23,7 @@ import scala.collection.JavaConverters._ import scala.collection.mutable import scala.reflect.ClassTag -import com.esotericsoftware.kryo.Kryo +import com.esotericsoftware.kryo.{Kryo, KryoException} import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => KryoOutput} import org.roaringbitmap.RoaringBitmap @@ -351,6 +351,7 @@ class KryoSerializerSuite extends SparkFunSuite with SharedSparkContext { val ser = new KryoSerializer(conf).newInstance() val thrown = intercept[SparkException](ser.serialize(largeObject)) assert(thrown.getMessage.contains(kryoBufferMaxProperty)) +assert(thrown.getCause.isInstanceOf[KryoException]) } test("SPARK-1: deserialize RoaringBitmap throw Buffer underflow exception") {
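The practical effect is easiest to see from the caller's side. A minimal sketch (assuming `ser` is a `KryoSerializerInstance` and `largeObject` exceeds `spark.kryoserializer.buffer.max`; both names are illustrative):

```scala
import com.esotericsoftware.kryo.KryoException
import org.apache.spark.SparkException

try {
  ser.serialize(largeObject)
} catch {
  case e: SparkException =>
    // Before this change the original KryoException was swallowed;
    // now it travels along as the cause of the SparkException.
    assert(e.getCause.isInstanceOf[KryoException])
    println(e.getCause.getMessage) // e.g. "Buffer overflow. Available: 0, required: ..."
}
```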
spark git commit: [MINOR][ML] Correct test cases of LoR raw2prediction & probability2prediction.
Repository: spark Updated Branches: refs/heads/master 79ff85363 -> 9cff67f34 [MINOR][ML] Correct test cases of LoR raw2prediction & probability2prediction. ## What changes were proposed in this pull request? Correct test cases of ```LogisticRegression``` raw2prediction & probability2prediction. ## How was this patch tested? Changed unit tests. Author: Yanbo Liang Closes #16407 from yanboliang/raw-probability. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9cff67f3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9cff67f3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9cff67f3 Branch: refs/heads/master Commit: 9cff67f3465bc6ffe1b5abee9501e3c17f8fd194 Parents: 79ff853 Author: Yanbo Liang Authored: Wed Dec 28 01:24:18 2016 -0800 Committer: Yanbo Liang Committed: Wed Dec 28 01:24:18 2016 -0800 -- .../LogisticRegressionSuite.scala | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9cff67f3/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala index 9c4c59a..f8bcbee 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala @@ -359,8 +359,16 @@ class LogisticRegressionSuite assert(pred == predFromProb) } -// force it to use probability2prediction +// force it to use raw2prediction model.setProbabilityCol("") +val resultsUsingRaw2Predict = + model.transform(smallMultinomialDataset).select("prediction").as[Double].collect() + resultsUsingRaw2Predict.zip(results.select("prediction").as[Double].collect()).foreach { + case (pred1, pred2) => assert(pred1 === pred2) +} + +// force it to use probability2prediction +model.setRawPredictionCol("") val resultsUsingProb2Predict = model.transform(smallMultinomialDataset).select("prediction").as[Double].collect() resultsUsingProb2Predict.zip(results.select("prediction").as[Double].collect()).foreach { @@ -405,8 +413,16 @@ class LogisticRegressionSuite assert(pred == predFromProb) } -// force it to use probability2prediction +// force it to use raw2prediction model.setProbabilityCol("") +val resultsUsingRaw2Predict = + model.transform(smallBinaryDataset).select("prediction").as[Double].collect() + resultsUsingRaw2Predict.zip(results.select("prediction").as[Double].collect()).foreach { + case (pred1, pred2) => assert(pred1 === pred2) +} + +// force it to use probability2prediction +model.setRawPredictionCol("") val resultsUsingProb2Predict = model.transform(smallBinaryDataset).select("prediction").as[Double].collect() resultsUsingProb2Predict.zip(results.select("prediction").as[Double].collect()).foreach {
spark git commit: [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
Repository: spark Updated Branches: refs/heads/master 2af8b5cff -> 79ff85363 [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are popular univariate statistical tests for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. In this PR, the FDR selector uses the Benjamini-Hochberg procedure: https://en.wikipedia.org/wiki/False_discovery_rate. In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypothesis tests: https://en.wikipedia.org/wiki/Family-wise_error_rate We add FDR and FWE methods for ChiSqSelector in this PR, as implemented in scikit-learn: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? Unit tests will be added soon. Author: Peng Author: Peng, Meng Closes #15212 from mpjlu/fdr_fwe. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79ff8536 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79ff8536 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79ff8536 Branch: refs/heads/master Commit: 79ff8536315aef97ee940c52d71cd8de777c7ce6 Parents: 2af8b5c Author: Peng Authored: Wed Dec 28 00:49:36 2016 -0800 Committer: Yanbo Liang Committed: Wed Dec 28 00:49:36 2016 -0800 -- docs/ml-features.md | 6 +- docs/mllib-feature-extraction.md| 4 +- .../apache/spark/ml/feature/ChiSqSelector.scala | 48 +- .../spark/mllib/api/python/PythonMLLibAPI.scala | 4 + .../spark/mllib/feature/ChiSqSelector.scala | 62 ++-- .../spark/ml/feature/ChiSqSelectorSuite.scala | 6 + .../mllib/feature/ChiSqSelectorSuite.scala | 147 +++ python/pyspark/ml/feature.py| 74 +- python/pyspark/mllib/feature.py | 50 ++- 9 files changed, 337 insertions(+), 64 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index ca1ccc4..1d34497 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1423,12 +1423,12 @@ for more details on the API. `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: - +features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. * `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. 
- +* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection. By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. The user can choose a selection method using `setSelectorType`. http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/docs/mllib-feature-extraction.md -- diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md index 42568c3..acd2894 100644 --- a/docs/mllib-feature-extraction.md +++ b/docs/mllib-feature-extraction.md @@ -227,11 +227,13 @@ both speed and statistical learning behavior. [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://e
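To make the two new criteria concrete, here is a minimal sketch of both rules (an illustration of the statistics, not Spark's implementation; `pValues` is assumed to hold one Chi-Squared p-value per feature, and `alpha` is the configured threshold):

```scala
// Benjamini-Hochberg rule behind `fdr`: sort p-values ascending, find the largest
// rank i (1-based) with p_(i) <= alpha * i / n, and keep the features at ranks 1..i.
def selectByFdr(pValues: Array[Double], alpha: Double): Array[Int] = {
  val n = pValues.length
  val ranked = pValues.zipWithIndex.sortBy(_._1)  // (p-value, original feature index)
  val cutoff = (1 to n).filter(i => ranked(i - 1)._1 <= alpha * i / n)
    .lastOption.getOrElse(0)
  ranked.take(cutoff).map(_._2)                   // indices of selected features
}

// `fwe` is a Bonferroni-style bound: keep features whose p-value is below alpha / n.
def selectByFwe(pValues: Array[Double], alpha: Double): Array[Int] =
  pValues.zipWithIndex.filter(_._1 < alpha / pValues.length).map(_._2)
```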