spark git commit: Fix reference to external metrics documentation
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 7d63c36e1 -> ccb53a20e

Fix reference to external metrics documentation

Author: Ben McCann

Closes #12833 from benmccann/patch-1.

(cherry picked from commit 214d1be4fd4a34399b6a2adb2618784de459a48d)
Signed-off-by: Reynold Xin

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ccb53a20
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ccb53a20
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ccb53a20

Branch: refs/heads/branch-2.0
Commit: ccb53a20e4c3bff02a17542cad13a1fe36d7a7ea
Parents: 7d63c36
Author: Ben McCann
Authored: Sun May 1 22:43:28 2016 -0700
Committer: Reynold Xin
Committed: Sun May 1 22:45:21 2016 -0700

--
 docs/monitoring.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ccb53a20/docs/monitoring.md
--
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 9912cde..88002eb 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -341,7 +341,7 @@ keep the paths consistent in both modes.
 # Metrics
 Spark has a configurable metrics system based on the
-[Coda Hale Metrics Library](http://metrics.codahale.com/).
+[Dropwizard Metrics Library](http://metrics.dropwizard.io/).
 This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV
 files. The metrics system is configured via a configuration file that Spark expects to be present
 at `$SPARK_HOME/conf/metrics.properties`. A custom file location can be specified via the
spark git commit: Fix reference to external metrics documentation
Repository: spark
Updated Branches:
  refs/heads/master 44da8d8ea -> 214d1be4f

Fix reference to external metrics documentation

Author: Ben McCann

Closes #12833 from benmccann/patch-1.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/214d1be4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/214d1be4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/214d1be4

Branch: refs/heads/master
Commit: 214d1be4fd4a34399b6a2adb2618784de459a48d
Parents: 44da8d8
Author: Ben McCann
Authored: Sun May 1 22:43:28 2016 -0700
Committer: Reynold Xin
Committed: Sun May 1 22:43:28 2016 -0700

--
 docs/monitoring.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/214d1be4/docs/monitoring.md
--
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 9912cde..88002eb 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -341,7 +341,7 @@ keep the paths consistent in both modes.
 # Metrics
 Spark has a configurable metrics system based on the
-[Coda Hale Metrics Library](http://metrics.codahale.com/).
+[Dropwizard Metrics Library](http://metrics.dropwizard.io/).
 This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV
 files. The metrics system is configured via a configuration file that Spark expects to be present
 at `$SPARK_HOME/conf/metrics.properties`. A custom file location can be specified via the
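For context, a minimal sketch of how the metrics configuration described in the hunk above is typically wired up. The `spark.metrics.conf` property and the CSV sink class name come from docs/monitoring.md; the file path, sink period, output directory, and application code are illustrative assumptions, not part of this commit.

```scala
// Sketch: point Spark at a custom Dropwizard-based metrics configuration file.
import org.apache.spark.{SparkConf, SparkContext}

object MetricsConfigExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical contents of /opt/spark/conf/custom-metrics.properties:
    //   *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    //   *.sink.csv.period=10
    //   *.sink.csv.directory=/tmp/spark-metrics
    val conf = new SparkConf()
      .setAppName("metrics-example")
      .setMaster("local[*]")
      // Use a metrics file outside the default $SPARK_HOME/conf/metrics.properties.
      .set("spark.metrics.conf", "/opt/spark/conf/custom-metrics.properties")

    val sc = new SparkContext(conf)
    sc.parallelize(1 to 100).count()  // a little activity so the sinks have something to report
    sc.stop()
  }
}
```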
spark git commit: [SPARK-15049] Rename NewAccumulator to AccumulatorV2
Repository: spark Updated Branches: refs/heads/branch-2.0 705172202 -> 7d63c36e1 [SPARK-15049] Rename NewAccumulator to AccumulatorV2 ## What changes were proposed in this pull request? NewAccumulator isn't the best name if we ever come up with v3 of the API. ## How was this patch tested? Updated tests to reflect the change. Author: Reynold Xin Closes #12827 from rxin/SPARK-15049. (cherry picked from commit 44da8d8eabeccc12bfed0d43b37d54e0da845c66) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d63c36e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d63c36e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d63c36e Branch: refs/heads/branch-2.0 Commit: 7d63c36e1efe8baec96cdc16a997249728e204fd Parents: 7051722 Author: Reynold Xin Authored: Sun May 1 20:21:02 2016 -0700 Committer: Reynold Xin Committed: Sun May 1 20:21:11 2016 -0700 -- .../scala/org/apache/spark/AccumulatorV2.scala | 394 +++ .../scala/org/apache/spark/ContextCleaner.scala | 2 +- .../org/apache/spark/HeartbeatReceiver.scala| 2 +- .../scala/org/apache/spark/NewAccumulator.scala | 393 -- .../scala/org/apache/spark/SparkContext.scala | 4 +- .../scala/org/apache/spark/TaskContext.scala| 2 +- .../org/apache/spark/TaskContextImpl.scala | 2 +- .../scala/org/apache/spark/TaskEndReason.scala | 4 +- .../org/apache/spark/executor/Executor.scala| 4 +- .../org/apache/spark/executor/TaskMetrics.scala | 18 +- .../apache/spark/scheduler/DAGScheduler.scala | 10 +- .../spark/scheduler/DAGSchedulerEvent.scala | 2 +- .../scala/org/apache/spark/scheduler/Task.scala | 2 +- .../org/apache/spark/scheduler/TaskResult.scala | 8 +- .../apache/spark/scheduler/TaskScheduler.scala | 4 +- .../spark/scheduler/TaskSchedulerImpl.scala | 2 +- .../apache/spark/scheduler/TaskSetManager.scala | 2 +- .../org/apache/spark/AccumulatorSuite.scala | 2 +- .../apache/spark/InternalAccumulatorSuite.scala | 2 +- .../spark/executor/TaskMetricsSuite.scala | 6 +- .../spark/scheduler/DAGSchedulerSuite.scala | 6 +- .../scheduler/ExternalClusterManagerSuite.scala | 4 +- .../spark/scheduler/TaskSetManagerSuite.scala | 6 +- .../spark/sql/execution/metric/SQLMetrics.scala | 6 +- 24 files changed, 444 insertions(+), 443 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7d63c36e/core/src/main/scala/org/apache/spark/AccumulatorV2.scala -- diff --git a/core/src/main/scala/org/apache/spark/AccumulatorV2.scala b/core/src/main/scala/org/apache/spark/AccumulatorV2.scala new file mode 100644 index 000..c65108a --- /dev/null +++ b/core/src/main/scala/org/apache/spark/AccumulatorV2.scala @@ -0,0 +1,394 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark + +import java.{lang => jl} +import java.io.ObjectInputStream +import java.util.concurrent.ConcurrentHashMap +import java.util.concurrent.atomic.AtomicLong + +import org.apache.spark.scheduler.AccumulableInfo +import org.apache.spark.util.Utils + + +private[spark] case class AccumulatorMetadata( +id: Long, +name: Option[String], +countFailedValues: Boolean) extends Serializable + + +/** + * The base class for accumulators, that can accumulate inputs of type `IN`, and produce output of + * type `OUT`. + */ +abstract class AccumulatorV2[IN, OUT] extends Serializable { + private[spark] var metadata: AccumulatorMetadata = _ + private[this] var atDriverSide = true + + private[spark] def register( + sc: SparkContext, + name: Option[String] = None, + countFailedValues: Boolean = false): Unit = { +if (this.metadata != null) { + throw new IllegalStateException("Cannot register an Accumulator twice.") +} +this.metadata = AccumulatorMetadata(AccumulatorContext.newId(), name, countFailedValues) +AccumulatorContext.register(this) +sc.clean
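As a quick illustration of the renamed API from the driver's point of view, the sketch below uses the built-in Long accumulator, which after this patch is an `AccumulatorV2` subclass. The `longAccumulator` convenience factory is the one in the released Spark 2.0 API and may differ slightly from this in-flight commit; names and numbers are illustrative only.

```scala
// Sketch: driver-side use of a built-in accumulator after the AccumulatorV2 rename.
import org.apache.spark.{SparkConf, SparkContext}

object LongAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("acc-v2").setMaster("local[*]"))

    // Named accumulators show up in the web UI; the factory registers it for us.
    val evenCount = sc.longAccumulator("evenCount")

    sc.parallelize(1 to 100).foreach { n =>
      if (n % 2 == 0) evenCount.add(1L)  // task-side updates are merged back on the driver
    }

    println(evenCount.value)             // 50
    sc.stop()
  }
}
```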
spark git commit: [SPARK-15049] Rename NewAccumulator to AccumulatorV2
Repository: spark Updated Branches: refs/heads/master a832cef11 -> 44da8d8ea [SPARK-15049] Rename NewAccumulator to AccumulatorV2 ## What changes were proposed in this pull request? NewAccumulator isn't the best name if we ever come up with v3 of the API. ## How was this patch tested? Updated tests to reflect the change. Author: Reynold Xin Closes #12827 from rxin/SPARK-15049. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/44da8d8e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/44da8d8e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/44da8d8e Branch: refs/heads/master Commit: 44da8d8eabeccc12bfed0d43b37d54e0da845c66 Parents: a832cef Author: Reynold Xin Authored: Sun May 1 20:21:02 2016 -0700 Committer: Reynold Xin Committed: Sun May 1 20:21:02 2016 -0700 -- .../scala/org/apache/spark/AccumulatorV2.scala | 394 +++ .../scala/org/apache/spark/ContextCleaner.scala | 2 +- .../org/apache/spark/HeartbeatReceiver.scala| 2 +- .../scala/org/apache/spark/NewAccumulator.scala | 393 -- .../scala/org/apache/spark/SparkContext.scala | 4 +- .../scala/org/apache/spark/TaskContext.scala| 2 +- .../org/apache/spark/TaskContextImpl.scala | 2 +- .../scala/org/apache/spark/TaskEndReason.scala | 4 +- .../org/apache/spark/executor/Executor.scala| 4 +- .../org/apache/spark/executor/TaskMetrics.scala | 18 +- .../apache/spark/scheduler/DAGScheduler.scala | 10 +- .../spark/scheduler/DAGSchedulerEvent.scala | 2 +- .../scala/org/apache/spark/scheduler/Task.scala | 2 +- .../org/apache/spark/scheduler/TaskResult.scala | 8 +- .../apache/spark/scheduler/TaskScheduler.scala | 4 +- .../spark/scheduler/TaskSchedulerImpl.scala | 2 +- .../apache/spark/scheduler/TaskSetManager.scala | 2 +- .../org/apache/spark/AccumulatorSuite.scala | 2 +- .../apache/spark/InternalAccumulatorSuite.scala | 2 +- .../spark/executor/TaskMetricsSuite.scala | 6 +- .../spark/scheduler/DAGSchedulerSuite.scala | 6 +- .../scheduler/ExternalClusterManagerSuite.scala | 4 +- .../spark/scheduler/TaskSetManagerSuite.scala | 6 +- .../spark/sql/execution/metric/SQLMetrics.scala | 6 +- 24 files changed, 444 insertions(+), 443 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/44da8d8e/core/src/main/scala/org/apache/spark/AccumulatorV2.scala -- diff --git a/core/src/main/scala/org/apache/spark/AccumulatorV2.scala b/core/src/main/scala/org/apache/spark/AccumulatorV2.scala new file mode 100644 index 000..c65108a --- /dev/null +++ b/core/src/main/scala/org/apache/spark/AccumulatorV2.scala @@ -0,0 +1,394 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark + +import java.{lang => jl} +import java.io.ObjectInputStream +import java.util.concurrent.ConcurrentHashMap +import java.util.concurrent.atomic.AtomicLong + +import org.apache.spark.scheduler.AccumulableInfo +import org.apache.spark.util.Utils + + +private[spark] case class AccumulatorMetadata( +id: Long, +name: Option[String], +countFailedValues: Boolean) extends Serializable + + +/** + * The base class for accumulators, that can accumulate inputs of type `IN`, and produce output of + * type `OUT`. + */ +abstract class AccumulatorV2[IN, OUT] extends Serializable { + private[spark] var metadata: AccumulatorMetadata = _ + private[this] var atDriverSide = true + + private[spark] def register( + sc: SparkContext, + name: Option[String] = None, + countFailedValues: Boolean = false): Unit = { +if (this.metadata != null) { + throw new IllegalStateException("Cannot register an Accumulator twice.") +} +this.metadata = AccumulatorMetadata(AccumulatorContext.newId(), name, countFailedValues) +AccumulatorContext.register(this) +sc.cleaner.foreach(_.registerAccumulatorForCleanup(this)) + } + + /** + * Returns true if this accumulator has
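For the user-facing side of the rename, the following sketch defines a custom accumulator on top of `AccumulatorV2`. The abstract members used here (`isZero`, `copy`, `reset`, `add`, `merge`, `value`) reflect the API as it stabilized for the Spark 2.0 release, where the class is importable from `org.apache.spark.util`; at the time of this commit the class still lived in package `org.apache.spark`, as the diff above shows. Treat this as an illustration rather than code taken from the patch.

```scala
// Sketch: a user-defined accumulator that tracks the maximum Long seen across tasks.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2

class MaxAccumulator extends AccumulatorV2[Long, Long] {
  private var _max: Long = Long.MinValue

  override def isZero: Boolean = _max == Long.MinValue
  override def copy(): MaxAccumulator = {
    val acc = new MaxAccumulator
    acc._max = _max
    acc
  }
  override def reset(): Unit = { _max = Long.MinValue }
  override def add(v: Long): Unit = { _max = math.max(_max, v) }
  override def merge(other: AccumulatorV2[Long, Long]): Unit = { _max = math.max(_max, other.value) }
  override def value: Long = _max
}

object MaxAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("max-acc").setMaster("local[*]"))

    val maxSeen = new MaxAccumulator
    sc.register(maxSeen, "maxSeen")  // registration attaches the AccumulatorMetadata shown above

    sc.parallelize(Seq(3L, 42L, 7L)).foreach(maxSeen.add)
    println(maxSeen.value)           // 42
    sc.stop()
  }
}
```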
spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Repository: spark Updated Branches: refs/heads/branch-2.0 a6428292f -> 705172202 [SPARK-13425][SQL] Documentation for CSV datasource options ## What changes were proposed in this pull request? This PR adds the explanation and documentation for CSV options for reading and writing. ## How was this patch tested? Style tests with `./dev/run_tests` for documentation style. Author: hyukjinkwon Author: Hyukjin Kwon Closes #12817 from HyukjinKwon/SPARK-13425. (cherry picked from commit a832cef11233c6357c7ba7ede387b432e6b0ed71) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70517220 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70517220 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70517220 Branch: refs/heads/branch-2.0 Commit: 7051722023b98f1720142c7b3b41948d275ea455 Parents: a642829 Author: hyukjinkwon Authored: Sun May 1 19:05:20 2016 -0700 Committer: Reynold Xin Committed: Sun May 1 19:05:32 2016 -0700 -- python/pyspark/sql/readwriter.py| 52 .../org/apache/spark/sql/DataFrameReader.scala | 47 -- .../org/apache/spark/sql/DataFrameWriter.scala | 8 +++ 3 files changed, 103 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index ed9e716..cc5e93d 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -282,6 +282,45 @@ class DataFrameReader(object): :param paths: string, or list of strings, for input path(s). +You can set the following CSV-specific options to deal with CSV files: +* ``sep`` (default ``,``): sets the single character as a separator \ +for each field and value. +* ``charset`` (default ``UTF-8``): decodes the CSV files by the given \ +encoding type. +* ``quote`` (default ``"``): sets the single character used for escaping \ +quoted values where the separator can be part of the value. +* ``escape`` (default ``\``): sets the single character used for escaping quotes \ +inside an already quoted value. +* ``comment`` (default empty string): sets the single character used for skipping \ +lines beginning with this character. By default, it is disabled. +* ``header`` (default ``false``): uses the first line as names of columns. +* ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \ +whitespaces from values being read should be skipped. +* ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \ +whitespaces from values being read should be skipped. +* ``nullValue`` (default empty string): sets the string representation of a null value. +* ``nanValue`` (default ``NaN``): sets the string representation of a non-number \ +value. +* ``positiveInf`` (default ``Inf``): sets the string representation of a positive \ +infinity value. +* ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \ +infinity value. +* ``dateFormat`` (default ``None``): sets the string that indicates a date format. \ +Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \ +applies to both date type and timestamp type. By default, it is None which means \ +trying to parse times and date by ``java.sql.Timestamp.valueOf()`` and \ +``java.sql.Date.valueOf()``. +* ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \ +a record can have. 
+* ``maxCharsPerColumn`` (default ``100``): defines the maximum number of \ +characters allowed for any given value being read. +* ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \ +during parsing. +* ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \ +When a schema is set by user, it sets ``null`` for extra fields. +* ``DROPMALFORMED`` : ignores the whole corrupted records. +* ``FAILFAST`` : throws an exception when it meets corrupted records. + >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv') >>> df.dtypes [('C0', 'str
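A hedged Scala counterpart to the Python docstring above: the option keys (`sep`, `header`, `nullValue`, `dateFormat`, `mode`) are the ones documented by this patch, while the session setup and input path are illustrative assumptions.

```scala
// Sketch: reading CSV with a few of the documented options from the Scala side.
import org.apache.spark.sql.SparkSession

object CsvReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-options")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .option("sep", ";")                  // field delimiter instead of the default ","
      .option("header", "true")            // first line holds column names
      .option("nullValue", "NA")           // treat "NA" as null
      .option("dateFormat", "yyyy-MM-dd")  // SimpleDateFormat pattern for date columns
      .option("mode", "DROPMALFORMED")     // silently drop corrupt records
      .csv("/path/to/people.csv")          // hypothetical input file

    df.printSchema()
    df.show()
    spark.stop()
  }
}
```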
spark git commit: [SPARK-13425][SQL] Documentation for CSV datasource options
Repository: spark Updated Branches: refs/heads/master a6428292f -> a832cef11 [SPARK-13425][SQL] Documentation for CSV datasource options ## What changes were proposed in this pull request? This PR adds the explanation and documentation for CSV options for reading and writing. ## How was this patch tested? Style tests with `./dev/run_tests` for documentation style. Author: hyukjinkwon Author: Hyukjin Kwon Closes #12817 from HyukjinKwon/SPARK-13425. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a832cef1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a832cef1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a832cef1 Branch: refs/heads/master Commit: a832cef11233c6357c7ba7ede387b432e6b0ed71 Parents: a642829 Author: hyukjinkwon Authored: Sun May 1 19:05:20 2016 -0700 Committer: Reynold Xin Committed: Sun May 1 19:05:20 2016 -0700 -- python/pyspark/sql/readwriter.py| 52 .../org/apache/spark/sql/DataFrameReader.scala | 47 -- .../org/apache/spark/sql/DataFrameWriter.scala | 8 +++ 3 files changed, 103 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a832cef1/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index ed9e716..cc5e93d 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -282,6 +282,45 @@ class DataFrameReader(object): :param paths: string, or list of strings, for input path(s). +You can set the following CSV-specific options to deal with CSV files: +* ``sep`` (default ``,``): sets the single character as a separator \ +for each field and value. +* ``charset`` (default ``UTF-8``): decodes the CSV files by the given \ +encoding type. +* ``quote`` (default ``"``): sets the single character used for escaping \ +quoted values where the separator can be part of the value. +* ``escape`` (default ``\``): sets the single character used for escaping quotes \ +inside an already quoted value. +* ``comment`` (default empty string): sets the single character used for skipping \ +lines beginning with this character. By default, it is disabled. +* ``header`` (default ``false``): uses the first line as names of columns. +* ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \ +whitespaces from values being read should be skipped. +* ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \ +whitespaces from values being read should be skipped. +* ``nullValue`` (default empty string): sets the string representation of a null value. +* ``nanValue`` (default ``NaN``): sets the string representation of a non-number \ +value. +* ``positiveInf`` (default ``Inf``): sets the string representation of a positive \ +infinity value. +* ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \ +infinity value. +* ``dateFormat`` (default ``None``): sets the string that indicates a date format. \ +Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \ +applies to both date type and timestamp type. By default, it is None which means \ +trying to parse times and date by ``java.sql.Timestamp.valueOf()`` and \ +``java.sql.Date.valueOf()``. +* ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \ +a record can have. +* ``maxCharsPerColumn`` (default ``100``): defines the maximum number of \ +characters allowed for any given value being read. 
+* ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \ +during parsing. +* ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \ +When a schema is set by user, it sets ``null`` for extra fields. +* ``DROPMALFORMED`` : ignores the whole corrupted records. +* ``FAILFAST`` : throws an exception when it meets corrupted records. + >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv') >>> df.dtypes [('C0', 'string'), ('C1', 'string')] @@ -663,6 +702,19 @@ class DataFrameWriter(object):
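The same patch also documents writer-side CSV options on `DataFrameWriter`. Below is a small, hedged sketch of writing CSV with a couple of the documented options; the toy data and output path are assumptions, and other reader options apply on the writer side only where the documentation says so.

```scala
// Sketch: writing CSV output with a subset of the documented options.
import org.apache.spark.sql.SparkSession

object CsvWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-write").master("local[*]").getOrCreate()

    val df = spark.range(0, 5).selectExpr("id", "id * id AS square")

    df.write
      .option("header", "true")  // emit a header row with column names
      .option("sep", "\t")       // tab-separated output instead of commas
      .mode("overwrite")
      .csv("/tmp/squares-csv")   // hypothetical output directory

    spark.stop()
  }
}
```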
[spark] Git Push Summary
Repository: spark Updated Branches: refs/heads/branch-2.0 [created] a6428292f
spark git commit: [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update
Repository: spark Updated Branches: refs/heads/master cdf9e9753 -> a6428292f [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update ## What changes were proposed in this pull request? This PR is an update for [https://github.com/apache/spark/pull/12738] which: * Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side * Various fixes for bugs found * This includes changing classes taking weightCol to treat unset and empty String Param values the same way. Defaults changed: * Scala * LogisticRegression: weightCol defaults to not set (instead of empty string) * StringIndexer: labels default to not set (instead of empty array) * GeneralizedLinearRegression: * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver) * weightCol defaults to not set (instead of empty string) * LinearRegression: weightCol defaults to not set (instead of empty string) * Python * MultilayerPerceptron: layers default to not set (instead of [1,1]) * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set) ## How was this patch tested? Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test. Author: Joseph K. Bradley Author: yinxusen Closes #12816 from jkbradley/yinxusen-SPARK-14931. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6428292 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6428292 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6428292 Branch: refs/heads/master Commit: a6428292f78fd594f41a4a7bf254d40268f46305 Parents: cdf9e97 Author: Xusen Yin Authored: Sun May 1 12:29:01 2016 -0700 Committer: Joseph K. Bradley Committed: Sun May 1 12:29:01 2016 -0700 -- .../ml/classification/LogisticRegression.scala | 7 ++- .../apache/spark/ml/feature/StringIndexer.scala | 5 +- .../GeneralizedLinearRegression.scala | 31 +++-- .../spark/ml/regression/LinearRegression.scala | 15 +++--- .../LogisticRegressionSuite.scala | 2 +- .../GeneralizedLinearRegressionSuite.scala | 2 +- python/pyspark/ml/classification.py | 13 ++ python/pyspark/ml/feature.py| 1 + python/pyspark/ml/regression.py | 9 ++-- python/pyspark/ml/tests.py | 48 python/pyspark/ml/wrapper.py| 3 +- 11 files changed, 96 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a6428292/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 717e93c..d2d4e24 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -235,13 +235,12 @@ class LogisticRegression @Since("1.2.0") ( /** * Whether to over-/under-sample training instances according to the given weights in weightCol. - * If empty, all instances are treated equally (weight 1.0). - * Default is empty, so all instances have weight one. + * If not set or empty String, all instances are treated equally (weight 1.0). + * Default is not set, so all instances have weight one. 
* @group setParam */ @Since("1.6.0") def setWeightCol(value: String): this.type = set(weightCol, value) - setDefault(weightCol -> "") @Since("1.5.0") override def setThresholds(value: Array[Double]): this.type = super.setThresholds(value) @@ -264,7 +263,7 @@ class LogisticRegression @Since("1.2.0") ( protected[spark] def train(dataset: Dataset[_], handlePersistence: Boolean): LogisticRegressionModel = { -val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol)) +val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) val instances: RDD[Instance] = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd.map { case Row(label: Double, weight: Double, features: Vector) => http://git-wip-us.apache.org/repos/asf/spark/blob/a6428292/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index 7e0d374..cc0571f 100644 --- a/m
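To make the behavioral change concrete, here is a hedged sketch of `weightCol` under the new default: leaving it unset (or set to the empty string) gives every instance weight 1.0, while setting it points the estimator at a weight column. The toy data, column names, and imports follow the released Spark 2.0 ML API and are not taken from the patch.

```scala
// Sketch: LogisticRegression with and without a weight column after SPARK-14931.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object WeightColExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("weight-col").master("local[*]").getOrCreate()

    val training = spark.createDataFrame(Seq(
      (1.0, 2.0, Vectors.dense(0.0, 1.1)),
      (0.0, 1.0, Vectors.dense(2.0, 1.0)),
      (1.0, 0.5, Vectors.dense(0.5, 0.3))
    )).toDF("label", "weight", "features")

    // weightCol left unset: every row is treated as having weight 1.0.
    val unweighted = new LogisticRegression().setMaxIter(10).fit(training)

    // weightCol set: the "weight" column scales each instance's contribution.
    val weighted = new LogisticRegression().setMaxIter(10).setWeightCol("weight").fit(training)

    println(unweighted.coefficients)
    println(weighted.coefficients)
    spark.stop()
  }
}
```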
spark git commit: [SPARK-14505][CORE] Fix bug: when creating two SparkContext objects in the same JVM, the first one cannot run any task
Repository: spark Updated Branches: refs/heads/master 90787de86 -> cdf9e9753 [SPARK-14505][CORE] Fix bug: when creating two SparkContext objects in the same JVM, the first one cannot run any task. After creating two SparkContext objects in the same JVM (the second one cannot be created successfully), using the first one to run a job will throw an exception like the one below: ![image](https://cloud.githubusercontent.com/assets/7162889/14402832/0c8da2a6-fe73-11e5-8aba-68ee3ddaf605.png) Author: Allen Closes #12273 from the-sea/context-create-bug. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cdf9e975 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cdf9e975 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cdf9e975 Branch: refs/heads/master Commit: cdf9e9753df4e7f2fa4e972d1bfded4e22943c27 Parents: 90787de Author: Allen Authored: Sun May 1 15:39:14 2016 +0100 Committer: Sean Owen Committed: Sun May 1 15:39:14 2016 +0100 -- .../scala/org/apache/spark/SparkContext.scala | 27 +--- .../org/apache/spark/SparkContextSuite.scala| 4 +++ 2 files changed, 16 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cdf9e975/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index ed4408c..2cb3ed0 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -2216,21 +2216,7 @@ object SparkContext extends Logging { sc: SparkContext, allowMultipleContexts: Boolean): Unit = { SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { - contextBeingConstructed.foreach { otherContext => -if (otherContext ne sc) { // checks for reference equality - // Since otherContext might point to a partially-constructed context, guard against - // its creationSite field being null: - val otherContextCreationSite = - Option(otherContext.creationSite).map(_.longForm).getOrElse("unknown location") - val warnMsg = "Another SparkContext is being constructed (or threw an exception in its" + -" constructor). This may indicate an error, since only one SparkContext may be" + -" running in this JVM (see SPARK-2243)." + -s" The other SparkContext was created at:\n$otherContextCreationSite" - logWarning(warnMsg) -} - -if (activeContext.get() != null) { - val ctx = activeContext.get() + Option(activeContext.get()).filter(_ ne sc).foreach { ctx => val errMsg = "Only one SparkContext may be running in this JVM (see SPARK-2243)." + " To ignore this error, set spark.driver.allowMultipleContexts = true. " + s"The currently running SparkContext was created at:\n${ctx.creationSite.longForm}" @@ -2241,6 +2227,17 @@ object SparkContext extends Logging { throw exception } } + + contextBeingConstructed.filter(_ ne sc).foreach { otherContext => +// Since otherContext might point to a partially-constructed context, guard against +// its creationSite field being null: +val otherContextCreationSite = + Option(otherContext.creationSite).map(_.longForm).getOrElse("unknown location") +val warnMsg = "Another SparkContext is being constructed (or threw an exception in its" + + " constructor). This may indicate an error, since only one SparkContext may be" + + " running in this JVM (see SPARK-2243)." 
+ + s" The other SparkContext was created at:\n$otherContextCreationSite" +logWarning(warnMsg) } } } http://git-wip-us.apache.org/repos/asf/spark/blob/cdf9e975/core/src/test/scala/org/apache/spark/SparkContextSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/SparkContextSuite.scala b/core/src/test/scala/org/apache/spark/SparkContextSuite.scala index 841fd02..a759f36 100644 --- a/core/src/test/scala/org/apache/spark/SparkContextSuite.scala +++ b/core/src/test/scala/org/apache/spark/SparkContextSuite.scala @@ -39,8 +39,12 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext { val conf = new SparkConf().setAppName("test").setMaster("local") .set("spark.driver.allowMultipleContexts", "false") sc = new SparkContext(conf) +val envBefore = SparkEnv.get // A SparkContext is already running, so we shouldn't be able to create a second one intercept[SparkException] { new SparkContext(conf