spark git commit: [SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred.
Repository: spark
Updated Branches:
  refs/heads/master 60472dbfd -> cfb25b27c

[SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred.

## What changes were proposed in this pull request?

Update the description of `spark.shuffle.maxChunksBeingTransferred` to note that new incoming connections will be closed when the max is hit and that the client should have a retry mechanism.

Author: jinxing

Closes #18735 from jinxing64/SPARK-21530.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cfb25b27
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cfb25b27
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cfb25b27

Branch: refs/heads/master
Commit: cfb25b27c0b32a8a70a518955fb269314b1fd716
Parents: 60472db
Author: jinxing
Authored: Thu Jul 27 11:55:48 2017 +0800
Committer: Wenchen Fan
Committed: Thu Jul 27 11:55:48 2017 +0800

--
 .../main/java/org/apache/spark/network/util/TransportConf.java | 6 +++++-
 docs/configuration.md                                          | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cfb25b27/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
--
diff --git a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
index ea52e9f..88256b8 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
@@ -258,7 +258,11 @@ public class TransportConf {
   }

   /**
-   * The max number of chunks allowed to being transferred at the same time on shuffle service.
+   * The max number of chunks allowed to be transferred at the same time on shuffle service.
+   * Note that new incoming connections will be closed when the max number is hit. The client will
+   * retry according to the shuffle retry configs (see `spark.shuffle.io.maxRetries` and
+   * `spark.shuffle.io.retryWait`); if those limits are reached the task will fail with fetch
+   * failure.
    */
   public long maxChunksBeingTransferred() {
     return conf.getLong("spark.shuffle.maxChunksBeingTransferred", Long.MAX_VALUE);

http://git-wip-us.apache.org/repos/asf/spark/blob/cfb25b27/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index f4b6f46..500f980 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -635,7 +635,11 @@ Apart from these, the following properties are also available, and may be useful

 spark.shuffle.maxChunksBeingTransferred
 Long.MAX_VALUE
-The max number of chunks allowed to being transferred at the same time on shuffle service.
+The max number of chunks allowed to be transferred at the same time on shuffle service.
+Note that new incoming connections will be closed when the max number is hit. The client will
+retry according to the shuffle retry configs (see spark.shuffle.io.maxRetries and
+spark.shuffle.io.retryWait); if those limits are reached the task will fail with
+fetch failure.
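For context, a minimal sketch of how these settings fit together on the client side. The three config keys are the real ones named in the diff above; the values are illustrative assumptions only, not recommendations from the commit:

```scala
import org.apache.spark.SparkConf

// Hedged sketch: cap concurrent chunk transfers on the shuffle service and
// tune the client-side retry knobs the new doc text points to.
val conf = new SparkConf()
  .set("spark.shuffle.maxChunksBeingTransferred", "4096") // default: Long.MAX_VALUE
  .set("spark.shuffle.io.maxRetries", "6")  // fetch retries before the task fails
  .set("spark.shuffle.io.retryWait", "10s") // wait between consecutive retries
```

With such a cap in place, connections opened past the limit are closed and the client falls back on the retry path described in the new documentation text.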
spark git commit: [SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions
Repository: spark
Updated Branches:
  refs/heads/master cf29828d7 -> 60472dbfd

[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions

## What changes were proposed in this pull request?

This generates documentation for Spark SQL built-in functions. One drawback is that this requires a proper build to generate the built-in function list. Once it is built, it only takes a few seconds via `sql/create-docs.sh`.

Please see https://spark-test.github.io/sparksqldoc/ that I hosted to show the output documentation.

There is a little more work to be done to make the documentation pretty, for example, separating `Arguments:` and `Examples:`, but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by parsing them manually. I will fix these in a follow-up.

This requires `pip install mkdocs` to generate HTMLs from markdown files.

## How was this patch tested?

Manually tested:

```
cd docs
jekyll build
```

,

```
cd docs
jekyll serve
```

and

```
cd sql
create-docs.sh
```

Author: hyukjinkwon

Closes #18702 from HyukjinKwon/SPARK-21485.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/60472dbf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/60472dbf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/60472dbf

Branch: refs/heads/master
Commit: 60472dbfd97acfd6c4420a13f9b32bc9d84219f3
Parents: cf29828
Author: hyukjinkwon
Authored: Wed Jul 26 09:38:51 2017 -0700
Committer: Reynold Xin
Committed: Wed Jul 26 09:38:51 2017 -0700

--
 .gitignore                                      |  2 +
 docs/README.md                                  |  6 +-
 docs/_layouts/global.html                       |  1 +
 docs/_plugins/copy_api_dirs.rb                  | 27 ++
 docs/api.md                                     |  1 +
 docs/index.md                                   |  1 +
 sql/README.md                                   |  2 +
 .../spark/sql/api/python/PythonSQLUtils.scala   |  7 ++
 sql/create-docs.sh                              | 49 +++
 sql/gen-sql-markdown.py                         | 91
 sql/mkdocs.yml                                  | 19
 11 files changed, 203 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/.gitignore
--
diff --git a/.gitignore b/.gitignore
index cf9780d..903297d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -47,6 +47,8 @@ dev/pr-deps/
 dist/
 docs/_site
 docs/api
+sql/docs
+sql/site
 lib_managed/
 lint-r-report.log
 log/

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/docs/README.md
--
diff --git a/docs/README.md b/docs/README.md
index 90e10a1..0090dd0 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -68,6 +68,6 @@ jekyll plugin to run `build/sbt unidoc` before building the site so if you haven
 may take some time as it generates all of the scaladoc. The jekyll plugin also generates
 the PySpark docs using [Sphinx](http://sphinx-doc.org/).

-NOTE: To skip the step of building and copying over the Scala, Python, R API docs, run `SKIP_API=1
-jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, and `SKIP_RDOC=1` can be used to skip a single
-step of the corresponding language.
+NOTE: To skip the step of building and copying over the Scala, Python, R and SQL API docs, run `SKIP_API=1
+jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
+to skip a single step of the corresponding language.

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/docs/_layouts/global.html
--
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 570483c..67b05ec 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -86,6 +86,7 @@
 Java
 Python
 R
+SQL, Built-in Functions

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/docs/_plugins/copy_api_dirs.rb
--
diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index 95e3ba3..00366f8 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -150,4 +150,31 @@ if not (ENV['SKIP_API'] == '1')
     cp("../R/pkg/DESCRIPTION", "api")
   end

+  if not (ENV['SKIP_SQLDOC'] == '1')
+    # Build SQL API docs
+
+    puts "Moving to project root and building API docs."
+    curr_dir = pwd
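For context on what feeds this pipeline: each built-in function's documentation comes from its `ExpressionDescription` metadata, surfaced as `ExpressionInfo` (both named in the commit message above). Below is a hedged Scala re-sketch of the kind of rendering `sql/gen-sql-markdown.py` performs; `FnDoc`, `toMarkdown`, and the exact markdown shape are illustrative assumptions rather than the script's real output, though the `_FUNC_` placeholder substitution is how Spark's usage strings work:

```scala
// Illustrative stand-ins for ExpressionInfo.getName/getUsage/getExtended.
case class FnDoc(name: String, usage: String, extended: String)

// Hypothetical rendering step: substitute the _FUNC_ placeholder used in
// Spark's usage strings and emit one markdown section per function.
def toMarkdown(fn: FnDoc): String = {
  val usage = fn.usage.replace("_FUNC_", fn.name)
  val extended = fn.extended.replace("_FUNC_", fn.name)
  s"### ${fn.name}\n\n$usage\n\n$extended\n"
}

// Example shape of one entry (doc text is illustrative):
println(toMarkdown(FnDoc(
  name = "abs",
  usage = "_FUNC_(expr) - Returns the absolute value of `expr`.",
  extended = "Examples:\n  > SELECT _FUNC_(-1);\n   1")))
```

At this point `extended` is one free-form string, which is why the commit message defers splitting `Arguments:` and `Examples:` into structured fields to a follow-up.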
spark git commit: [SPARK-20988][ML] Logistic regression uses aggregator hierarchy
Repository: spark
Updated Branches:
  refs/heads/master ae4ea5fe2 -> cf29828d7

[SPARK-20988][ML] Logistic regression uses aggregator hierarchy

## What changes were proposed in this pull request?

This change pulls the `LogisticAggregator` class out of LogisticRegression.scala and makes it extend `DifferentiableLossAggregator`. It also changes logistic regression to use the generic `RDDLossFunction` instead of having its own.

Other minor changes:
* L2Regularization accepts `Option[Int => Double]` for features standard deviation
* L2Regularization uses `Vector` type instead of Array
* Some tests added to LeastSquaresAggregator

## How was this patch tested?

Unit test suites are added.

Author: sethah

Closes #18305 from sethah/SPARK-20988.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cf29828d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cf29828d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cf29828d

Branch: refs/heads/master
Commit: cf29828d720611f7633a8114917bac901f76dece
Parents: ae4ea5f
Author: sethah
Authored: Wed Jul 26 13:38:53 2017 +0200
Committer: Nick Pentreath
Committed: Wed Jul 26 13:38:53 2017 +0200

--
 .../ml/classification/LogisticRegression.scala  | 492 +--
 .../optim/aggregator/LogisticAggregator.scala   | 364 ++
 .../loss/DifferentiableRegularization.scala     |  55 ++-
 .../spark/ml/optim/loss/RDDLossFunction.scala   |  10 +-
 .../spark/ml/regression/LinearRegression.scala  |   3 +-
 .../LogisticRegressionSuite.scala               |   9 +-
 .../DifferentiableLossAggregatorSuite.scala     |  37 ++
 .../LeastSquaresAggregatorSuite.scala           |  47 +-
 .../aggregator/LogisticAggregatorSuite.scala    | 253 ++
 .../DifferentiableRegularizationSuite.scala     |  13 +-
 .../ml/optim/loss/RDDLossFunctionSuite.scala    |   6 +-
 11 files changed, 752 insertions(+), 537 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cf29828d/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 65b09e5..6bba7f9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -27,11 +27,11 @@ import org.apache.hadoop.fs.Path

 import org.apache.spark.SparkException
 import org.apache.spark.annotation.{Experimental, Since}
-import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.internal.Logging
 import org.apache.spark.ml.feature.Instance
 import org.apache.spark.ml.linalg._
-import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.optim.aggregator.LogisticAggregator
+import org.apache.spark.ml.optim.loss.{L2Regularization, RDDLossFunction}
 import org.apache.spark.ml.param._
 import org.apache.spark.ml.param.shared._
 import org.apache.spark.ml.util._
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
     val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)

     val bcFeaturesStd = instances.context.broadcast(featuresStd)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-      $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+    val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+      multinomial = isMultinomial)(_)
+    val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+      featuresStd(j / numCoefficientSets)
+    } else {
+      0.0
+    }
+
+    val regularization = if (regParamL2 != 0.0) {
+      val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
+      Some(new L2Regularization(regParamL2, shouldApply,
+        if ($(standardization)) None else Some(getFeaturesStd)))
+    } else {
+      None
+    }
+
+    val costFun = new RDDLossFunction(instances, getAggregatorFunc, regularization,
       $(aggregationDepth))

     val numCoeffsPlusIntercepts = numFeaturesPlusIntercept * numCoefficientSets
@@ -1236,7 +1251,7 @@ object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] {
 * Two MultilabelSummarizer can be merged together to have a statistical summary of the
 * corresponding joint dataset.
 */
-private[classification] class MultiClassSummarizer extends Serializable {
+private[ml] class MultiClassSummarizer extends Serializable {
   // The first element of value in distinctMap is the act
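Since the digest truncates the diff, here is a hedged, self-contained Scala sketch of the aggregator pattern this refactor standardizes on: a per-partition accumulator of loss and gradient that merges associatively, so one generic RDD-backed loss function can serve several linear models. Names and signatures below are simplified stand-ins (Spark's real `DifferentiableLossAggregator` trait and `Instance` class are `private[ml]`), and the squared-error update is a placeholder for the logistic updates `LogisticAggregator` actually implements:

```scala
// Minimal stand-in for Spark ML's package-private Instance class.
case class Instance(label: Double, weight: Double, features: Array[Double])

class SimpleLossAggregator(coefficients: Array[Double]) extends Serializable {
  private var lossSum = 0.0
  private var weightSum = 0.0
  private val gradientSum = new Array[Double](coefficients.length)

  // Model-specific part: plain squared-error loss stands in for the
  // binomial/multinomial updates that LogisticAggregator supplies.
  def add(instance: Instance): this.type = {
    val margin = coefficients.zip(instance.features).map { case (w, x) => w * x }.sum
    val err = margin - instance.label
    lossSum += instance.weight * 0.5 * err * err
    weightSum += instance.weight
    for (i <- gradientSum.indices) gradientSum(i) += instance.weight * err * instance.features(i)
    this
  }

  // Associative merge is what lets a generic RDD loss function combine
  // per-partition aggregators with treeAggregate.
  def merge(other: SimpleLossAggregator): this.type = {
    lossSum += other.lossSum
    weightSum += other.weightSum
    for (i <- gradientSum.indices) gradientSum(i) += other.gradientSum(i)
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}
```

In the commit itself, `RDDLossFunction` composes such an aggregator with the optional `L2Regularization` term shown in the diff above, which is what removes the need for a logistic-specific `LogisticCostFun`.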
spark git commit: [SPARK-21524][ML] unit test fix: ValidatorParamsSuiteHelpers generates wrong temp files
Repository: spark
Updated Branches:
  refs/heads/master 16612638f -> ae4ea5fe2

[SPARK-21524][ML] unit test fix: ValidatorParamsSuiteHelpers generates wrong temp files

## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-21524

ValidatorParamsSuiteHelpers.testFileMove() is generating temp dirs in the wrong place and does not delete them.

ValidatorParamsSuiteHelpers.testFileMove() is invoked by TrainValidationSplitSuite and CrossValidatorSuite. Currently it uses `tempDir` from `TempDirectory`, which unfortunately is never initialized since the `beforeAll()` of `ValidatorParamsSuiteHelpers` is never invoked. On my system, it leaves some temp directories in the assembly folder each time I run TrainValidationSplitSuite and CrossValidatorSuite.

## How was this patch tested?

Unit test fix.

Author: Yuhao Yang

Closes #18728 from hhbyyh/tempDirFix.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ae4ea5fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ae4ea5fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ae4ea5fe

Branch: refs/heads/master
Commit: ae4ea5fe253e9083e807bdc769e829f3f37265b4
Parents: 1661263
Author: Yuhao Yang
Authored: Wed Jul 26 10:37:48 2017 +0100
Committer: Sean Owen
Committed: Wed Jul 26 10:37:48 2017 +0100

--
 .../org/apache/spark/ml/tuning/CrossValidatorSuite.scala    | 2 +-
 .../apache/spark/ml/tuning/TrainValidationSplitSuite.scala  | 2 +-
 .../spark/ml/tuning/ValidatorParamsSuiteHelpers.scala       | 9 +++++----
 3 files changed, 7 insertions(+), 6 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4ea5fe/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
index 2791ea7..dc6043e 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
@@ -222,7 +222,7 @@ class CrossValidatorSuite
       .setNumFolds(20)
       .setEstimatorParamMaps(paramMaps)

-    ValidatorParamsSuiteHelpers.testFileMove(cv)
+    ValidatorParamsSuiteHelpers.testFileMove(cv, tempDir)
   }

   test("read/write: CrossValidator with complex estimator") {

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4ea5fe/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
index 71a1776..7c97865 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
@@ -209,7 +209,7 @@ class TrainValidationSplitSuite
       .setEstimatorParamMaps(paramMaps)
       .setSeed(42L)

-    ValidatorParamsSuiteHelpers.testFileMove(tvs)
+    ValidatorParamsSuiteHelpers.testFileMove(tvs, tempDir)
   }

   test("read/write: TrainValidationSplitModel") {

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4ea5fe/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
index 1df673c..eae1f5a 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
@@ -20,11 +20,12 @@ package org.apache.spark.ml.tuning
 import java.io.File
 import java.nio.file.{Files, StandardCopyOption}

-import org.apache.spark.SparkFunSuite
+import org.scalatest.Assertions
+
 import org.apache.spark.ml.param.{ParamMap, ParamPair, Params}
-import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLReader, MLWritable}
+import org.apache.spark.ml.util.{Identifiable, MLReader, MLWritable}

-object ValidatorParamsSuiteHelpers extends SparkFunSuite with DefaultReadWriteTest {
+object ValidatorParamsSuiteHelpers extends Assertions {
   /**
    * Assert sequences of estimatorParamMaps are identical.
    * If the values for a parameter are not directly comparable with ===
@@ -62,7 +63,7 @@ object ValidatorParamsSuiteHelpers extends SparkFunSuite with DefaultReadWriteTe
    * the path of the estimator so that if the pare
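The lifecycle pitfall behind this fix is easy to reproduce. Below is a hedged, self-contained Scala sketch (names simplified from Spark's `TempDirectory` trait; `TempDirLike`, `BuggyHelpers`, and `FixedHelpers` are illustrative, not Spark code) of why the helper object leaked directories and how passing the caller's `tempDir` fixes it:

```scala
import java.io.File
import java.nio.file.Files

// Simplified stand-in for the TempDirectory suite trait: tempDir is only
// assigned when the test framework runs beforeAll() on an actual suite.
trait TempDirLike {
  protected var tempDir: File = _
  def beforeAll(): Unit = { tempDir = Files.createTempDirectory("spark-test").toFile }
  def afterAll(): Unit = { tempDir.listFiles().foreach(_.delete()); tempDir.delete() }
}

object BuggyHelpers extends TempDirLike {
  // beforeAll() is never invoked on a helper object that is not run as a
  // suite, so tempDir stays null here; new File(null: File, child) yields a
  // path relative to the working directory, which is where the stray
  // directories accumulated.
  def moveTarget(): File = new File(tempDir, "serialized")
}

object FixedHelpers {
  // The fix in this commit: take the owning suite's tempDir as a parameter,
  // so the suite whose beforeAll()/afterAll() actually run owns the cleanup.
  def moveTarget(tempDir: File): File = new File(tempDir, "serialized")
}
```

This matches the diff above: `testFileMove` gains a `tempDir` parameter, and the helper object drops `SparkFunSuite`/`DefaultReadWriteTest` in favor of plain `org.scalatest.Assertions`, making the object's lack of a lifecycle explicit.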