spark git commit: [SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred.
Repository: spark
Updated Branches:
  refs/heads/master 60472dbfd -> cfb25b27c

[SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred.

## What changes were proposed in this pull request?

Update the description of `spark.shuffle.maxChunksBeingTransferred` to note that new incoming connections will be closed when the max is hit and that the client should have a retry mechanism.

Author: jinxing

Closes #18735 from jinxing64/SPARK-21530.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cfb25b27
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cfb25b27
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cfb25b27

Branch: refs/heads/master
Commit: cfb25b27c0b32a8a70a518955fb269314b1fd716
Parents: 60472db
Author: jinxing
Authored: Thu Jul 27 11:55:48 2017 +0800
Committer: Wenchen Fan
Committed: Thu Jul 27 11:55:48 2017 +0800

--
 .../main/java/org/apache/spark/network/util/TransportConf.java | 6 +++++-
 docs/configuration.md                                          | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cfb25b27/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
--
diff --git a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
index ea52e9f..88256b8 100644
--- a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
+++ b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
@@ -258,7 +258,11 @@ public class TransportConf {
   }

   /**
-   * The max number of chunks allowed to being transferred at the same time on shuffle service.
+   * The max number of chunks allowed to be transferred at the same time on shuffle service.
+   * Note that new incoming connections will be closed when the max number is hit. The client will
+   * retry according to the shuffle retry configs (see `spark.shuffle.io.maxRetries` and
+   * `spark.shuffle.io.retryWait`); if those limits are reached the task will fail with fetch
+   * failure.
    */
   public long maxChunksBeingTransferred() {
     return conf.getLong("spark.shuffle.maxChunksBeingTransferred", Long.MAX_VALUE);

http://git-wip-us.apache.org/repos/asf/spark/blob/cfb25b27/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index f4b6f46..500f980 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -635,7 +635,11 @@ Apart from these, the following properties are also available, and may be useful

 spark.shuffle.maxChunksBeingTransferred
 Long.MAX_VALUE
-The max number of chunks allowed to being transferred at the same time on shuffle service.
+The max number of chunks allowed to be transferred at the same time on shuffle service.
+Note that new incoming connections will be closed when the max number is hit. The client will
+retry according to the shuffle retry configs (see spark.shuffle.io.maxRetries and
+spark.shuffle.io.retryWait); if those limits are reached the task will fail with
+fetch failure.
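For context, a minimal sketch of how these settings fit together on the client side. The three config keys are the real ones named in the diff above; the values are illustrative assumptions only, not recommendations from the commit:

```scala
import org.apache.spark.SparkConf

// Hedged sketch: cap concurrent chunk transfers on the shuffle service and
// tune the client-side retry knobs the new doc text points to.
val conf = new SparkConf()
  .set("spark.shuffle.maxChunksBeingTransferred", "4096") // default: Long.MAX_VALUE
  .set("spark.shuffle.io.maxRetries", "6")  // fetch retries before the task fails
  .set("spark.shuffle.io.retryWait", "10s") // wait between consecutive retries
```

With such a cap in place, connections opened past the limit are closed and the client falls back on the retry path described in the new documentation text.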
spark git commit: [SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions
Repository: spark
Updated Branches:
  refs/heads/master cf29828d7 -> 60472dbfd

[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions

## What changes were proposed in this pull request?

This generates documentation for Spark SQL built-in functions. One drawback is that this requires a proper build to generate the built-in function list. Once it is built, it only takes a few seconds via `sql/create-docs.sh`.

Please see https://spark-test.github.io/sparksqldoc/ that I hosted to show the output documentation.

There is a little more work to be done to make the documentation pretty, for example, separating `Arguments:` and `Examples:`, but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by parsing them manually. I will fix these in a follow-up.

This requires `pip install mkdocs` to generate HTMLs from markdown files.

## How was this patch tested?

Manually tested:

```
cd docs
jekyll build
```

,

```
cd docs
jekyll serve
```

and

```
cd sql
create-docs.sh
```

Author: hyukjinkwon

Closes #18702 from HyukjinKwon/SPARK-21485.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/60472dbf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/60472dbf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/60472dbf

Branch: refs/heads/master
Commit: 60472dbfd97acfd6c4420a13f9b32bc9d84219f3
Parents: cf29828
Author: hyukjinkwon
Authored: Wed Jul 26 09:38:51 2017 -0700
Committer: Reynold Xin
Committed: Wed Jul 26 09:38:51 2017 -0700

--
 .gitignore                                      |  2 +
 docs/README.md                                  |  6 +-
 docs/_layouts/global.html                       |  1 +
 docs/_plugins/copy_api_dirs.rb                  | 27 ++
 docs/api.md                                     |  1 +
 docs/index.md                                   |  1 +
 sql/README.md                                   |  2 +
 .../spark/sql/api/python/PythonSQLUtils.scala   |  7 ++
 sql/create-docs.sh                              | 49 +++
 sql/gen-sql-markdown.py                         | 91
 sql/mkdocs.yml                                  | 19
 11 files changed, 203 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/.gitignore
--
diff --git a/.gitignore b/.gitignore
index cf9780d..903297d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -47,6 +47,8 @@ dev/pr-deps/
 dist/
 docs/_site
 docs/api
+sql/docs
+sql/site
 lib_managed/
 lint-r-report.log
 log/

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/docs/README.md
--
diff --git a/docs/README.md b/docs/README.md
index 90e10a1..0090dd0 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -68,6 +68,6 @@ jekyll plugin to run `build/sbt unidoc` before building the site so if you haven
 may take some time as it generates all of the scaladoc. The jekyll plugin also generates
 the PySpark docs using [Sphinx](http://sphinx-doc.org/).

-NOTE: To skip the step of building and copying over the Scala, Python, R API docs, run `SKIP_API=1
-jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, and `SKIP_RDOC=1` can be used to skip a single
-step of the corresponding language.
+NOTE: To skip the step of building and copying over the Scala, Python, R and SQL API docs, run `SKIP_API=1
+jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
+to skip a single step of the corresponding language.

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/docs/_layouts/global.html
--
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 570483c..67b05ec 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -86,6 +86,7 @@
 Java
 Python
 R
+SQL, Built-in Functions

http://git-wip-us.apache.org/repos/asf/spark/blob/60472dbf/docs/_plugins/copy_api_dirs.rb
--
diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index 95e3ba3..00366f8 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -150,4 +150,31 @@ if not (ENV['SKIP_API'] == '1')
     cp("../R/pkg/DESCRIPTION", "api")
   end

+  if not (ENV['SKIP_SQLDOC'] == '1')
+    # Build SQL API docs
+
+    puts "Moving to project root and building API docs."
+    curr_dir = pwd
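For context on what feeds this pipeline: each built-in function's documentation comes from its `ExpressionDescription` metadata, surfaced as `ExpressionInfo` (both named in the commit message above). Below is a hedged Scala re-sketch of the kind of rendering `sql/gen-sql-markdown.py` performs; `FnDoc`, `toMarkdown`, and the exact markdown shape are illustrative assumptions rather than the script's real output, though the `_FUNC_` placeholder substitution is how Spark's usage strings work:

```scala
// Illustrative stand-ins for ExpressionInfo.getName/getUsage/getExtended.
case class FnDoc(name: String, usage: String, extended: String)

// Hypothetical rendering step: substitute the _FUNC_ placeholder used in
// Spark's usage strings and emit one markdown section per function.
def toMarkdown(fn: FnDoc): String = {
  val usage = fn.usage.replace("_FUNC_", fn.name)
  val extended = fn.extended.replace("_FUNC_", fn.name)
  s"### ${fn.name}\n\n$usage\n\n$extended\n"
}

// Example shape of one entry (doc text is illustrative):
println(toMarkdown(FnDoc(
  name = "abs",
  usage = "_FUNC_(expr) - Returns the absolute value of `expr`.",
  extended = "Examples:\n  > SELECT _FUNC_(-1);\n   1")))
```

At this point `extended` is one free-form string, which is why the commit message defers splitting `Arguments:` and `Examples:` into structured fields to a follow-up.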
spark git commit: [SPARK-20988][ML] Logistic regression uses aggregator hierarchy
Repository: spark
Updated Branches:
  refs/heads/master ae4ea5fe2 -> cf29828d7

[SPARK-20988][ML] Logistic regression uses aggregator hierarchy

## What changes were proposed in this pull request?

This change pulls the `LogisticAggregator` class out of LogisticRegression.scala and makes it extend `DifferentiableLossAggregator`. It also changes logistic regression to use the generic `RDDLossFunction` instead of having its own.

Other minor changes:
* L2Regularization accepts `Option[Int => Double]` for features standard deviation
* L2Regularization uses `Vector` type instead of Array
* Some tests added to LeastSquaresAggregator

## How was this patch tested?

Unit test suites are added.

Author: sethah

Closes #18305 from sethah/SPARK-20988.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cf29828d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cf29828d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cf29828d

Branch: refs/heads/master
Commit: cf29828d720611f7633a8114917bac901f76dece
Parents: ae4ea5f
Author: sethah
Authored: Wed Jul 26 13:38:53 2017 +0200
Committer: Nick Pentreath
Committed: Wed Jul 26 13:38:53 2017 +0200

--
 .../ml/classification/LogisticRegression.scala  | 492 +--
 .../optim/aggregator/LogisticAggregator.scala   | 364 ++
 .../loss/DifferentiableRegularization.scala     |  55 ++-
 .../spark/ml/optim/loss/RDDLossFunction.scala   |  10 +-
 .../spark/ml/regression/LinearRegression.scala  |   3 +-
 .../LogisticRegressionSuite.scala               |   9 +-
 .../DifferentiableLossAggregatorSuite.scala     |  37 ++
 .../LeastSquaresAggregatorSuite.scala           |  47 +-
 .../aggregator/LogisticAggregatorSuite.scala    | 253 ++
 .../DifferentiableRegularizationSuite.scala     |  13 +-
 .../ml/optim/loss/RDDLossFunctionSuite.scala    |   6 +-
 11 files changed, 752 insertions(+), 537 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cf29828d/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 65b09e5..6bba7f9 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -27,11 +27,11 @@ import org.apache.hadoop.fs.Path

 import org.apache.spark.SparkException
 import org.apache.spark.annotation.{Experimental, Since}
-import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.internal.Logging
 import org.apache.spark.ml.feature.Instance
 import org.apache.spark.ml.linalg._
-import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.optim.aggregator.LogisticAggregator
+import org.apache.spark.ml.optim.loss.{L2Regularization, RDDLossFunction}
 import org.apache.spark.ml.param._
 import org.apache.spark.ml.param.shared._
 import org.apache.spark.ml.util._
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
     val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)

     val bcFeaturesStd = instances.context.broadcast(featuresStd)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-      $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+    val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+      multinomial = isMultinomial)(_)
+    val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+      featuresStd(j / numCoefficientSets)
+    } else {
+      0.0
+    }
+
+    val regularization = if (regParamL2 != 0.0) {
+      val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
+      Some(new L2Regularization(regParamL2, shouldApply,
+        if ($(standardization)) None else Some(getFeaturesStd)))
+    } else {
+      None
+    }
+
+    val costFun = new RDDLossFunction(instances, getAggregatorFunc, regularization,
       $(aggregationDepth))

     val numCoeffsPlusIntercepts = numFeaturesPlusIntercept * numCoefficientSets
@@ -1236,7 +1251,7 @@ object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] {
 * Two MultilabelSummarizer can be merged together to have a statistical summary of the
 * corresponding joint dataset.
 */
-private[classification] class MultiClassSummarizer extends Serializable {
+private[ml] class MultiClassSummarizer extends Serializable {
   // The first element of value in distinctMap is the act
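Since the digest truncates the diff, here is a hedged, self-contained Scala sketch of the aggregator pattern this refactor standardizes on: a per-partition accumulator of loss and gradient that merges associatively, so one generic RDD-backed loss function can serve several linear models. Names and signatures below are simplified stand-ins (Spark's real `DifferentiableLossAggregator` trait and `Instance` class are `private[ml]`), and the squared-error update is a placeholder for the logistic updates `LogisticAggregator` actually implements:

```scala
// Minimal stand-in for Spark ML's package-private Instance class.
case class Instance(label: Double, weight: Double, features: Array[Double])

class SimpleLossAggregator(coefficients: Array[Double]) extends Serializable {
  private var lossSum = 0.0
  private var weightSum = 0.0
  private val gradientSum = new Array[Double](coefficients.length)

  // Model-specific part: plain squared-error loss stands in for the
  // binomial/multinomial updates that LogisticAggregator supplies.
  def add(instance: Instance): this.type = {
    val margin = coefficients.zip(instance.features).map { case (w, x) => w * x }.sum
    val err = margin - instance.label
    lossSum += instance.weight * 0.5 * err * err
    weightSum += instance.weight
    for (i <- gradientSum.indices) gradientSum(i) += instance.weight * err * instance.features(i)
    this
  }

  // Associative merge is what lets a generic RDD loss function combine
  // per-partition aggregators with treeAggregate.
  def merge(other: SimpleLossAggregator): this.type = {
    lossSum += other.lossSum
    weightSum += other.weightSum
    for (i <- gradientSum.indices) gradientSum(i) += other.gradientSum(i)
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}
```

In the commit itself, `RDDLossFunction` composes such an aggregator with the optional `L2Regularization` term shown in the diff above, which is what removes the need for a logistic-specific `LogisticCostFun`.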
spark git commit: [SPARK-21524][ML] unit test fix: ValidatorParamsSuiteHelpers generates wrong temp files
Repository: spark
Updated Branches:
  refs/heads/master 16612638f -> ae4ea5fe2

[SPARK-21524][ML] unit test fix: ValidatorParamsSuiteHelpers generates wrong temp files

## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-21524

ValidatorParamsSuiteHelpers.testFileMove() is generating temp dirs in the wrong place and does not delete them.

ValidatorParamsSuiteHelpers.testFileMove() is invoked by TrainValidationSplitSuite and CrossValidatorSuite. Currently it uses `tempDir` from `TempDirectory`, which unfortunately is never initialized since the `beforeAll()` of `ValidatorParamsSuiteHelpers` is never invoked. On my system, it leaves some temp directories in the assembly folder each time I run TrainValidationSplitSuite and CrossValidatorSuite.

## How was this patch tested?

Unit test fix.

Author: Yuhao Yang

Closes #18728 from hhbyyh/tempDirFix.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ae4ea5fe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ae4ea5fe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ae4ea5fe

Branch: refs/heads/master
Commit: ae4ea5fe253e9083e807bdc769e829f3f37265b4
Parents: 1661263
Author: Yuhao Yang
Authored: Wed Jul 26 10:37:48 2017 +0100
Committer: Sean Owen
Committed: Wed Jul 26 10:37:48 2017 +0100

--
 .../org/apache/spark/ml/tuning/CrossValidatorSuite.scala    | 2 +-
 .../apache/spark/ml/tuning/TrainValidationSplitSuite.scala  | 2 +-
 .../spark/ml/tuning/ValidatorParamsSuiteHelpers.scala       | 9 +++++----
 3 files changed, 7 insertions(+), 6 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4ea5fe/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
index 2791ea7..dc6043e 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
@@ -222,7 +222,7 @@ class CrossValidatorSuite
       .setNumFolds(20)
       .setEstimatorParamMaps(paramMaps)

-    ValidatorParamsSuiteHelpers.testFileMove(cv)
+    ValidatorParamsSuiteHelpers.testFileMove(cv, tempDir)
   }

   test("read/write: CrossValidator with complex estimator") {

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4ea5fe/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
index 71a1776..7c97865 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
@@ -209,7 +209,7 @@ class TrainValidationSplitSuite
       .setEstimatorParamMaps(paramMaps)
       .setSeed(42L)

-    ValidatorParamsSuiteHelpers.testFileMove(tvs)
+    ValidatorParamsSuiteHelpers.testFileMove(tvs, tempDir)
   }

   test("read/write: TrainValidationSplitModel") {

http://git-wip-us.apache.org/repos/asf/spark/blob/ae4ea5fe/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
index 1df673c..eae1f5a 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala
@@ -20,11 +20,12 @@ package org.apache.spark.ml.tuning
 import java.io.File
 import java.nio.file.{Files, StandardCopyOption}

-import org.apache.spark.SparkFunSuite
+import org.scalatest.Assertions
+
 import org.apache.spark.ml.param.{ParamMap, ParamPair, Params}
-import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLReader, MLWritable}
+import org.apache.spark.ml.util.{Identifiable, MLReader, MLWritable}

-object ValidatorParamsSuiteHelpers extends SparkFunSuite with DefaultReadWriteTest {
+object ValidatorParamsSuiteHelpers extends Assertions {
   /**
    * Assert sequences of estimatorParamMaps are identical.
    * If the values for a parameter are not directly comparable with ===
@@ -62,7 +63,7 @@ object ValidatorParamsSuiteHelpers extends SparkFunSuite with DefaultReadWriteTe
    * the path of the estimator so that if the pare
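The lifecycle pitfall behind this fix is easy to reproduce. Below is a hedged, self-contained Scala sketch (names simplified from Spark's `TempDirectory` trait; `TempDirLike`, `BuggyHelpers`, and `FixedHelpers` are illustrative, not Spark code) of why the helper object leaked directories and how passing the caller's `tempDir` fixes it:

```scala
import java.io.File
import java.nio.file.Files

// Simplified stand-in for the TempDirectory suite trait: tempDir is only
// assigned when the test framework runs beforeAll() on an actual suite.
trait TempDirLike {
  protected var tempDir: File = _
  def beforeAll(): Unit = { tempDir = Files.createTempDirectory("spark-test").toFile }
  def afterAll(): Unit = { tempDir.listFiles().foreach(_.delete()); tempDir.delete() }
}

object BuggyHelpers extends TempDirLike {
  // beforeAll() is never invoked on a helper object that is not run as a
  // suite, so tempDir stays null here; new File(null: File, child) yields a
  // path relative to the working directory, which is where the stray
  // directories accumulated.
  def moveTarget(): File = new File(tempDir, "serialized")
}

object FixedHelpers {
  // The fix in this commit: take the owning suite's tempDir as a parameter,
  // so the suite whose beforeAll()/afterAll() actually run owns the cleanup.
  def moveTarget(tempDir: File): File = new File(tempDir, "serialized")
}
```

This matches the diff above: `testFileMove` gains a `tempDir` parameter, and the helper object drops `SparkFunSuite`/`DefaultReadWriteTest` in favor of plain `org.scalatest.Assertions`, making the object's lack of a lifecycle explicit.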