[2/2] spark git commit: Update docs/README.md to put all prereqs together.
Update docs/README.md to put all prereqs together. This pull request groups all the prereq requirements into a single section. cc srowen shivaram Author: Reynold Xin Closes #7951 from rxin/readme-docs and squashes the following commits: ab7ded0 [Reynold Xin] Updated docs/README.md to put all prereqs together. (cherry picked from commit f7abd6bec9d51ed4ab6359e50eac853e64ecae86) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b6e8446a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b6e8446a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b6e8446a Branch: refs/heads/branch-1.5 Commit: b6e8446a421013e4679f5f4d95ffa6f166a9b0b7 Parents: 141f034 Author: Reynold Xin Authored: Tue Aug 4 22:17:14 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 23:22:11 2015 -0700 -- docs/README.md | 43 --- 1 file changed, 8 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b6e8446a/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index 5020989..1f4fd3e 100644 --- a/docs/README.md +++ b/docs/README.md @@ -9,12 +9,13 @@ documentation yourself. Why build it yourself? So that you have the docs that co whichever version of Spark you currently have checked out of revision control. ## Prerequisites -The Spark documenation build uses a number of tools to build HTML docs and API docs in Scala, Python -and R. To get started you can run the following commands +The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, +Python and R. To get started you can run the following commands $ sudo gem install jekyll $ sudo gem install jekyll-redirect-from $ sudo pip install Pygments +$ sudo pip install sphinx $ Rscript -e 'install.packages(c("knitr", "devtools"), repos="http://cran.stat.ucla.edu/";)' @@ -29,17 +30,12 @@ you have checked out or downloaded. In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md. -The markdown code can be compiled to HTML using the [Jekyll tool](http://jekyllrb.com). -`Jekyll` and a few dependencies must be installed for this to work. We recommend -installing via the Ruby Gem dependency manager. Since the exact HTML output -varies between versions of Jekyll and its dependencies, we list specific versions here -in some cases: +Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with +Jekyll will create a directory called `_site` containing index.html as well as the rest of the +compiled files. -$ sudo gem install jekyll -$ sudo gem install jekyll-redirect-from - -Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with Jekyll will create a directory -called `_site` containing index.html as well as the rest of the compiled files. +$ cd docs +$ jekyll build You can modify the default Jekyll build as follows: @@ -50,29 +46,6 @@ You can modify the default Jekyll build as follows: # Build the site with extra features used on the live page $ PRODUCTION=1 jekyll build -## Pygments - -We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages, -so you will also need to install that (it requires Python) by running `sudo pip install Pygments`. 
- -To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile -phase, use the following sytax: - -{% highlight scala %} -// Your scala code goes here, you can replace scala with many other -// supported languages too. -{% endhighlight %} - -## Sphinx - -We use Sphinx to generate Python API docs, so you will need to install it by running -`sudo pip install sphinx`. - -## knitr, devtools - -SparkR documentation is written using `roxygen2` and we use `knitr`, `devtools` to generate -documentation. To install these packages you can run `install.packages(c("knitr", "devtools"))` from a -R console. ## API Docs (Scaladoc, Sphinx, roxygen2) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[1/2] spark git commit: Add a prerequisites section for building docs
Repository: spark Updated Branches: refs/heads/branch-1.5 864d5de6d -> b6e8446a4 Add a prerequisites section for building docs This puts all the install commands that need to be run in one section instead of being spread over many paragraphs cc rxin Author: Shivaram Venkataraman Closes #7912 from shivaram/docs-setup-readme and squashes the following commits: cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs (cherry picked from commit 7abaaad5b169520fbf7299808b2bafde089a16a2) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/141f034b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/141f034b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/141f034b Branch: refs/heads/branch-1.5 Commit: 141f034b4575b101c4f23d4dd5b2a95e14a632ef Parents: 864d5de Author: Shivaram Venkataraman Authored: Mon Aug 3 17:00:59 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 23:22:02 2015 -0700 -- docs/README.md | 10 ++ 1 file changed, 10 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/141f034b/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index d7652e9..5020989 100644 --- a/docs/README.md +++ b/docs/README.md @@ -8,6 +8,16 @@ Read on to learn more about viewing documentation in plain text (i.e., markdown) documentation yourself. Why build it yourself? So that you have the docs that corresponds to whichever version of Spark you currently have checked out of revision control. +## Prerequisites +The Spark documenation build uses a number of tools to build HTML docs and API docs in Scala, Python +and R. To get started you can run the following commands + +$ sudo gem install jekyll +$ sudo gem install jekyll-redirect-from +$ sudo pip install Pygments +$ Rscript -e 'install.packages(c("knitr", "devtools"), repos="http://cran.stat.ucla.edu/";)' + + ## Generating the Documentation HTML We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType
Repository: spark Updated Branches: refs/heads/master d34548587 -> 781c8d71a [SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType Let Decimal carry the correct precision and scale with DecimalType. cc rxin yhuai Author: Davies Liu Closes #7925 from davies/decimal_scale and squashes the following commits: e19701a [Davies Liu] some tweaks 57d78d2 [Davies Liu] fix tests 5d5bc69 [Davies Liu] match precision and scale with DecimalType Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/781c8d71 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/781c8d71 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/781c8d71 Branch: refs/heads/master Commit: 781c8d71a0a6a86c84048a4f22cb3a7d035a5be2 Parents: d345485 Author: Davies Liu Authored: Tue Aug 4 23:12:49 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 23:12:49 2015 -0700 -- .../main/scala/org/apache/spark/sql/Row.scala | 4 +++ .../sql/catalyst/CatalystTypeConverters.scala | 21 ++- .../catalyst/analysis/HiveTypeCoercion.scala| 4 +-- .../spark/sql/catalyst/expressions/Cast.scala | 6 ++-- .../sql/catalyst/expressions/arithmetic.scala | 2 +- .../org/apache/spark/sql/types/Decimal.scala| 37 .../spark/sql/types/decimal/DecimalSuite.scala | 21 +++ .../sql/execution/SparkSqlSerializer2.scala | 3 +- .../apache/spark/sql/execution/pythonUDFs.scala | 3 +- .../org/apache/spark/sql/json/InferSchema.scala | 21 --- .../apache/spark/sql/json/JacksonParser.scala | 5 ++- .../apache/spark/sql/JavaApplySchemaSuite.java | 2 +- .../columnar/InMemoryColumnarQuerySuite.scala | 2 +- .../org/apache/spark/sql/json/JsonSuite.scala | 26 ++ .../spark/sql/parquet/ParquetQuerySuite.scala | 13 +++ .../hive/execution/ScriptTransformation.scala | 2 +- 16 files changed, 122 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/781c8d71/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala index 9144947..40159aa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala @@ -417,6 +417,10 @@ trait Row extends Serializable { if (!o2.isInstanceOf[Double] || ! 
java.lang.Double.isNaN(o2.asInstanceOf[Double])) { return false } + case d1: java.math.BigDecimal if o2.isInstanceOf[java.math.BigDecimal] => +if (d1.compareTo(o2.asInstanceOf[java.math.BigDecimal]) != 0) { + return false +} case _ => if (o1 != o2) { return false } http://git-wip-us.apache.org/repos/asf/spark/blob/781c8d71/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala index c666864..8d0c64e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala @@ -317,18 +317,23 @@ object CatalystTypeConverters { private class DecimalConverter(dataType: DecimalType) extends CatalystTypeConverter[Any, JavaBigDecimal, Decimal] { -override def toCatalystImpl(scalaValue: Any): Decimal = scalaValue match { - case d: BigDecimal => Decimal(d) - case d: JavaBigDecimal => Decimal(d) - case d: Decimal => d +override def toCatalystImpl(scalaValue: Any): Decimal = { + val decimal = scalaValue match { +case d: BigDecimal => Decimal(d) +case d: JavaBigDecimal => Decimal(d) +case d: Decimal => d + } + if (decimal.changePrecision(dataType.precision, dataType.scale)) { +decimal + } else { +null + } } override def toScala(catalystValue: Decimal): JavaBigDecimal = catalystValue.toJavaBigDecimal override def toScalaImpl(row: InternalRow, column: Int): JavaBigDecimal = row.getDecimal(column, dataType.precision, dataType.scale).toJavaBigDecimal } - private object BigDecimalConverter extends DecimalConverter(DecimalType.SYSTEM_DEFAULT) - private abstract class PrimitiveConverter[T] extends CatalystTypeConverter[T, Any, Any] { final override def toScala(catalystValue: Any): Any = catalystValue final over
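The new DecimalConverter above returns null whenever a value cannot be carried at the declared precision and scale. A minimal stand-alone sketch of that contract, using plain java.math.BigDecimal instead of Spark's Decimal class — the helper name and rounding mode are illustrative assumptions, not part of the patch:

    import java.math.{BigDecimal => JBigDecimal, RoundingMode}

    // Sketch only: mimics the DecimalConverter contract in the diff above.
    // A value is kept if it fits DecimalType(precision, scale); otherwise the
    // converter yields null, which is what toCatalystImpl now returns when
    // changePrecision fails.
    def fitDecimal(value: JBigDecimal, precision: Int, scale: Int): JBigDecimal = {
      val rescaled = value.setScale(scale, RoundingMode.HALF_UP) // rounding mode assumed
      if (rescaled.precision() <= precision) rescaled else null
    }

    // fitDecimal(new JBigDecimal("123.456"), 5, 2)  // => 123.46
    // fitDecimal(new JBigDecimal("12345.6"), 5, 2)  // => null (needs 7 digits at scale 2)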
spark git commit: [SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType
Repository: spark Updated Branches: refs/heads/branch-1.5 28bb97730 -> 864d5de6d [SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType Let Decimal carry the correct precision and scale with DecimalType. cc rxin yhuai Author: Davies Liu Closes #7925 from davies/decimal_scale and squashes the following commits: e19701a [Davies Liu] some tweaks 57d78d2 [Davies Liu] fix tests 5d5bc69 [Davies Liu] match precision and scale with DecimalType (cherry picked from commit 781c8d71a0a6a86c84048a4f22cb3a7d035a5be2) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/864d5de6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/864d5de6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/864d5de6 Branch: refs/heads/branch-1.5 Commit: 864d5de6da3110974ddf0fbd216e3ef934a9f034 Parents: 28bb977 Author: Davies Liu Authored: Tue Aug 4 23:12:49 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 23:13:03 2015 -0700 -- .../main/scala/org/apache/spark/sql/Row.scala | 4 +++ .../sql/catalyst/CatalystTypeConverters.scala | 21 ++- .../catalyst/analysis/HiveTypeCoercion.scala| 4 +-- .../spark/sql/catalyst/expressions/Cast.scala | 6 ++-- .../sql/catalyst/expressions/arithmetic.scala | 2 +- .../org/apache/spark/sql/types/Decimal.scala| 37 .../spark/sql/types/decimal/DecimalSuite.scala | 21 +++ .../sql/execution/SparkSqlSerializer2.scala | 3 +- .../apache/spark/sql/execution/pythonUDFs.scala | 3 +- .../org/apache/spark/sql/json/InferSchema.scala | 21 --- .../apache/spark/sql/json/JacksonParser.scala | 5 ++- .../apache/spark/sql/JavaApplySchemaSuite.java | 2 +- .../columnar/InMemoryColumnarQuerySuite.scala | 2 +- .../org/apache/spark/sql/json/JsonSuite.scala | 26 ++ .../spark/sql/parquet/ParquetQuerySuite.scala | 13 +++ .../hive/execution/ScriptTransformation.scala | 2 +- 16 files changed, 122 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/864d5de6/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala index 9144947..40159aa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala @@ -417,6 +417,10 @@ trait Row extends Serializable { if (!o2.isInstanceOf[Double] || ! 
java.lang.Double.isNaN(o2.asInstanceOf[Double])) { return false } + case d1: java.math.BigDecimal if o2.isInstanceOf[java.math.BigDecimal] => +if (d1.compareTo(o2.asInstanceOf[java.math.BigDecimal]) != 0) { + return false +} case _ => if (o1 != o2) { return false } http://git-wip-us.apache.org/repos/asf/spark/blob/864d5de6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala index c666864..8d0c64e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala @@ -317,18 +317,23 @@ object CatalystTypeConverters { private class DecimalConverter(dataType: DecimalType) extends CatalystTypeConverter[Any, JavaBigDecimal, Decimal] { -override def toCatalystImpl(scalaValue: Any): Decimal = scalaValue match { - case d: BigDecimal => Decimal(d) - case d: JavaBigDecimal => Decimal(d) - case d: Decimal => d +override def toCatalystImpl(scalaValue: Any): Decimal = { + val decimal = scalaValue match { +case d: BigDecimal => Decimal(d) +case d: JavaBigDecimal => Decimal(d) +case d: Decimal => d + } + if (decimal.changePrecision(dataType.precision, dataType.scale)) { +decimal + } else { +null + } } override def toScala(catalystValue: Decimal): JavaBigDecimal = catalystValue.toJavaBigDecimal override def toScalaImpl(row: InternalRow, column: Int): JavaBigDecimal = row.getDecimal(column, dataType.precision, dataType.scale).toJavaBigDecimal } - private object BigDecimalConverter extends DecimalConverter(DecimalType.SYSTEM_DEFAULT) - private abstract class PrimitiveConverter[T] extends CatalystTypeConverte
spark git commit: [SPARK-8231] [SQL] Add array_contains
Repository: spark Updated Branches: refs/heads/branch-1.5 bca196754 -> 28bb97730 [SPARK-8231] [SQL] Add array_contains This PR is based on #7580 , thanks to EntilZha PR for work on https://issues.apache.org/jira/browse/SPARK-8231 Currently, I have an initial implementation for contains. Based on discussion on JIRA, it should behave same as Hive: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArrayContains.java#L102-L128 Main points are: 1. If the array is empty, null, or the value is null, return false 2. If there is a type mismatch, throw error 3. If comparison is not supported, throw error Closes #7580 Author: Pedro Rodriguez Author: Pedro Rodriguez Author: Davies Liu Closes #7949 from davies/array_contains and squashes the following commits: d3c08bc [Davies Liu] use foreach() to avoid copy bc3d1fe [Davies Liu] fix array_contains 719e37d [Davies Liu] Merge branch 'master' of github.com:apache/spark into array_contains e352cf9 [Pedro Rodriguez] fixed diff from master 4d5b0ff [Pedro Rodriguez] added docs and another type check ffc0591 [Pedro Rodriguez] fixed unit test 7a22deb [Pedro Rodriguez] Changed test to use strings instead of long/ints which are different between python 2 an 3 b5ffae8 [Pedro Rodriguez] fixed pyspark test 4e7dce3 [Pedro Rodriguez] added more docs 3082399 [Pedro Rodriguez] fixed unit test 46f9789 [Pedro Rodriguez] reverted change d3ca013 [Pedro Rodriguez] Fixed type checking to match hive behavior, then added tests to insure this 8528027 [Pedro Rodriguez] added more tests 686e029 [Pedro Rodriguez] fix scala style d262e9d [Pedro Rodriguez] reworked type checking code and added more tests 2517a58 [Pedro Rodriguez] removed unused import 28b4f71 [Pedro Rodriguez] fixed bug with type conversions and re-added tests 12f8795 [Pedro Rodriguez] fix scala style checks e8a20a9 [Pedro Rodriguez] added python df (broken atm) 65b562c [Pedro Rodriguez] made array_contains nullable false 33b45aa [Pedro Rodriguez] reordered test 9623c64 [Pedro Rodriguez] fixed test 4b4425b [Pedro Rodriguez] changed Arrays in tests to Seqs 72cb4b1 [Pedro Rodriguez] added checkInputTypes and docs 69c46fb [Pedro Rodriguez] added tests and codegen 9e0bfc4 [Pedro Rodriguez] initial attempt at implementation Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/28bb9773 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/28bb9773 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/28bb9773 Branch: refs/heads/branch-1.5 Commit: 28bb977302ff3077c82bb8ee7518eb36bddaf2b3 Parents: bca1967 Author: Pedro Rodriguez Authored: Tue Aug 4 22:32:21 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 22:35:45 2015 -0700 -- python/pyspark/sql/functions.py | 17 + .../catalyst/analysis/FunctionRegistry.scala| 1 + .../expressions/collectionOperations.scala | 78 +++- .../expressions/CollectionFunctionsSuite.scala | 15 .../scala/org/apache/spark/sql/functions.scala | 8 ++ .../spark/sql/DataFrameFunctionsSuite.scala | 49 +++- 6 files changed, 163 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/28bb9773/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index e65b14d..9f0d71d 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -1311,6 +1311,23 @@ def array(*cols): return Column(jc) +@since(1.5) +def array_contains(col, value): +""" +Collection function: returns True if the 
array contains the given value. The collection +elements and value must be of the same type. + +:param col: name of column containing array +:param value: value to check for in array + +>>> df = sqlContext.createDataFrame([(["a", "b", "c"],), ([],)], ['data']) +>>> df.select(array_contains(df.data, "a")).collect() +[Row(array_contains(data,a)=True), Row(array_contains(data,a)=False)] +""" +sc = SparkContext._active_spark_context +return Column(sc._jvm.functions.array_contains(_to_java_column(col), value)) + + @since(1.4) def explode(col): """Returns a new row for each element in the given array or map. http://git-wip-us.apache.org/repos/asf/spark/blob/28bb9773/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index 43e3e9b..94c355f 100644 --- a/sql/catalys
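The same array_contains function is exposed on the Scala side (functions.scala in the diff stat above). A hedged usage sketch mirroring the Python doctest in the patch — the SQLContext name and sample data are assumptions:

    import org.apache.spark.sql.functions.array_contains

    // Sketch only: assumes a SQLContext named sqlContext (Spark 1.5 DataFrame API).
    val df = sqlContext.createDataFrame(Seq(
      (Seq("a", "b", "c"), 1),
      (Seq.empty[String], 2)
    )).toDF("data", "id")

    // Per the semantics above, an empty (or null) array yields false rather than an error.
    df.select(array_contains(df("data"), "a")).show()
    // => true for the first row, false for the empty-array row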
spark git commit: [SPARK-8231] [SQL] Add array_contains
Repository: spark Updated Branches: refs/heads/master a02bcf20c -> d34548587 [SPARK-8231] [SQL] Add array_contains This PR is based on #7580 , thanks to EntilZha PR for work on https://issues.apache.org/jira/browse/SPARK-8231 Currently, I have an initial implementation for contains. Based on discussion on JIRA, it should behave same as Hive: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArrayContains.java#L102-L128 Main points are: 1. If the array is empty, null, or the value is null, return false 2. If there is a type mismatch, throw error 3. If comparison is not supported, throw error Closes #7580 Author: Pedro Rodriguez Author: Pedro Rodriguez Author: Davies Liu Closes #7949 from davies/array_contains and squashes the following commits: d3c08bc [Davies Liu] use foreach() to avoid copy bc3d1fe [Davies Liu] fix array_contains 719e37d [Davies Liu] Merge branch 'master' of github.com:apache/spark into array_contains e352cf9 [Pedro Rodriguez] fixed diff from master 4d5b0ff [Pedro Rodriguez] added docs and another type check ffc0591 [Pedro Rodriguez] fixed unit test 7a22deb [Pedro Rodriguez] Changed test to use strings instead of long/ints which are different between python 2 an 3 b5ffae8 [Pedro Rodriguez] fixed pyspark test 4e7dce3 [Pedro Rodriguez] added more docs 3082399 [Pedro Rodriguez] fixed unit test 46f9789 [Pedro Rodriguez] reverted change d3ca013 [Pedro Rodriguez] Fixed type checking to match hive behavior, then added tests to insure this 8528027 [Pedro Rodriguez] added more tests 686e029 [Pedro Rodriguez] fix scala style d262e9d [Pedro Rodriguez] reworked type checking code and added more tests 2517a58 [Pedro Rodriguez] removed unused import 28b4f71 [Pedro Rodriguez] fixed bug with type conversions and re-added tests 12f8795 [Pedro Rodriguez] fix scala style checks e8a20a9 [Pedro Rodriguez] added python df (broken atm) 65b562c [Pedro Rodriguez] made array_contains nullable false 33b45aa [Pedro Rodriguez] reordered test 9623c64 [Pedro Rodriguez] fixed test 4b4425b [Pedro Rodriguez] changed Arrays in tests to Seqs 72cb4b1 [Pedro Rodriguez] added checkInputTypes and docs 69c46fb [Pedro Rodriguez] added tests and codegen 9e0bfc4 [Pedro Rodriguez] initial attempt at implementation Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3454858 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3454858 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3454858 Branch: refs/heads/master Commit: d34548587ab55bc2136c8f823b9e6ae96e1355a4 Parents: a02bcf2 Author: Pedro Rodriguez Authored: Tue Aug 4 22:32:21 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 22:34:02 2015 -0700 -- python/pyspark/sql/functions.py | 17 + .../catalyst/analysis/FunctionRegistry.scala| 1 + .../expressions/collectionOperations.scala | 78 +++- .../expressions/CollectionFunctionsSuite.scala | 15 .../scala/org/apache/spark/sql/functions.scala | 8 ++ .../spark/sql/DataFrameFunctionsSuite.scala | 49 +++- 6 files changed, 163 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d3454858/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index e65b14d..9f0d71d 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -1311,6 +1311,23 @@ def array(*cols): return Column(jc) +@since(1.5) +def array_contains(col, value): +""" +Collection function: returns True if the array 
contains the given value. The collection +elements and value must be of the same type. + +:param col: name of column containing array +:param value: value to check for in array + +>>> df = sqlContext.createDataFrame([(["a", "b", "c"],), ([],)], ['data']) +>>> df.select(array_contains(df.data, "a")).collect() +[Row(array_contains(data,a)=True), Row(array_contains(data,a)=False)] +""" +sc = SparkContext._active_spark_context +return Column(sc._jvm.functions.array_contains(_to_java_column(col), value)) + + @since(1.4) def explode(col): """Returns a new row for each element in the given array or map. http://git-wip-us.apache.org/repos/asf/spark/blob/d3454858/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index 43e3e9b..94c355f 100644 --- a/sql/catalyst/src/ma
spark git commit: [SPARK-9540] [MLLIB] optimize PrefixSpan implementation
Repository: spark Updated Branches: refs/heads/branch-1.5 6e72d24e2 -> bca196754 [SPARK-9540] [MLLIB] optimize PrefixSpan implementation This is a major refactoring of the PrefixSpan implementation. It contains the following changes: 1. Expand prefix with one item at a time. The existing implementation generates all subsets for each itemset, which might have scalability issue when the itemset is large. 2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` because negative numbers are used to indicates partial prefix items, e.g., `_2` is represented by `-2`. 3. Remember the start indices of all partial projections in the projected postfix to help next projection. 4. Reuse the original sequence array for projected postfixes. 5. Use `Prefix` IDs in aggregation rather than its content. 6. Use `ArrayBuilder` for building primitive arrays. 7. Expose `maxLocalProjDBSize`. 8. Tests are not changed except using `0` instead of `-1` as the delimiter. `Postfix`'s API doc should be a good place to start. Closes #7594 feynmanliang zhangjiajin Author: Xiangrui Meng Closes #7937 from mengxr/SPARK-9540 and squashes the following commits: 2d0ec31 [Xiangrui Meng] address more comments 48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test 65f90e8 [Xiangrui Meng] naming and documentation 8afc86a [Xiangrui Meng] refactor impl (cherry picked from commit a02bcf20c4fc9e2e182630d197221729e996afc2) Signed-off-by: Xiangrui Meng Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bca19675 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bca19675 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bca19675 Branch: refs/heads/branch-1.5 Commit: bca196754ddf2ccd057d775bd5c3f7d3e5657e6f Parents: 6e72d24 Author: Xiangrui Meng Authored: Tue Aug 4 22:28:49 2015 -0700 Committer: Xiangrui Meng Committed: Tue Aug 4 22:28:58 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 132 +++-- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 587 --- .../spark/mllib/fpm/PrefixSpanSuite.scala | 271 + 3 files changed, 599 insertions(+), 391 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bca19675/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index ccebf95..3ea1077 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -22,85 +22,89 @@ import scala.collection.mutable import org.apache.spark.Logging /** - * Calculate all patterns of a projected database in local. + * Calculate all patterns of a projected database in local mode. + * + * @param minCount minimal count for a frequent pattern + * @param maxPatternLength max pattern length for a frequent pattern */ -private[fpm] object LocalPrefixSpan extends Logging with Serializable { - import PrefixSpan._ +private[fpm] class LocalPrefixSpan( +val minCount: Long, +val maxPatternLength: Int) extends Logging with Serializable { + import PrefixSpan.Postfix + import LocalPrefixSpan.ReversedPrefix + /** - * Calculate all patterns of a projected database. 
- * @param minCount minimum count - * @param maxPatternLength maximum pattern length - * @param prefixes prefixes in reversed order - * @param database the projected database - * @return a set of sequential pattern pairs, - * the key of pair is sequential pattern (a list of items in reversed order), - * the value of pair is the pattern's count. + * Generates frequent patterns on the input array of postfixes. + * @param postfixes an array of postfixes + * @return an iterator of (frequent pattern, count) */ - def run( - minCount: Long, - maxPatternLength: Int, - prefixes: List[Set[Int]], - database: Iterable[List[Set[Int]]]): Iterator[(List[Set[Int]], Long)] = { -if (prefixes.length == maxPatternLength || database.isEmpty) { - return Iterator.empty -} -val freqItemSetsAndCounts = getFreqItemAndCounts(minCount, database) -val freqItems = freqItemSetsAndCounts.keys.flatten.toSet -val filteredDatabase = database.map { suffix => - suffix -.map(item => freqItems.intersect(item)) -.filter(_.nonEmpty) -} -freqItemSetsAndCounts.iterator.flatMap { case (item, count) => - val newPrefixes = item :: prefixes - val newProjected = project(filteredDatabase, item) - Iterator.single((new
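Point 2 of the change list describes the new internal sequence format: `<(12)(31)>` becomes `[0, 1, 2, 0, 1, 3, 0]`, with `0` delimiting itemsets and negative values reserved for partial prefix items. A small sketch of that encoding (the helper name is an assumption; this is not the PrefixSpan code itself):

    import scala.collection.mutable.ArrayBuffer

    // Sketch only: flattens a sequence of itemsets into the delimiter-based format
    // described in the commit message. Items are positive Ints; each itemset is
    // sorted and separated by 0 (negative values would mark partial prefix items).
    def encode(sequence: Seq[Seq[Int]]): Array[Int] = {
      val buf = ArrayBuffer(0)
      for (itemset <- sequence) {
        buf ++= itemset.sorted
        buf += 0
      }
      buf.toArray
    }

    // encode(Seq(Seq(1, 2), Seq(3, 1)))  // => Array(0, 1, 2, 0, 1, 3, 0)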
spark git commit: [SPARK-9540] [MLLIB] optimize PrefixSpan implementation
Repository: spark Updated Branches: refs/heads/master f7abd6bec -> a02bcf20c [SPARK-9540] [MLLIB] optimize PrefixSpan implementation This is a major refactoring of the PrefixSpan implementation. It contains the following changes: 1. Expand prefix with one item at a time. The existing implementation generates all subsets for each itemset, which might have scalability issue when the itemset is large. 2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` because negative numbers are used to indicates partial prefix items, e.g., `_2` is represented by `-2`. 3. Remember the start indices of all partial projections in the projected postfix to help next projection. 4. Reuse the original sequence array for projected postfixes. 5. Use `Prefix` IDs in aggregation rather than its content. 6. Use `ArrayBuilder` for building primitive arrays. 7. Expose `maxLocalProjDBSize`. 8. Tests are not changed except using `0` instead of `-1` as the delimiter. `Postfix`'s API doc should be a good place to start. Closes #7594 feynmanliang zhangjiajin Author: Xiangrui Meng Closes #7937 from mengxr/SPARK-9540 and squashes the following commits: 2d0ec31 [Xiangrui Meng] address more comments 48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test 65f90e8 [Xiangrui Meng] naming and documentation 8afc86a [Xiangrui Meng] refactor impl Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a02bcf20 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a02bcf20 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a02bcf20 Branch: refs/heads/master Commit: a02bcf20c4fc9e2e182630d197221729e996afc2 Parents: f7abd6b Author: Xiangrui Meng Authored: Tue Aug 4 22:28:49 2015 -0700 Committer: Xiangrui Meng Committed: Tue Aug 4 22:28:49 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 132 +++-- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 587 --- .../spark/mllib/fpm/PrefixSpanSuite.scala | 271 + 3 files changed, 599 insertions(+), 391 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a02bcf20/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index ccebf95..3ea1077 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -22,85 +22,89 @@ import scala.collection.mutable import org.apache.spark.Logging /** - * Calculate all patterns of a projected database in local. + * Calculate all patterns of a projected database in local mode. + * + * @param minCount minimal count for a frequent pattern + * @param maxPatternLength max pattern length for a frequent pattern */ -private[fpm] object LocalPrefixSpan extends Logging with Serializable { - import PrefixSpan._ +private[fpm] class LocalPrefixSpan( +val minCount: Long, +val maxPatternLength: Int) extends Logging with Serializable { + import PrefixSpan.Postfix + import LocalPrefixSpan.ReversedPrefix + /** - * Calculate all patterns of a projected database. 
- * @param minCount minimum count - * @param maxPatternLength maximum pattern length - * @param prefixes prefixes in reversed order - * @param database the projected database - * @return a set of sequential pattern pairs, - * the key of pair is sequential pattern (a list of items in reversed order), - * the value of pair is the pattern's count. + * Generates frequent patterns on the input array of postfixes. + * @param postfixes an array of postfixes + * @return an iterator of (frequent pattern, count) */ - def run( - minCount: Long, - maxPatternLength: Int, - prefixes: List[Set[Int]], - database: Iterable[List[Set[Int]]]): Iterator[(List[Set[Int]], Long)] = { -if (prefixes.length == maxPatternLength || database.isEmpty) { - return Iterator.empty -} -val freqItemSetsAndCounts = getFreqItemAndCounts(minCount, database) -val freqItems = freqItemSetsAndCounts.keys.flatten.toSet -val filteredDatabase = database.map { suffix => - suffix -.map(item => freqItems.intersect(item)) -.filter(_.nonEmpty) -} -freqItemSetsAndCounts.iterator.flatMap { case (item, count) => - val newPrefixes = item :: prefixes - val newProjected = project(filteredDatabase, item) - Iterator.single((newPrefixes, count)) ++ -run(minCount, maxPatternLength, newPrefixes, newProjected) + def run(postfixe
spark git commit: Update docs/README.md to put all prereqs together.
Repository: spark Updated Branches: refs/heads/master d34bac0e1 -> f7abd6bec Update docs/README.md to put all prereqs together. This pull request groups all the prereq requirements into a single section. cc srowen shivaram Author: Reynold Xin Closes #7951 from rxin/readme-docs and squashes the following commits: ab7ded0 [Reynold Xin] Updated docs/README.md to put all prereqs together. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7abd6be Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7abd6be Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7abd6be Branch: refs/heads/master Commit: f7abd6bec9d51ed4ab6359e50eac853e64ecae86 Parents: d34bac0 Author: Reynold Xin Authored: Tue Aug 4 22:17:14 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 22:17:14 2015 -0700 -- docs/README.md | 43 --- 1 file changed, 8 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f7abd6be/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index 5020989..1f4fd3e 100644 --- a/docs/README.md +++ b/docs/README.md @@ -9,12 +9,13 @@ documentation yourself. Why build it yourself? So that you have the docs that co whichever version of Spark you currently have checked out of revision control. ## Prerequisites -The Spark documenation build uses a number of tools to build HTML docs and API docs in Scala, Python -and R. To get started you can run the following commands +The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, +Python and R. To get started you can run the following commands $ sudo gem install jekyll $ sudo gem install jekyll-redirect-from $ sudo pip install Pygments +$ sudo pip install sphinx $ Rscript -e 'install.packages(c("knitr", "devtools"), repos="http://cran.stat.ucla.edu/";)' @@ -29,17 +30,12 @@ you have checked out or downloaded. In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md. -The markdown code can be compiled to HTML using the [Jekyll tool](http://jekyllrb.com). -`Jekyll` and a few dependencies must be installed for this to work. We recommend -installing via the Ruby Gem dependency manager. Since the exact HTML output -varies between versions of Jekyll and its dependencies, we list specific versions here -in some cases: +Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with +Jekyll will create a directory called `_site` containing index.html as well as the rest of the +compiled files. -$ sudo gem install jekyll -$ sudo gem install jekyll-redirect-from - -Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with Jekyll will create a directory -called `_site` containing index.html as well as the rest of the compiled files. +$ cd docs +$ jekyll build You can modify the default Jekyll build as follows: @@ -50,29 +46,6 @@ You can modify the default Jekyll build as follows: # Build the site with extra features used on the live page $ PRODUCTION=1 jekyll build -## Pygments - -We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages, -so you will also need to install that (it requires Python) by running `sudo pip install Pygments`. 
- -To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile -phase, use the following sytax: - -{% highlight scala %} -// Your scala code goes here, you can replace scala with many other -// supported languages too. -{% endhighlight %} - -## Sphinx - -We use Sphinx to generate Python API docs, so you will need to install it by running -`sudo pip install sphinx`. - -## knitr, devtools - -SparkR documentation is written using `roxygen2` and we use `knitr`, `devtools` to generate -documentation. To install these packages you can run `install.packages(c("knitr", "devtools"))` from a -R console. ## API Docs (Scaladoc, Sphinx, roxygen2) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9504] [STREAMING] [TESTS] Fix o.a.s.streaming.StreamingContextSuite.stop gracefully again
Repository: spark Updated Branches: refs/heads/branch-1.5 d196d3607 -> 6e72d24e2 [SPARK-9504] [STREAMING] [TESTS] Fix o.a.s.streaming.StreamingContextSuite.stop gracefully again The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3150/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/ There is a race condition in TestReceiver that it may add 1 record and increase `TestReceiver.counter` after stopping `BlockGenerator`. This PR just adds `join` to wait the pushing thread. Author: zsxwing Closes #7934 from zsxwing/SPARK-9504-2 and squashes the following commits: cfd7973 [zsxwing] Wait for the thread to make sure we won't change TestReceiver.counter after stopping BlockGenerator (cherry picked from commit d34bac0e156432ca6a260db73dbe1318060e309c) Signed-off-by: Tathagata Das Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6e72d24e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6e72d24e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6e72d24e Branch: refs/heads/branch-1.5 Commit: 6e72d24e24d0e6d140e1b25597edae3f3054f98a Parents: d196d36 Author: zsxwing Authored: Tue Aug 4 20:09:15 2015 -0700 Committer: Tathagata Das Committed: Tue Aug 4 20:09:27 2015 -0700 -- .../scala/org/apache/spark/streaming/StreamingContextSuite.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6e72d24e/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala -- diff --git a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala index b7db280..7423ef6 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala @@ -789,7 +789,8 @@ class TestReceiver extends Receiver[Int](StorageLevel.MEMORY_ONLY) with Logging } def onStop() { -// no clean to be done, the receiving thread should stop on it own +// no clean to be done, the receiving thread should stop on it own, so just wait for it. +receivingThreadOption.foreach(_.join()) } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
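The essence of the fix is that stop() must not return while the receiver's pushing thread can still touch shared state; joining the thread closes that window. A generic stand-alone sketch of the pattern (class and field names are illustrative, not taken from the Spark test):

    // Sketch only: without the join() in stop(), the worker thread could still
    // increment `counter` after stop() returns — the race described above.
    class Worker {
      @volatile private var running = true
      private var threadOption: Option[Thread] = None
      var counter = 0

      def start(): Unit = {
        val t = new Thread {
          override def run(): Unit = {
            while (running) { counter += 1; Thread.sleep(10) }
          }
        }
        t.start()
        threadOption = Some(t)
      }

      def stop(): Unit = {
        running = false
        // Same idea as receivingThreadOption.foreach(_.join()) in the patch:
        // join() guarantees the thread has finished before stop() returns.
        threadOption.foreach(_.join())
      }
    }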
spark git commit: [SPARK-9504] [STREAMING] [TESTS] Fix o.a.s.streaming.StreamingContextSuite.stop gracefully again
Repository: spark Updated Branches: refs/heads/master 2b67fdb60 -> d34bac0e1 [SPARK-9504] [STREAMING] [TESTS] Fix o.a.s.streaming.StreamingContextSuite.stop gracefully again The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3150/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/ There is a race condition in TestReceiver that it may add 1 record and increase `TestReceiver.counter` after stopping `BlockGenerator`. This PR just adds `join` to wait the pushing thread. Author: zsxwing Closes #7934 from zsxwing/SPARK-9504-2 and squashes the following commits: cfd7973 [zsxwing] Wait for the thread to make sure we won't change TestReceiver.counter after stopping BlockGenerator Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d34bac0e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d34bac0e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d34bac0e Branch: refs/heads/master Commit: d34bac0e156432ca6a260db73dbe1318060e309c Parents: 2b67fdb Author: zsxwing Authored: Tue Aug 4 20:09:15 2015 -0700 Committer: Tathagata Das Committed: Tue Aug 4 20:09:15 2015 -0700 -- .../scala/org/apache/spark/streaming/StreamingContextSuite.scala | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d34bac0e/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala -- diff --git a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala index b7db280..7423ef6 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala @@ -789,7 +789,8 @@ class TestReceiver extends Receiver[Int](StorageLevel.MEMORY_ONLY) with Logging } def onStop() { -// no clean to be done, the receiving thread should stop on it own +// no clean to be done, the receiving thread should stop on it own, so just wait for it. +receivingThreadOption.foreach(_.join()) } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9513] [SQL] [PySpark] Add python API for DataFrame functions
Repository: spark Updated Branches: refs/heads/branch-1.5 f957c59b3 -> d196d3607 [SPARK-9513] [SQL] [PySpark] Add python API for DataFrame functions This adds Python API for those DataFrame functions that is introduced in 1.5. There is issue with serialize byte_array in Python 3, so some of functions (for BinaryType) does not have tests. cc rxin Author: Davies Liu Closes #7922 from davies/python_functions and squashes the following commits: 8ad942f [Davies Liu] fix test 5fb6ec3 [Davies Liu] fix bugs 3495ed3 [Davies Liu] fix issues ea5f7bb [Davies Liu] Add python API for DataFrame functions (cherry picked from commit 2b67fdb60be95778e016efae4f0a9cdf2fbfe779) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d196d360 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d196d360 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d196d360 Branch: refs/heads/branch-1.5 Commit: d196d3607c26621062369fbe283597f76b85b11c Parents: f957c59 Author: Davies Liu Authored: Tue Aug 4 19:25:24 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 19:25:35 2015 -0700 -- python/pyspark/sql/functions.py | 885 +-- .../scala/org/apache/spark/sql/functions.scala | 80 +- 2 files changed, 628 insertions(+), 337 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d196d360/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index a73ecc7..e65b14d 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -32,41 +32,6 @@ from pyspark.sql.types import StringType from pyspark.sql.column import Column, _to_java_column, _to_seq -__all__ = [ -'array', -'approxCountDistinct', -'bin', -'coalesce', -'countDistinct', -'explode', -'format_number', -'length', -'log2', -'md5', -'monotonicallyIncreasingId', -'rand', -'randn', -'regexp_extract', -'regexp_replace', -'sha1', -'sha2', -'size', -'sort_array', -'sparkPartitionId', -'struct', -'udf', -'when'] - -__all__ += ['lag', 'lead', 'ntile'] - -__all__ += [ -'date_format', 'date_add', 'date_sub', 'add_months', 'months_between', -'year', 'quarter', 'month', 'hour', 'minute', 'second', -'dayofmonth', 'dayofyear', 'weekofyear'] - -__all__ += ['soundex', 'substring', 'substring_index'] - - def _create_function(name, doc=""): """ Create a function for aggregator by name""" def _(col): @@ -208,30 +173,6 @@ for _name, _doc in _binary_mathfunctions.items(): for _name, _doc in _window_functions.items(): globals()[_name] = since(1.4)(_create_window_function(_name, _doc)) del _name, _doc -__all__ += _functions.keys() -__all__ += _functions_1_4.keys() -__all__ += _binary_mathfunctions.keys() -__all__ += _window_functions.keys() -__all__.sort() - - -@since(1.4) -def array(*cols): -"""Creates a new array column. - -:param cols: list of column names (string) or list of :class:`Column` expressions that have -the same data type. 
- ->>> df.select(array('age', 'age').alias("arr")).collect() -[Row(arr=[2, 2]), Row(arr=[5, 5])] ->>> df.select(array([df.age, df.age]).alias("arr")).collect() -[Row(arr=[2, 2]), Row(arr=[5, 5])] -""" -sc = SparkContext._active_spark_context -if len(cols) == 1 and isinstance(cols[0], (list, set)): -cols = cols[0] -jc = sc._jvm.functions.array(_to_seq(sc, cols, _to_java_column)) -return Column(jc) @since(1.3) @@ -249,19 +190,6 @@ def approxCountDistinct(col, rsd=None): return Column(jc) -@ignore_unicode_prefix -@since(1.5) -def bin(col): -"""Returns the string representation of the binary value of the given column. - ->>> df.select(bin(df.age).alias('c')).collect() -[Row(c=u'10'), Row(c=u'101')] -""" -sc = SparkContext._active_spark_context -jc = sc._jvm.functions.bin(_to_java_column(col)) -return Column(jc) - - @since(1.4) def coalesce(*cols): """Returns the first column that is not null. @@ -315,82 +243,6 @@ def countDistinct(col, *cols): @since(1.4) -def explode(col): -"""Returns a new row for each element in the given array or map. - ->>> from pyspark.sql import Row ->>> eDF = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})]) ->>> eDF.select(explode(eDF.intlist).alias("anInt")).collect() -[Row(anInt=1), Row(anInt=2), Row(anInt=3)] - ->>> eDF.select(explode(eDF.mapfield).alias("key", "value")).show() -+---+-+ -|key|value| -+---+-+ -| a|b| -+---+-+ -""" -sc = SparkContext._active_spark_context
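The Python wrappers in this patch delegate to the JVM-side functions object through py4j (the `sc._jvm.functions` calls visible in the diff). A hedged Scala sketch of the corresponding calls, mirroring the doctests shown in the diff; the SQLContext name and sample rows are assumptions:

    import org.apache.spark.sql.functions.{array, bin}

    // Sketch only: the Scala-side counterparts that the new Python API wraps.
    val df = sqlContext.createDataFrame(Seq((2, "Alice"), (5, "Bob"))).toDF("age", "name")

    df.select(bin(df("age")).alias("c")).show()                 // "10", "101"
    df.select(array(df("age"), df("age")).alias("arr")).show()  // [2,2], [5,5]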
spark git commit: [SPARK-9513] [SQL] [PySpark] Add python API for DataFrame functions
Repository: spark Updated Branches: refs/heads/master 6f8f0e265 -> 2b67fdb60 [SPARK-9513] [SQL] [PySpark] Add python API for DataFrame functions This adds Python API for those DataFrame functions that is introduced in 1.5. There is issue with serialize byte_array in Python 3, so some of functions (for BinaryType) does not have tests. cc rxin Author: Davies Liu Closes #7922 from davies/python_functions and squashes the following commits: 8ad942f [Davies Liu] fix test 5fb6ec3 [Davies Liu] fix bugs 3495ed3 [Davies Liu] fix issues ea5f7bb [Davies Liu] Add python API for DataFrame functions Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b67fdb6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b67fdb6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b67fdb6 Branch: refs/heads/master Commit: 2b67fdb60be95778e016efae4f0a9cdf2fbfe779 Parents: 6f8f0e2 Author: Davies Liu Authored: Tue Aug 4 19:25:24 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 19:25:24 2015 -0700 -- python/pyspark/sql/functions.py | 885 +-- .../scala/org/apache/spark/sql/functions.scala | 80 +- 2 files changed, 628 insertions(+), 337 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2b67fdb6/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index a73ecc7..e65b14d 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -32,41 +32,6 @@ from pyspark.sql.types import StringType from pyspark.sql.column import Column, _to_java_column, _to_seq -__all__ = [ -'array', -'approxCountDistinct', -'bin', -'coalesce', -'countDistinct', -'explode', -'format_number', -'length', -'log2', -'md5', -'monotonicallyIncreasingId', -'rand', -'randn', -'regexp_extract', -'regexp_replace', -'sha1', -'sha2', -'size', -'sort_array', -'sparkPartitionId', -'struct', -'udf', -'when'] - -__all__ += ['lag', 'lead', 'ntile'] - -__all__ += [ -'date_format', 'date_add', 'date_sub', 'add_months', 'months_between', -'year', 'quarter', 'month', 'hour', 'minute', 'second', -'dayofmonth', 'dayofyear', 'weekofyear'] - -__all__ += ['soundex', 'substring', 'substring_index'] - - def _create_function(name, doc=""): """ Create a function for aggregator by name""" def _(col): @@ -208,30 +173,6 @@ for _name, _doc in _binary_mathfunctions.items(): for _name, _doc in _window_functions.items(): globals()[_name] = since(1.4)(_create_window_function(_name, _doc)) del _name, _doc -__all__ += _functions.keys() -__all__ += _functions_1_4.keys() -__all__ += _binary_mathfunctions.keys() -__all__ += _window_functions.keys() -__all__.sort() - - -@since(1.4) -def array(*cols): -"""Creates a new array column. - -:param cols: list of column names (string) or list of :class:`Column` expressions that have -the same data type. - ->>> df.select(array('age', 'age').alias("arr")).collect() -[Row(arr=[2, 2]), Row(arr=[5, 5])] ->>> df.select(array([df.age, df.age]).alias("arr")).collect() -[Row(arr=[2, 2]), Row(arr=[5, 5])] -""" -sc = SparkContext._active_spark_context -if len(cols) == 1 and isinstance(cols[0], (list, set)): -cols = cols[0] -jc = sc._jvm.functions.array(_to_seq(sc, cols, _to_java_column)) -return Column(jc) @since(1.3) @@ -249,19 +190,6 @@ def approxCountDistinct(col, rsd=None): return Column(jc) -@ignore_unicode_prefix -@since(1.5) -def bin(col): -"""Returns the string representation of the binary value of the given column. 
- ->>> df.select(bin(df.age).alias('c')).collect() -[Row(c=u'10'), Row(c=u'101')] -""" -sc = SparkContext._active_spark_context -jc = sc._jvm.functions.bin(_to_java_column(col)) -return Column(jc) - - @since(1.4) def coalesce(*cols): """Returns the first column that is not null. @@ -315,82 +243,6 @@ def countDistinct(col, *cols): @since(1.4) -def explode(col): -"""Returns a new row for each element in the given array or map. - ->>> from pyspark.sql import Row ->>> eDF = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})]) ->>> eDF.select(explode(eDF.intlist).alias("anInt")).collect() -[Row(anInt=1), Row(anInt=2), Row(anInt=3)] - ->>> eDF.select(explode(eDF.mapfield).alias("key", "value")).show() -+---+-+ -|key|value| -+---+-+ -| a|b| -+---+-+ -""" -sc = SparkContext._active_spark_context -jc = sc._jvm.functions.explode(_to_java_column(col)) -return Column(jc) - - -@ignore_unicode_pref
spark git commit: [SPARK-7119] [SQL] Give script a default serde with the user specific types
Repository: spark Updated Branches: refs/heads/branch-1.5 11d231159 -> f957c59b3 [SPARK-7119] [SQL] Give script a default serde with the user specific types This is to address this issue that there would be not compatible type exception when running this: `from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2;` 15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57) at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) chenghao-intel marmbrus Author: zhichao.li Closes #6638 from zhichao-li/transDataType2 and squashes the following commits: a36cc7c [zhichao.li] style b9252a8 [zhichao.li] delete cacheRow f6968a4 [zhichao.li] give script a default serde (cherry picked from commit 6f8f0e265a29e89bd5192a8d5217cba19f0875da) Signed-off-by: Michael Armbrust Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f957c59b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f957c59b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f957c59b Branch: refs/heads/branch-1.5 Commit: f957c59b3f7f8851103bb1e36d053dc1402ebb0c Parents: 11d2311 Author: zhichao.li Authored: Tue Aug 4 18:26:05 2015 -0700 Committer: Michael Armbrust Committed: Tue Aug 
4 18:26:18 2015 -0700 -- .../org/apache/spark/sql/hive/HiveQl.scala | 3 +- .../hive/execution/ScriptTransformation.scala | 96 .../sql/hive/execution/SQLQuerySuite.scala | 10 ++ 3 files changed, 49 insertions(+), 60 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f957c59b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala index e2fdfc6..f43e403 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala @@ -21,6 +21,7 @@ import java.sql.Date import java.util.Locale import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.conf.HiveConf.ConfVars import org.apache.hadoop.hive.serde.serdeConstants import org.apache.hadoop.hive.ql.{ErrorMsg, Con
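With the default serde in place, the transform output columns honor their declared types, so the query from the description no longer fails with the UTF8String-to-Integer cast error. A hedged sketch of running it (assumes a HiveContext named hiveContext and the usual Hive `src(key INT, value STRING)` test table):

    // Sketch only: the query from the JIRA description, issued through HiveContext.
    val result = hiveContext.sql("""
      FROM (
        FROM src SELECT TRANSFORM(key, value) USING 'cat' AS (thing1 INT, thing2 STRING)
      ) t
      SELECT thing1 + 2
    """)
    // thing1 now comes back as an Int, so the arithmetic succeeds.
    result.show()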
spark git commit: [SPARK-7119] [SQL] Give script a default serde with the user specific types
Repository: spark Updated Branches: refs/heads/master c9a4c36d0 -> 6f8f0e265 [SPARK-7119] [SQL] Give script a default serde with the user specific types This is to address this issue that there would be not compatible type exception when running this: `from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2;` 15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57) at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) chenghao-intel marmbrus Author: zhichao.li Closes #6638 from zhichao-li/transDataType2 and squashes the following commits: a36cc7c [zhichao.li] style b9252a8 [zhichao.li] delete cacheRow f6968a4 [zhichao.li] give script a default serde Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6f8f0e26 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6f8f0e26 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6f8f0e26 Branch: refs/heads/master Commit: 6f8f0e265a29e89bd5192a8d5217cba19f0875da Parents: c9a4c36 Author: zhichao.li Authored: Tue Aug 4 18:26:05 2015 -0700 Committer: Michael Armbrust Committed: Tue Aug 4 18:26:05 2015 -0700 -- .../org/apache/spark/sql/hive/HiveQl.scala | 3 +- 
.../hive/execution/ScriptTransformation.scala | 96 .../sql/hive/execution/SQLQuerySuite.scala | 10 ++ 3 files changed, 49 insertions(+), 60 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6f8f0e26/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala index e2fdfc6..f43e403 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala @@ -21,6 +21,7 @@ import java.sql.Date import java.util.Locale import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.conf.HiveConf.ConfVars import org.apache.hadoop.hive.serde.serdeConstants import org.apache.hadoop.hive.ql.{ErrorMsg, Context} import org.apache.hadoop.hive.ql.exec.{FunctionRegistry, FunctionInfo} @@ -907,7 +908,7 @@ https://cwik
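For readers who want to exercise the code path this commit fixes: the failing statement is Hive's TRANSFORM clause issued through a HiveContext. A minimal Scala sketch (not part of the patch) follows; it assumes an existing SparkContext `sc` and a Hive table `src(key INT, value STRING)`.

    // Minimal repro sketch for SPARK-7119 (assumes `sc` and a Hive table `src(key, value)`).
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Before this patch the untyped script output came back as UTF8String, so the
    // `thing1 + 2` projection failed with the ClassCastException quoted above; with a
    // default serde the declared output types (int, string) are honored.
    val df = hiveContext.sql(
      "from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t " +
      "select thing1 + 2")
    df.show()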
spark git commit: [SPARK-8313] R Spark packages support
Repository: spark Updated Branches: refs/heads/branch-1.5 02a6333d2 -> 11d231159 [SPARK-8313] R Spark packages support shivaram cafreeman Could you please help me in testing this out? Exposing and running `rPackageBuilder` from inside the shell works, but for some reason, I can't get it to work during Spark Submit. It just starts relaunching Spark Submit. For testing, you may use the R branch with [sbt-spark-package](https://github.com/databricks/sbt-spark-package). You can call spPackage, and then pass the jar using `--jars`. Author: Burak Yavuz Closes #7139 from brkyvz/r-submit and squashes the following commits: 0de384f [Burak Yavuz] remove unused imports 2 d253708 [Burak Yavuz] removed unused imports 6603d0d [Burak Yavuz] addressed comments 4258ffe [Burak Yavuz] merged master ddfcc06 [Burak Yavuz] added zipping test 3a1be7d [Burak Yavuz] don't zip 77995df [Burak Yavuz] fix URI ac45527 [Burak Yavuz] added zipping of all libs e6bf7b0 [Burak Yavuz] add println ignores 1bc5554 [Burak Yavuz] add assumes for tests 9778e03 [Burak Yavuz] addressed comments b42b300 [Burak Yavuz] merged master ffd134e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit d867756 [Burak Yavuz] add apache header eff5ba1 [Burak Yavuz] ready for review 8838edb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit e5b5a06 [Burak Yavuz] added doc bb751ce [Burak Yavuz] fix null bug 0226768 [Burak Yavuz] fixed issues 8810beb [Burak Yavuz] R packages support (cherry picked from commit c9a4c36d052456c2dd1f7e0a871c6b764b5064d2) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/11d23115 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/11d23115 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/11d23115 Branch: refs/heads/branch-1.5 Commit: 11d2311593587a52ee5015fb0ffd6403ea1138b0 Parents: 02a6333 Author: Burak Yavuz Authored: Tue Aug 4 18:20:12 2015 -0700 Committer: Shivaram Venkataraman Committed: Tue Aug 4 18:20:20 2015 -0700 -- R/install-dev.sh| 4 - R/pkg/inst/tests/packageInAJarTest.R| 30 +++ .../scala/org/apache/spark/api/r/RUtils.scala | 14 +- .../org/apache/spark/deploy/RPackageUtils.scala | 232 +++ .../org/apache/spark/deploy/SparkSubmit.scala | 11 +- .../spark/deploy/SparkSubmitArguments.scala | 1 - .../org/apache/spark/deploy/IvyTestUtils.scala | 101 ++-- .../spark/deploy/RPackageUtilsSuite.scala | 156 + .../apache/spark/deploy/SparkSubmitSuite.scala | 24 ++ 9 files changed, 538 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/11d23115/R/install-dev.sh -- diff --git a/R/install-dev.sh b/R/install-dev.sh index 4972bb9..59d98c9 100755 --- a/R/install-dev.sh +++ b/R/install-dev.sh @@ -42,8 +42,4 @@ Rscript -e ' if("devtools" %in% rownames(installed.packages())) { library(devtoo # Install SparkR to $LIB_DIR R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ -# Zip the SparkR package so that it can be distributed to worker nodes on YARN -cd $LIB_DIR -jar cfM "$LIB_DIR/sparkr.zip" SparkR - popd > /dev/null http://git-wip-us.apache.org/repos/asf/spark/blob/11d23115/R/pkg/inst/tests/packageInAJarTest.R -- diff --git a/R/pkg/inst/tests/packageInAJarTest.R b/R/pkg/inst/tests/packageInAJarTest.R new file mode 100644 index 000..207a37a --- /dev/null +++ b/R/pkg/inst/tests/packageInAJarTest.R @@ -0,0 +1,30 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. 
See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +library(SparkR) +library(sparkPackageTest) + +sc <- sparkR.init() + +run1 <- myfunc(5L) + +run2 <- myfunc(-4L) + +sparkR.stop() + +if(run1 != 6) quit(save = "no", status = 1) + +if(run2 != -3) quit(save = "no", status = 1) http://git-wip-us.apache.org/repos/asf/spark/blob/11d23115/core/src/main/scala/org/apache/spark/api/r/RUtils.scala --
spark git commit: [SPARK-8313] R Spark packages support
Repository: spark Updated Branches: refs/heads/master a7fe48f68 -> c9a4c36d0 [SPARK-8313] R Spark packages support shivaram cafreeman Could you please help me in testing this out? Exposing and running `rPackageBuilder` from inside the shell works, but for some reason, I can't get it to work during Spark Submit. It just starts relaunching Spark Submit. For testing, you may use the R branch with [sbt-spark-package](https://github.com/databricks/sbt-spark-package). You can call spPackage, and then pass the jar using `--jars`. Author: Burak Yavuz Closes #7139 from brkyvz/r-submit and squashes the following commits: 0de384f [Burak Yavuz] remove unused imports 2 d253708 [Burak Yavuz] removed unused imports 6603d0d [Burak Yavuz] addressed comments 4258ffe [Burak Yavuz] merged master ddfcc06 [Burak Yavuz] added zipping test 3a1be7d [Burak Yavuz] don't zip 77995df [Burak Yavuz] fix URI ac45527 [Burak Yavuz] added zipping of all libs e6bf7b0 [Burak Yavuz] add println ignores 1bc5554 [Burak Yavuz] add assumes for tests 9778e03 [Burak Yavuz] addressed comments b42b300 [Burak Yavuz] merged master ffd134e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit d867756 [Burak Yavuz] add apache header eff5ba1 [Burak Yavuz] ready for review 8838edb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit e5b5a06 [Burak Yavuz] added doc bb751ce [Burak Yavuz] fix null bug 0226768 [Burak Yavuz] fixed issues 8810beb [Burak Yavuz] R packages support Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c9a4c36d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c9a4c36d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c9a4c36d Branch: refs/heads/master Commit: c9a4c36d052456c2dd1f7e0a871c6b764b5064d2 Parents: a7fe48f Author: Burak Yavuz Authored: Tue Aug 4 18:20:12 2015 -0700 Committer: Shivaram Venkataraman Committed: Tue Aug 4 18:20:12 2015 -0700 -- R/install-dev.sh| 4 - R/pkg/inst/tests/packageInAJarTest.R| 30 +++ .../scala/org/apache/spark/api/r/RUtils.scala | 14 +- .../org/apache/spark/deploy/RPackageUtils.scala | 232 +++ .../org/apache/spark/deploy/SparkSubmit.scala | 11 +- .../spark/deploy/SparkSubmitArguments.scala | 1 - .../org/apache/spark/deploy/IvyTestUtils.scala | 101 ++-- .../spark/deploy/RPackageUtilsSuite.scala | 156 + .../apache/spark/deploy/SparkSubmitSuite.scala | 24 ++ 9 files changed, 538 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c9a4c36d/R/install-dev.sh -- diff --git a/R/install-dev.sh b/R/install-dev.sh index 4972bb9..59d98c9 100755 --- a/R/install-dev.sh +++ b/R/install-dev.sh @@ -42,8 +42,4 @@ Rscript -e ' if("devtools" %in% rownames(installed.packages())) { library(devtoo # Install SparkR to $LIB_DIR R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/ -# Zip the SparkR package so that it can be distributed to worker nodes on YARN -cd $LIB_DIR -jar cfM "$LIB_DIR/sparkr.zip" SparkR - popd > /dev/null http://git-wip-us.apache.org/repos/asf/spark/blob/c9a4c36d/R/pkg/inst/tests/packageInAJarTest.R -- diff --git a/R/pkg/inst/tests/packageInAJarTest.R b/R/pkg/inst/tests/packageInAJarTest.R new file mode 100644 index 000..207a37a --- /dev/null +++ b/R/pkg/inst/tests/packageInAJarTest.R @@ -0,0 +1,30 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. 
+# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +library(SparkR) +library(sparkPackageTest) + +sc <- sparkR.init() + +run1 <- myfunc(5L) + +run2 <- myfunc(-4L) + +sparkR.stop() + +if(run1 != 6) quit(save = "no", status = 1) + +if(run2 != -3) quit(save = "no", status = 1) http://git-wip-us.apache.org/repos/asf/spark/blob/c9a4c36d/core/src/main/scala/org/apache/spark/api/r/RUtils.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/r/RUt
spark git commit: [SPARK-9432][SQL] Audit expression unit tests to make sure we pass the proper numeric ranges
Repository: spark Updated Branches: refs/heads/branch-1.5 2237ddbe0 -> 02a6333d2 [SPARK-9432][SQL] Audit expression unit tests to make sure we pass the proper numeric ranges JIRA: https://issues.apache.org/jira/browse/SPARK-9432 Author: Yijie Shen Closes #7933 from yjshen/numeric_ranges and squashes the following commits: e719f78 [Yijie Shen] proper integral range check (cherry picked from commit a7fe48f68727d5c0247698cff329fb12faff1d50) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02a6333d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02a6333d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02a6333d Branch: refs/heads/branch-1.5 Commit: 02a6333d23345275fba90a3ffa66dc3c3c0e0ff0 Parents: 2237ddb Author: Yijie Shen Authored: Tue Aug 4 18:19:26 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 18:19:34 2015 -0700 -- .../expressions/ArithmeticExpressionSuite.scala | 51 ++-- .../expressions/BitwiseFunctionsSuite.scala | 21 .../expressions/DateExpressionsSuite.scala | 14 ++ .../expressions/IntegralLiteralTestUtils.scala | 42 .../expressions/MathFunctionsSuite.scala| 40 +++ 5 files changed, 164 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/02a6333d/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 0bae8fe..a1f15e4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -23,6 +23,8 @@ import org.apache.spark.sql.types._ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper { + import IntegralLiteralTestUtils._ + /** * Runs through the testFunc for all numeric data types. 
* @@ -47,6 +49,9 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(Add(Literal.create(null, left.dataType), right), null) checkEvaluation(Add(left, Literal.create(null, right.dataType)), null) } +checkEvaluation(Add(positiveShortLit, negativeShortLit), -1.toShort) +checkEvaluation(Add(positiveIntLit, negativeIntLit), -1) +checkEvaluation(Add(positiveLongLit, negativeLongLit), -1L) } test("- (UnaryMinus)") { @@ -60,6 +65,12 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(UnaryMinus(Literal(Int.MinValue)), Int.MinValue) checkEvaluation(UnaryMinus(Literal(Short.MinValue)), Short.MinValue) checkEvaluation(UnaryMinus(Literal(Byte.MinValue)), Byte.MinValue) +checkEvaluation(UnaryMinus(positiveShortLit), (- positiveShort).toShort) +checkEvaluation(UnaryMinus(negativeShortLit), (- negativeShort).toShort) +checkEvaluation(UnaryMinus(positiveIntLit), - positiveInt) +checkEvaluation(UnaryMinus(negativeIntLit), - negativeInt) +checkEvaluation(UnaryMinus(positiveLongLit), - positiveLong) +checkEvaluation(UnaryMinus(negativeLongLit), - negativeLong) } test("- (Minus)") { @@ -70,6 +81,10 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(Subtract(Literal.create(null, left.dataType), right), null) checkEvaluation(Subtract(left, Literal.create(null, right.dataType)), null) } +checkEvaluation(Subtract(positiveShortLit, negativeShortLit), + (positiveShort - negativeShort).toShort) +checkEvaluation(Subtract(positiveIntLit, negativeIntLit), positiveInt - negativeInt) +checkEvaluation(Subtract(positiveLongLit, negativeLongLit), positiveLong - negativeLong) } test("* (Multiply)") { @@ -80,6 +95,10 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(Multiply(Literal.create(null, left.dataType), right), null) checkEvaluation(Multiply(left, Literal.create(null, right.dataType)), null) } +checkEvaluation(Multiply(positiveShortLit, negativeShortLit), + (positiveShort * negativeShort).toShort) +checkEvaluation(Multiply(positiveIntLit, negativeIntLit), positiveInt * negativeInt) +checkEvaluation(Multiply(positiveLongLit, negativeLongLit), positiveLong * negativeLong) } test("/ (Divide) basic") { @@ -99,6 +118,9 @@ class ArithmeticExpressionSuite extends Sp
spark git commit: [SPARK-9432][SQL] Audit expression unit tests to make sure we pass the proper numeric ranges
Repository: spark Updated Branches: refs/heads/master d92fa1417 -> a7fe48f68 [SPARK-9432][SQL] Audit expression unit tests to make sure we pass the proper numeric ranges JIRA: https://issues.apache.org/jira/browse/SPARK-9432 Author: Yijie Shen Closes #7933 from yjshen/numeric_ranges and squashes the following commits: e719f78 [Yijie Shen] proper integral range check Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a7fe48f6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a7fe48f6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a7fe48f6 Branch: refs/heads/master Commit: a7fe48f68727d5c0247698cff329fb12faff1d50 Parents: d92fa14 Author: Yijie Shen Authored: Tue Aug 4 18:19:26 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 18:19:26 2015 -0700 -- .../expressions/ArithmeticExpressionSuite.scala | 51 ++-- .../expressions/BitwiseFunctionsSuite.scala | 21 .../expressions/DateExpressionsSuite.scala | 14 ++ .../expressions/IntegralLiteralTestUtils.scala | 42 .../expressions/MathFunctionsSuite.scala| 40 +++ 5 files changed, 164 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a7fe48f6/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala index 0bae8fe..a1f15e4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala @@ -23,6 +23,8 @@ import org.apache.spark.sql.types._ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper { + import IntegralLiteralTestUtils._ + /** * Runs through the testFunc for all numeric data types. 
* @@ -47,6 +49,9 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(Add(Literal.create(null, left.dataType), right), null) checkEvaluation(Add(left, Literal.create(null, right.dataType)), null) } +checkEvaluation(Add(positiveShortLit, negativeShortLit), -1.toShort) +checkEvaluation(Add(positiveIntLit, negativeIntLit), -1) +checkEvaluation(Add(positiveLongLit, negativeLongLit), -1L) } test("- (UnaryMinus)") { @@ -60,6 +65,12 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(UnaryMinus(Literal(Int.MinValue)), Int.MinValue) checkEvaluation(UnaryMinus(Literal(Short.MinValue)), Short.MinValue) checkEvaluation(UnaryMinus(Literal(Byte.MinValue)), Byte.MinValue) +checkEvaluation(UnaryMinus(positiveShortLit), (- positiveShort).toShort) +checkEvaluation(UnaryMinus(negativeShortLit), (- negativeShort).toShort) +checkEvaluation(UnaryMinus(positiveIntLit), - positiveInt) +checkEvaluation(UnaryMinus(negativeIntLit), - negativeInt) +checkEvaluation(UnaryMinus(positiveLongLit), - positiveLong) +checkEvaluation(UnaryMinus(negativeLongLit), - negativeLong) } test("- (Minus)") { @@ -70,6 +81,10 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(Subtract(Literal.create(null, left.dataType), right), null) checkEvaluation(Subtract(left, Literal.create(null, right.dataType)), null) } +checkEvaluation(Subtract(positiveShortLit, negativeShortLit), + (positiveShort - negativeShort).toShort) +checkEvaluation(Subtract(positiveIntLit, negativeIntLit), positiveInt - negativeInt) +checkEvaluation(Subtract(positiveLongLit, negativeLongLit), positiveLong - negativeLong) } test("* (Multiply)") { @@ -80,6 +95,10 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(Multiply(Literal.create(null, left.dataType), right), null) checkEvaluation(Multiply(left, Literal.create(null, right.dataType)), null) } +checkEvaluation(Multiply(positiveShortLit, negativeShortLit), + (positiveShort * negativeShort).toShort) +checkEvaluation(Multiply(positiveIntLit, negativeIntLit), positiveInt * negativeInt) +checkEvaluation(Multiply(positiveLongLit, negativeLongLit), positiveLong * negativeLong) } test("/ (Divide) basic") { @@ -99,6 +118,9 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper checkEvaluation(Divide(Literal(1.toShort), Literal(2.toShort))
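The audit hinges on the boundary constants imported from IntegralLiteralTestUtils. The diff only shows the expected results (for example, positiveShort + negativeShort evaluating to -1), so the sketch below is an assumption about what the helper object contains, chosen to be consistent with those expectations: each literal deliberately overflows the next-smaller integral type, so any accidental narrowing inside an expression changes the result and fails the test.

    // Hedged sketch of IntegralLiteralTestUtils; the exact values are assumptions that merely
    // satisfy the expectations visible in the diff (e.g. positiveShort + negativeShort == -1).
    import org.apache.spark.sql.catalyst.expressions.Literal

    object IntegralLiteralTestUtils {
      // Shorts that do not fit in a Byte
      val positiveShort: Short = (Byte.MaxValue + 1).toShort   // 128
      val negativeShort: Short = (Byte.MinValue - 1).toShort   // -129
      val positiveShortLit: Literal = Literal(positiveShort)
      val negativeShortLit: Literal = Literal(negativeShort)

      // Ints that do not fit in a Short
      val positiveInt: Int = Short.MaxValue + 1                // 32768
      val negativeInt: Int = Short.MinValue - 1                // -32769
      val positiveIntLit: Literal = Literal(positiveInt)
      val negativeIntLit: Literal = Literal(negativeInt)

      // Longs that do not fit in an Int
      val positiveLong: Long = Int.MaxValue + 1L               // 2147483648
      val negativeLong: Long = Int.MinValue - 1L               // -2147483649
      val positiveLongLit: Literal = Literal(positiveLong)
      val negativeLongLit: Literal = Literal(negativeLong)
    }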
spark git commit: [SPARK-8601] [ML] Add an option to disable standardization for linear regression
Repository: spark Updated Branches: refs/heads/master 629e26f7e -> d92fa1417 [SPARK-8601] [ML] Add an option to disable standardization for linear regression All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same. In R, there is an option for this. standardize Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian". Note that the primary author for this PR is holdenk Author: Holden Karau Author: DB Tsai Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits: e856036 [DB Tsai] scala doc 596e96c [DB Tsai] minor bbff347 [DB Tsai] naming baa0805 [DB Tsai] touch up d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression 6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression 332f140 [Holden Karau] Merge in master eebe10a [Holden Karau] Use same comparision operator throughout the test 3f92935 [Holden Karau] merge b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR). 0c334a2 [Holden Karau] Remove extra line 99ce053 [Holden Karau] merge in master e54a8a9 [Holden Karau] Fix long line e47c574 [Holden Karau] Add support for L2 without standardization. 55d3a66 [Holden Karau] Add standardization param for linear regression 00a1dc5 [Holden Karau] Add the param to the linearregression impl Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d92fa141 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d92fa141 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d92fa141 Branch: refs/heads/master Commit: d92fa14179287c996407d9c7d249103109f9cdef Parents: 629e26f Author: Holden Karau Authored: Tue Aug 4 18:15:26 2015 -0700 Committer: Joseph K. 
Bradley Committed: Tue Aug 4 18:15:26 2015 -0700 -- .../ml/classification/LogisticRegression.scala | 6 +- .../spark/ml/regression/LinearRegression.scala | 70 - .../ml/regression/LinearRegressionSuite.scala | 278 ++- 3 files changed, 268 insertions(+), 86 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d92fa141/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index c937b960..0d07383 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -133,9 +133,9 @@ class LogisticRegression(override val uid: String) /** * Whether to standardize the training features before fitting the model. * The coefficients of models will be always returned on the original scale, - * so it will be transparent for users. Note that when no regularization, - * with or without standardization, the models should be always converged to - * the same solution. + * so it will be transparent for users. Note that with/without standardization, + * the models should be always converged to the same solution when no regularization + * is applied. In R's GLMNET package, the default behavior is true as well. * Default is true. * @group setParam * */ http://git-wip-us.apache.org/repos/asf/spark/blob/d92fa141/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index 3b85ba0..92d819b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b
spark git commit: [SPARK-8601] [ML] Add an option to disable standardization for linear regression
Repository: spark Updated Branches: refs/heads/branch-1.5 335097548 -> 2237ddbe0 [SPARK-8601] [ML] Add an option to disable standardization for linear regression All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same. In R, there is an option for this. standardize Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian". Note that the primary author for this PR is holdenk Author: Holden Karau Author: DB Tsai Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits: e856036 [DB Tsai] scala doc 596e96c [DB Tsai] minor bbff347 [DB Tsai] naming baa0805 [DB Tsai] touch up d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression 6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression 332f140 [Holden Karau] Merge in master eebe10a [Holden Karau] Use same comparision operator throughout the test 3f92935 [Holden Karau] merge b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR). 0c334a2 [Holden Karau] Remove extra line 99ce053 [Holden Karau] merge in master e54a8a9 [Holden Karau] Fix long line e47c574 [Holden Karau] Add support for L2 without standardization. 55d3a66 [Holden Karau] Add standardization param for linear regression 00a1dc5 [Holden Karau] Add the param to the linearregression impl (cherry picked from commit d92fa14179287c996407d9c7d249103109f9cdef) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2237ddbe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2237ddbe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2237ddbe Branch: refs/heads/branch-1.5 Commit: 2237ddbe027be084afd85fc5b7a7c22270b6e7f6 Parents: 3350975 Author: Holden Karau Authored: Tue Aug 4 18:15:26 2015 -0700 Committer: Joseph K. 
Bradley Committed: Tue Aug 4 18:15:35 2015 -0700 -- .../ml/classification/LogisticRegression.scala | 6 +- .../spark/ml/regression/LinearRegression.scala | 70 - .../ml/regression/LinearRegressionSuite.scala | 278 ++- 3 files changed, 268 insertions(+), 86 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2237ddbe/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index c937b960..0d07383 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -133,9 +133,9 @@ class LogisticRegression(override val uid: String) /** * Whether to standardize the training features before fitting the model. * The coefficients of models will be always returned on the original scale, - * so it will be transparent for users. Note that when no regularization, - * with or without standardization, the models should be always converged to - * the same solution. + * so it will be transparent for users. Note that with/without standardization, + * the models should be always converged to the same solution when no regularization + * is applied. In R's GLMNET package, the default behavior is true as well. * Default is true. * @group setParam * */ http://git-wip-us.apache.org/repos/asf/spark/blob/2237ddbe/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala inde
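From a user's point of view the change is a single new param on LinearRegression. A hedged usage sketch follows; the setter name setStandardization is an assumption mirroring the LogisticRegression param shown in the diff, and `training` stands for a hypothetical DataFrame with "label" and "features" columns.

    // Hedged sketch: fit without feature standardization (setter name assumed; `training` hypothetical).
    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setMaxIter(100)
      .setRegParam(0.1)
      .setElasticNetParam(0.0)     // pure L2: supported without standardization by this change
      .setStandardization(false)   // penalize coefficients on the original feature scale

    val model = lr.fit(training)
    println(model.weights)         // coefficients are always reported on the original scale

Internally the training data is still standardized for better conditioning; the per-column penalty is rescaled so the objective matches the unstandardized problem, which is exactly the trade-off the commit message describes.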
spark git commit: [SPARK-9609] [MLLIB] Fix spelling of Strategy.defaultStrategy
Repository: spark Updated Branches: refs/heads/branch-1.5 1954a7bb1 -> 335097548 [SPARK-9609] [MLLIB] Fix spelling of Strategy.defaultStrategy jkbradley Author: Feynman Liang Closes #7941 from feynmanliang/SPARK-9609-stategy-spelling and squashes the following commits: d2aafb1 [Feynman Liang] Add deprecated backwards compatibility aa090a8 [Feynman Liang] Fix spelling (cherry picked from commit 629e26f7ee916e70f59b017cb6083aa441b26b2c) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33509754 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33509754 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33509754 Branch: refs/heads/branch-1.5 Commit: 33509754843fe8eba303c720e6c0f6853b861e7e Parents: 1954a7b Author: Feynman Liang Authored: Tue Aug 4 18:13:18 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 18:13:27 2015 -0700 -- .../src/main/scala/org/apache/spark/ml/tree/treeParams.scala | 2 +- .../spark/mllib/tree/configuration/BoostingStrategy.scala| 2 +- .../org/apache/spark/mllib/tree/configuration/Strategy.scala | 8 ++-- 3 files changed, 8 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/33509754/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala index e817090..dbd8d31 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala @@ -163,7 +163,7 @@ private[ml] trait DecisionTreeParams extends PredictorParams { oldAlgo: OldAlgo.Algo, oldImpurity: OldImpurity, subsamplingRate: Double): OldStrategy = { -val strategy = OldStrategy.defaultStategy(oldAlgo) +val strategy = OldStrategy.defaultStrategy(oldAlgo) strategy.impurity = oldImpurity strategy.checkpointInterval = getCheckpointInterval strategy.maxBins = getMaxBins http://git-wip-us.apache.org/repos/asf/spark/blob/33509754/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala index 9fd30c9..50fe2ac 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala @@ -90,7 +90,7 @@ object BoostingStrategy { * @return Configuration for boosting algorithm */ def defaultParams(algo: Algo): BoostingStrategy = { -val treeStrategy = Strategy.defaultStategy(algo) +val treeStrategy = Strategy.defaultStrategy(algo) treeStrategy.maxDepth = 3 algo match { case Algo.Classification => http://git-wip-us.apache.org/repos/asf/spark/blob/33509754/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala index ada227c..de2c784 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala @@ -178,14 +178,14 @@ object Strategy { * @param algo "Classification" or "Regression" */ def defaultStrategy(algo: String): Strategy = { 
-defaultStategy(Algo.fromString(algo)) +defaultStrategy(Algo.fromString(algo)) } /** * Construct a default set of parameters for [[org.apache.spark.mllib.tree.DecisionTree]] * @param algo Algo.Classification or Algo.Regression */ - def defaultStategy(algo: Algo): Strategy = algo match { + def defaultStrategy(algo: Algo): Strategy = algo match { case Algo.Classification => new Strategy(algo = Classification, impurity = Gini, maxDepth = 10, numClasses = 2) @@ -193,4 +193,8 @@ object Strategy { new Strategy(algo = Regression, impurity = Variance, maxDepth = 10, numClasses = 0) } + + @deprecated("Use Strategy.defaultStrategy instead.", "1.5.0") + def defaultStategy(algo: Algo): Strategy = defaultStrategy(algo) + }
spark git commit: [SPARK-9609] [MLLIB] Fix spelling of Strategy.defaultStrategy
Repository: spark Updated Branches: refs/heads/master 7c8fc1f7c -> 629e26f7e [SPARK-9609] [MLLIB] Fix spelling of Strategy.defaultStrategy jkbradley Author: Feynman Liang Closes #7941 from feynmanliang/SPARK-9609-stategy-spelling and squashes the following commits: d2aafb1 [Feynman Liang] Add deprecated backwards compatibility aa090a8 [Feynman Liang] Fix spelling Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/629e26f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/629e26f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/629e26f7 Branch: refs/heads/master Commit: 629e26f7ee916e70f59b017cb6083aa441b26b2c Parents: 7c8fc1f Author: Feynman Liang Authored: Tue Aug 4 18:13:18 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 18:13:18 2015 -0700 -- .../src/main/scala/org/apache/spark/ml/tree/treeParams.scala | 2 +- .../spark/mllib/tree/configuration/BoostingStrategy.scala| 2 +- .../org/apache/spark/mllib/tree/configuration/Strategy.scala | 8 ++-- 3 files changed, 8 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/629e26f7/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala index e817090..dbd8d31 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala @@ -163,7 +163,7 @@ private[ml] trait DecisionTreeParams extends PredictorParams { oldAlgo: OldAlgo.Algo, oldImpurity: OldImpurity, subsamplingRate: Double): OldStrategy = { -val strategy = OldStrategy.defaultStategy(oldAlgo) +val strategy = OldStrategy.defaultStrategy(oldAlgo) strategy.impurity = oldImpurity strategy.checkpointInterval = getCheckpointInterval strategy.maxBins = getMaxBins http://git-wip-us.apache.org/repos/asf/spark/blob/629e26f7/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala index 9fd30c9..50fe2ac 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala @@ -90,7 +90,7 @@ object BoostingStrategy { * @return Configuration for boosting algorithm */ def defaultParams(algo: Algo): BoostingStrategy = { -val treeStrategy = Strategy.defaultStategy(algo) +val treeStrategy = Strategy.defaultStrategy(algo) treeStrategy.maxDepth = 3 algo match { case Algo.Classification => http://git-wip-us.apache.org/repos/asf/spark/blob/629e26f7/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala index ada227c..de2c784 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala @@ -178,14 +178,14 @@ object Strategy { * @param algo "Classification" or "Regression" */ def defaultStrategy(algo: String): Strategy = { -defaultStategy(Algo.fromString(algo)) +defaultStrategy(Algo.fromString(algo)) } /** * Construct a default set 
of parameters for [[org.apache.spark.mllib.tree.DecisionTree]] * @param algo Algo.Classification or Algo.Regression */ - def defaultStategy(algo: Algo): Strategy = algo match { + def defaultStrategy(algo: Algo): Strategy = algo match { case Algo.Classification => new Strategy(algo = Classification, impurity = Gini, maxDepth = 10, numClasses = 2) @@ -193,4 +193,8 @@ object Strategy { new Strategy(algo = Regression, impurity = Variance, maxDepth = 10, numClasses = 0) } + + @deprecated("Use Strategy.defaultStrategy instead.", "1.5.0") + def defaultStategy(algo: Algo): Strategy = defaultStrategy(algo) + }
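The user-visible effect of the rename, as a small sketch grounded in the diff: the correctly spelled entry point is now the real implementation, and the old misspelling survives as a deprecated forwarder so existing callers keep compiling.

    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}

    val strategy = Strategy.defaultStrategy(Algo.Classification)   // new, correctly spelled API
    val legacy   = Strategy.defaultStategy(Algo.Classification)    // still compiles, deprecated since 1.5.0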
spark git commit: [SPARK-9598][SQL] do not expose generic getter in internal row
Repository: spark Updated Branches: refs/heads/master b77d3b968 -> 7c8fc1f7c [SPARK-9598][SQL] do not expose generic getter in internal row Author: Wenchen Fan Closes #7932 from cloud-fan/generic-getter and squashes the following commits: c60de4c [Wenchen Fan] do not expose generic getter in internal row Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7c8fc1f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7c8fc1f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7c8fc1f7 Branch: refs/heads/master Commit: 7c8fc1f7cb837ff5c32811fdeb3ee2b84de2dea4 Parents: b77d3b9 Author: Wenchen Fan Authored: Tue Aug 4 17:05:19 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 17:05:19 2015 -0700 -- .../sql/catalyst/expressions/UnsafeRow.java | 5 -- .../apache/spark/sql/catalyst/InternalRow.scala | 37 ++-- .../expressions/GenericSpecializedGetters.scala | 61 .../sql/catalyst/expressions/Projection.scala | 4 +- .../expressions/SpecificMutableRow.scala| 2 +- .../sql/catalyst/expressions/aggregates.scala | 2 +- .../codegen/GenerateProjection.scala| 2 +- .../spark/sql/catalyst/expressions/rows.scala | 12 ++-- .../spark/sql/types/GenericArrayData.scala | 37 +++- .../datasources/DataSourceStrategy.scala| 6 +- .../spark/sql/columnar/ColumnStatsSuite.scala | 20 +++ 11 files changed, 80 insertions(+), 108 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7c8fc1f7/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java index e6750fc..e3e1622 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java @@ -254,11 +254,6 @@ public final class UnsafeRow extends MutableRow { } @Override - public Object genericGet(int ordinal) { -throw new UnsupportedOperationException(); - } - - @Override public Object get(int ordinal, DataType dataType) { if (isNullAt(ordinal) || dataType instanceof NullType) { return null; http://git-wip-us.apache.org/repos/asf/spark/blob/7c8fc1f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala index 7656d05..7d17cca 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala @@ -17,15 +17,15 @@ package org.apache.spark.sql.catalyst -import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types.{DataType, MapData, ArrayData, Decimal} +import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} /** * An abstract class for row used internal in Spark SQL, which only contain the columns as * internal types. 
*/ -// todo: make InternalRow just extends SpecializedGetters, remove generic getter -abstract class InternalRow extends GenericSpecializedGetters with Serializable { +abstract class InternalRow extends SpecializedGetters with Serializable { def numFields: Int @@ -50,6 +50,31 @@ abstract class InternalRow extends GenericSpecializedGetters with Serializable { false } + // Subclasses of InternalRow should implement all special getters and equals/hashCode, + // or implement this genericGet. + protected def genericGet(ordinal: Int): Any = throw new IllegalStateException( +"Concrete internal rows should implement genericGet, " + + "or implement all special getters and equals/hashCode") + + // default implementation (slow) + private def getAs[T](ordinal: Int) = genericGet(ordinal).asInstanceOf[T] + override def isNullAt(ordinal: Int): Boolean = getAs[AnyRef](ordinal) eq null + override def get(ordinal: Int, dataType: DataType): AnyRef = getAs(ordinal) + override def getBoolean(ordinal: Int): Boolean = getAs(ordinal) + override def getByte(ordinal: Int): Byte = getAs(ordinal) + override def getShort(ordinal: Int): Short = getAs(ordinal) + override def getInt(ordinal: Int): Int = getAs(ordinal) + override def getLong(ordinal: Int): Long = getAs(ordinal) + override def getFloat(ordinal: Int): Float = getAs(ordinal) + override def getDouble(
spark git commit: [SPARK-9598][SQL] do not expose generic getter in internal row
Repository: spark Updated Branches: refs/heads/branch-1.5 cff0fe291 -> 1954a7bb1 [SPARK-9598][SQL] do not expose generic getter in internal row Author: Wenchen Fan Closes #7932 from cloud-fan/generic-getter and squashes the following commits: c60de4c [Wenchen Fan] do not expose generic getter in internal row (cherry picked from commit 7c8fc1f7cb837ff5c32811fdeb3ee2b84de2dea4) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1954a7bb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1954a7bb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1954a7bb Branch: refs/heads/branch-1.5 Commit: 1954a7bb175122b776870530217159cad366ca6c Parents: cff0fe2 Author: Wenchen Fan Authored: Tue Aug 4 17:05:19 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 17:05:27 2015 -0700 -- .../sql/catalyst/expressions/UnsafeRow.java | 5 -- .../apache/spark/sql/catalyst/InternalRow.scala | 37 ++-- .../expressions/GenericSpecializedGetters.scala | 61 .../sql/catalyst/expressions/Projection.scala | 4 +- .../expressions/SpecificMutableRow.scala| 2 +- .../sql/catalyst/expressions/aggregates.scala | 2 +- .../codegen/GenerateProjection.scala| 2 +- .../spark/sql/catalyst/expressions/rows.scala | 12 ++-- .../spark/sql/types/GenericArrayData.scala | 37 +++- .../datasources/DataSourceStrategy.scala| 6 +- .../spark/sql/columnar/ColumnStatsSuite.scala | 20 +++ 11 files changed, 80 insertions(+), 108 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1954a7bb/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java index e6750fc..e3e1622 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java @@ -254,11 +254,6 @@ public final class UnsafeRow extends MutableRow { } @Override - public Object genericGet(int ordinal) { -throw new UnsupportedOperationException(); - } - - @Override public Object get(int ordinal, DataType dataType) { if (isNullAt(ordinal) || dataType instanceof NullType) { return null; http://git-wip-us.apache.org/repos/asf/spark/blob/1954a7bb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala index 7656d05..7d17cca 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala @@ -17,15 +17,15 @@ package org.apache.spark.sql.catalyst -import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types.{DataType, MapData, ArrayData, Decimal} +import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} /** * An abstract class for row used internal in Spark SQL, which only contain the columns as * internal types. 
*/ -// todo: make InternalRow just extends SpecializedGetters, remove generic getter -abstract class InternalRow extends GenericSpecializedGetters with Serializable { +abstract class InternalRow extends SpecializedGetters with Serializable { def numFields: Int @@ -50,6 +50,31 @@ abstract class InternalRow extends GenericSpecializedGetters with Serializable { false } + // Subclasses of InternalRow should implement all special getters and equals/hashCode, + // or implement this genericGet. + protected def genericGet(ordinal: Int): Any = throw new IllegalStateException( +"Concrete internal rows should implement genericGet, " + + "or implement all special getters and equals/hashCode") + + // default implementation (slow) + private def getAs[T](ordinal: Int) = genericGet(ordinal).asInstanceOf[T] + override def isNullAt(ordinal: Int): Boolean = getAs[AnyRef](ordinal) eq null + override def get(ordinal: Int, dataType: DataType): AnyRef = getAs(ordinal) + override def getBoolean(ordinal: Int): Boolean = getAs(ordinal) + override def getByte(ordinal: Int): Byte = getAs(ordinal) + override def getShort(ordinal: Int): Short = getAs(ordinal) + override def getInt(ordinal: Int): Int = getAs(ordinal) + override def getLong(ordinal: Int): Long
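The practical consequence of hiding the generic getter: callers read values through the typed SpecializedGetters API, and generic access has to carry the DataType. A small hedged sketch (not from the patch) of calling code against an InternalRow holding an int, a string, and one field read generically:

    // Hedged sketch of reading from an InternalRow after this change.
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.StringType

    def describe(row: InternalRow): String = {
      val id = row.getInt(0)              // typed getter, no boxing
      val name = row.getUTF8String(1)     // internal string representation (UTF8String)
      val tag = row.get(2, StringType)    // generic access now requires the DataType
      s"$id $name $tag"
    }

Concrete rows either override all the typed getters, or implement the protected genericGet shown in the diff, which the slow default getters fall back to.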
spark git commit: [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol
Repository: spark Updated Branches: refs/heads/branch-1.5 f4e125acf -> cff0fe291 [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol. I don't think setScoreCol was actually used anywhere (based on search). CC: mengxr Author: Joseph K. Bradley Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits: e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol (cherry picked from commit b77d3b9688d56d33737909375d1d0db07da5827b) Signed-off-by: Xiangrui Meng Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cff0fe29 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cff0fe29 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cff0fe29 Branch: refs/heads/branch-1.5 Commit: cff0fe291aa470ef5cf4e5087c7114fb6360572f Parents: f4e125a Author: Joseph K. Bradley Authored: Tue Aug 4 16:52:43 2015 -0700 Committer: Xiangrui Meng Committed: Tue Aug 4 16:52:53 2015 -0700 -- .../spark/ml/evaluation/BinaryClassificationEvaluator.scala | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cff0fe29/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala index 4a82b77..5d5cb7e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.types.DoubleType /** * :: Experimental :: - * Evaluator for binary classification, which expects two input columns: score and label. + * Evaluator for binary classification, which expects two input columns: rawPrediction and label. */ @Experimental class BinaryClassificationEvaluator(override val uid: String) @@ -50,6 +50,13 @@ class BinaryClassificationEvaluator(override val uid: String) def setMetricName(value: String): this.type = set(metricName, value) /** @group setParam */ + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) + + /** + * @group setParam + * @deprecated use [[setRawPredictionCol()]] instead + */ + @deprecated("use setRawPredictionCol instead", "1.5.0") def setScoreCol(value: String): this.type = set(rawPredictionCol, value) /** @group setParam */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol
Repository: spark Updated Branches: refs/heads/master 571d5b536 -> b77d3b968 [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol. I don't think setScoreCol was actually used anywhere (based on search). CC: mengxr Author: Joseph K. Bradley Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits: e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b77d3b96 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b77d3b96 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b77d3b96 Branch: refs/heads/master Commit: b77d3b9688d56d33737909375d1d0db07da5827b Parents: 571d5b5 Author: Joseph K. Bradley Authored: Tue Aug 4 16:52:43 2015 -0700 Committer: Xiangrui Meng Committed: Tue Aug 4 16:52:43 2015 -0700 -- .../spark/ml/evaluation/BinaryClassificationEvaluator.scala | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b77d3b96/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala index 4a82b77..5d5cb7e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.types.DoubleType /** * :: Experimental :: - * Evaluator for binary classification, which expects two input columns: score and label. + * Evaluator for binary classification, which expects two input columns: rawPrediction and label. */ @Experimental class BinaryClassificationEvaluator(override val uid: String) @@ -50,6 +50,13 @@ class BinaryClassificationEvaluator(override val uid: String) def setMetricName(value: String): this.type = set(metricName, value) /** @group setParam */ + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) + + /** + * @group setParam + * @deprecated use [[setRawPredictionCol()]] instead + */ + @deprecated("use setRawPredictionCol instead", "1.5.0") def setScoreCol(value: String): this.type = set(rawPredictionCol, value) /** @group setParam */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
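A hedged usage sketch of the renamed setter; `predictions` stands for a hypothetical DataFrame produced by a binary classifier, with "rawPrediction" and "label" columns.

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val evaluator = new BinaryClassificationEvaluator()
      .setRawPredictionCol("rawPrediction")   // replaces the now-deprecated setScoreCol
      .setLabelCol("label")
      .setMetricName("areaUnderROC")

    val auc = evaluator.evaluate(predictions)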
spark git commit: [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
Repository: spark Updated Branches: refs/heads/branch-1.5 fe4a4f41a -> f4e125acf [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark. This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now. Author: Mike Dusenberry Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits: bb039cb [Mike Dusenberry] Minor documentation update. b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included. 7f0dcb6 [Mike Dusenberry] Updating module docstring. cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data. 687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side. 3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed. 308f197 [Mike Dusenberry] Using properties for better documentation. 1633f86 [Mike Dusenberry] Minor documentation cleanup. f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix. ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner. 3fd4016 [Mike Dusenberry] Updating docstrings. 
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix. a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly. d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction. 4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry. c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions. 329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring. 0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests. c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted. 4ad6819 [Mike Dusenberry] Documenting the and parameters. 3b854b9 [Mike Dusenberry] Minor updates to documentation. 10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods. 119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic. 4d7af
spark git commit: [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
Repository: spark Updated Branches: refs/heads/master 1833d9c08 -> 571d5b536 [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark. This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now. Author: Mike Dusenberry Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits: bb039cb [Mike Dusenberry] Minor documentation update. b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included. 7f0dcb6 [Mike Dusenberry] Updating module docstring. cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data. 687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side. 3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed. 308f197 [Mike Dusenberry] Using properties for better documentation. 1633f86 [Mike Dusenberry] Minor documentation cleanup. f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix. ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner. 3fd4016 [Mike Dusenberry] Updating docstrings. 
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix. a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly. d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction. 4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry. c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions. 329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring. 0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests. c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted. 4ad6819 [Mike Dusenberry] Documenting the and parameters. 3b854b9 [Mike Dusenberry] Minor updates to documentation. 10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods. 119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic. 4d7af86 [
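The conversions mentioned above are delegated to the Scala/Java side and the resulting Java object is re-wrapped in Python. A rough sketch follows; the method names are assumed to mirror the Scala counterparts (toRowMatrix, toIndexedRowMatrix, toCoordinateMatrix), and a running `sc` is again assumed.

    # Converting between the distributed matrix types (Spark 1.5), assuming `sc` exists.
    from pyspark.mllib.linalg.distributed import (
        IndexedRowMatrix, IndexedRow, CoordinateMatrix, MatrixEntry)

    indexed_mat = IndexedRowMatrix(
        sc.parallelize([IndexedRow(0, [1.0, 2.0]), IndexedRow(1, [3.0, 4.0])]))

    # Each conversion builds the target matrix on the JVM and wraps the returned object.
    row_mat = indexed_mat.toRowMatrix()            # row indices are dropped
    coord_mat = indexed_mat.toCoordinateMatrix()   # rows become (i, j, value) entries

    # ...and back again from a CoordinateMatrix.
    round_trip = coord_mat.toIndexedRowMatrix()
    print(round_trip.numRows(), round_trip.numCols())   # 2 rows, 2 columns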
spark git commit: [SPARK-9582] [ML] LDA cleanups
Repository: spark Updated Branches: refs/heads/branch-1.5 e682ee254 -> fe4a4f41a [SPARK-9582] [ML] LDA cleanups Small cleanups to recent LDA additions and docs. CC: feynmanliang Author: Joseph K. Bradley Closes #7916 from jkbradley/lda-cleanups and squashes the following commits: f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local model and online learning 97947aa [Joseph K. Bradley] a few more cleanups 5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood c566915 [Joseph K. Bradley] small edit to make review easier 63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models (cherry picked from commit 1833d9c08f021d991334424d0a6d5ec21d1fccb2) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fe4a4f41 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fe4a4f41 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fe4a4f41 Branch: refs/heads/branch-1.5 Commit: fe4a4f41ad8b686455d58fc2fda9494e8dba5636 Parents: e682ee2 Author: Joseph K. Bradley Authored: Tue Aug 4 15:43:13 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 15:43:20 2015 -0700 -- .../spark/mllib/clustering/LDAModel.scala | 82 +++- .../spark/mllib/clustering/LDAOptimizer.scala | 19 +++-- 2 files changed, 58 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fe4a4f41/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala index 6af90d7..33babda 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala @@ -27,6 +27,7 @@ import org.json4s.jackson.JsonMethods._ import org.apache.spark.SparkContext import org.apache.spark.annotation.Experimental import org.apache.spark.api.java.JavaPairRDD +import org.apache.spark.broadcast.Broadcast import org.apache.spark.graphx.{Edge, EdgeContext, Graph, VertexId} import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors} import org.apache.spark.mllib.util.{Loader, Saveable} @@ -217,26 +218,28 @@ class LocalLDAModel private[clustering] ( // TODO: declare in LDAModel and override once implemented in DistributedLDAModel /** * Calculates a lower bound on the log likelihood of the entire corpus. + * + * See Equation (16) in original Online LDA paper. + * * @param documents test corpus to use for calculating log likelihood * @return variational lower bound on the log likelihood of the entire corpus */ - def logLikelihood(documents: RDD[(Long, Vector)]): Double = bound(documents, + def logLikelihood(documents: RDD[(Long, Vector)]): Double = logLikelihoodBound(documents, docConcentration, topicConcentration, topicsMatrix.toBreeze.toDenseMatrix, gammaShape, k, vocabSize) /** - * Calculate an upper bound bound on perplexity. See Equation (16) in original Online - * LDA paper. + * Calculate an upper bound bound on perplexity. (Lower is better.) + * See Equation (16) in original Online LDA paper. + * * @param documents test corpus to use for calculating perplexity - * @return variational upper bound on log perplexity per word + * @return Variational upper bound on log perplexity per token. 
*/ def logPerplexity(documents: RDD[(Long, Vector)]): Double = { -val corpusWords = documents +val corpusTokenCount = documents .map { case (_, termCounts) => termCounts.toArray.sum } .sum() -val perWordBound = -logLikelihood(documents) / corpusWords - -perWordBound +-logLikelihood(documents) / corpusTokenCount } /** @@ -244,17 +247,20 @@ class LocalLDAModel private[clustering] ( *log p(documents) >= E_q[log p(documents)] - E_q[log q(documents)] * This bound is derived by decomposing the LDA model to: *log p(documents) = E_q[log p(documents)] - E_q[log q(documents)] + D(q|p) - * and noting that the KL-divergence D(q|p) >= 0. See Equation (16) in original Online LDA paper. + * and noting that the KL-divergence D(q|p) >= 0. + * + * See Equation (16) in original Online LDA paper, as well as Appendix A.3 in the JMLR version of + * the original LDA paper. * @param documents a subset of the test corpus * @param alpha document-topic Dirichlet prior parameters - * @param eta topic-word Dirichlet prior parameters + * @param eta topic-word Dirichlet prior parameter * @param lambda parameters for variational q(beta | lambda) topic-word distributions * @param
spark git commit: [SPARK-9582] [ML] LDA cleanups
Repository: spark Updated Branches: refs/heads/master e37545606 -> 1833d9c08 [SPARK-9582] [ML] LDA cleanups Small cleanups to recent LDA additions and docs. CC: feynmanliang Author: Joseph K. Bradley Closes #7916 from jkbradley/lda-cleanups and squashes the following commits: f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local model and online learning 97947aa [Joseph K. Bradley] a few more cleanups 5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood c566915 [Joseph K. Bradley] small edit to make review easier 63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1833d9c0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1833d9c0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1833d9c0 Branch: refs/heads/master Commit: 1833d9c08f021d991334424d0a6d5ec21d1fccb2 Parents: e375456 Author: Joseph K. Bradley Authored: Tue Aug 4 15:43:13 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 15:43:13 2015 -0700 -- .../spark/mllib/clustering/LDAModel.scala | 82 +++- .../spark/mllib/clustering/LDAOptimizer.scala | 19 +++-- 2 files changed, 58 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1833d9c0/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala index 6af90d7..33babda 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala @@ -27,6 +27,7 @@ import org.json4s.jackson.JsonMethods._ import org.apache.spark.SparkContext import org.apache.spark.annotation.Experimental import org.apache.spark.api.java.JavaPairRDD +import org.apache.spark.broadcast.Broadcast import org.apache.spark.graphx.{Edge, EdgeContext, Graph, VertexId} import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors} import org.apache.spark.mllib.util.{Loader, Saveable} @@ -217,26 +218,28 @@ class LocalLDAModel private[clustering] ( // TODO: declare in LDAModel and override once implemented in DistributedLDAModel /** * Calculates a lower bound on the log likelihood of the entire corpus. + * + * See Equation (16) in original Online LDA paper. + * * @param documents test corpus to use for calculating log likelihood * @return variational lower bound on the log likelihood of the entire corpus */ - def logLikelihood(documents: RDD[(Long, Vector)]): Double = bound(documents, + def logLikelihood(documents: RDD[(Long, Vector)]): Double = logLikelihoodBound(documents, docConcentration, topicConcentration, topicsMatrix.toBreeze.toDenseMatrix, gammaShape, k, vocabSize) /** - * Calculate an upper bound bound on perplexity. See Equation (16) in original Online - * LDA paper. + * Calculate an upper bound bound on perplexity. (Lower is better.) + * See Equation (16) in original Online LDA paper. + * * @param documents test corpus to use for calculating perplexity - * @return variational upper bound on log perplexity per word + * @return Variational upper bound on log perplexity per token. 
*/ def logPerplexity(documents: RDD[(Long, Vector)]): Double = { -val corpusWords = documents +val corpusTokenCount = documents .map { case (_, termCounts) => termCounts.toArray.sum } .sum() -val perWordBound = -logLikelihood(documents) / corpusWords - -perWordBound +-logLikelihood(documents) / corpusTokenCount } /** @@ -244,17 +247,20 @@ class LocalLDAModel private[clustering] ( *log p(documents) >= E_q[log p(documents)] - E_q[log q(documents)] * This bound is derived by decomposing the LDA model to: *log p(documents) = E_q[log p(documents)] - E_q[log q(documents)] + D(q|p) - * and noting that the KL-divergence D(q|p) >= 0. See Equation (16) in original Online LDA paper. + * and noting that the KL-divergence D(q|p) >= 0. + * + * See Equation (16) in original Online LDA paper, as well as Appendix A.3 in the JMLR version of + * the original LDA paper. * @param documents a subset of the test corpus * @param alpha document-topic Dirichlet prior parameters - * @param eta topic-word Dirichlet prior parameters + * @param eta topic-word Dirichlet prior parameter * @param lambda parameters for variational q(beta | lambda) topic-word distributions * @param gammaShape shape parameter for random initialization of variational q(theta | gamma) * t
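For readers of the docstring changes above: logLikelihood returns the variational lower bound on the corpus log likelihood, and logPerplexity negates it and normalizes by the total token count of the test corpus (hence the corpusWords -> corpusTokenCount rename). In the notation of the Online LDA paper (a sketch in the paper's symbols, not Spark identifiers):

    \log p(D) = \mathcal{L}(D) + D_{\mathrm{KL}}(q \,\|\, p) \;\ge\; \mathcal{L}(D)
              = \mathbb{E}_q[\log p(D, z, \theta, \beta)] - \mathbb{E}_q[\log q(z, \theta, \beta)],
    \qquad D_{\mathrm{KL}}(q \,\|\, p) \ge 0

    \mathrm{logPerplexity}(D) = \frac{-\,\mathcal{L}(D)}{\sum_{d}\sum_{w} n_{dw}}

where n_{dw} is the count of term w in document d, so the denominator is exactly the token count summed in the new code, and a lower value of the bound is better.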
spark git commit: [SPARK-9602] remove "Akka/Actor" words from comments
Repository: spark Updated Branches: refs/heads/branch-1.5 f771a83f4 -> 560b2da78 [SPARK-9602] remove "Akka/Actor" words from comments https://issues.apache.org/jira/browse/SPARK-9602 Although we have hidden Akka behind RPC interface, I found that the Akka/Actor-related comments are still spreading everywhere. To make it consistent, we shall remove "actor"/"akka" words from the comments... Author: CodingCat Closes #7936 from CodingCat/SPARK-9602 and squashes the following commits: e8296a3 [CodingCat] remove actor words from comments (cherry picked from commit 9d668b73687e697cad2ef7fd3c3ba405e9795593) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/560b2da7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/560b2da7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/560b2da7 Branch: refs/heads/branch-1.5 Commit: 560b2da783bc25bd8767f6888665dadecac916d8 Parents: f771a83 Author: CodingCat Authored: Tue Aug 4 14:54:11 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 14:54:30 2015 -0700 -- .../main/scala/org/apache/spark/api/python/PythonRDD.scala| 2 +- .../scala/org/apache/spark/deploy/LocalSparkCluster.scala | 4 .../main/scala/org/apache/spark/deploy/client/AppClient.scala | 2 +- .../org/apache/spark/deploy/master/LeaderElectionAgent.scala | 6 +++--- .../scala/org/apache/spark/deploy/master/MasterMessages.scala | 2 +- .../spark/deploy/master/ZooKeeperLeaderElectionAgent.scala| 6 +++--- .../main/scala/org/apache/spark/deploy/worker/Worker.scala| 7 --- .../scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 4 ++-- core/src/main/scala/org/apache/spark/rpc/RpcEndpointRef.scala | 2 +- .../org/apache/spark/scheduler/OutputCommitCoordinator.scala | 2 +- .../org/apache/spark/scheduler/cluster/ExecutorData.scala | 2 +- core/src/main/scala/org/apache/spark/util/IdGenerator.scala | 6 +++--- .../spark/deploy/master/CustomRecoveryModeFactory.scala | 4 ++-- .../org/apache/spark/deploy/worker/WorkerWatcherSuite.scala | 5 ++--- project/MimaExcludes.scala| 2 +- .../src/main/scala/org/apache/spark/repl/SparkILoop.scala | 2 +- 16 files changed, 27 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/560b2da7/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala index 55e563e..2a56bf2 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala @@ -794,7 +794,7 @@ private class PythonAccumulatorParam(@transient serverHost: String, serverPort: /** * We try to reuse a single Socket to transfer accumulator updates, as they are all added - * by the DAGScheduler's single-threaded actor anyway. + * by the DAGScheduler's single-threaded RpcEndpoint anyway. 
*/ @transient var socket: Socket = _ http://git-wip-us.apache.org/repos/asf/spark/blob/560b2da7/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala b/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala index 53356ad..83ccaad 100644 --- a/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala +++ b/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala @@ -73,12 +73,8 @@ class LocalSparkCluster( def stop() { logInfo("Shutting down local Spark cluster.") // Stop the workers before the master so they don't get upset that it disconnected -// TODO: In Akka 2.1.x, ActorSystem.awaitTermination hangs when you have remote actors! -// This is unfortunate, but for now we just comment it out. workerRpcEnvs.foreach(_.shutdown()) -// workerActorSystems.foreach(_.awaitTermination()) masterRpcEnvs.foreach(_.shutdown()) -// masterActorSystems.foreach(_.awaitTermination()) masterRpcEnvs.clear() workerRpcEnvs.clear() } http://git-wip-us.apache.org/repos/asf/spark/blob/560b2da7/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala index 7576a29..25ea692 100644 --- a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala +++ b/core/src/main/scala/org/apache/spark/de
spark git commit: [SPARK-9602] remove "Akka/Actor" words from comments
Repository: spark Updated Branches: refs/heads/master ab8ee1a3b -> 9d668b736 [SPARK-9602] remove "Akka/Actor" words from comments https://issues.apache.org/jira/browse/SPARK-9602 Although we have hidden Akka behind RPC interface, I found that the Akka/Actor-related comments are still spreading everywhere. To make it consistent, we shall remove "actor"/"akka" words from the comments... Author: CodingCat Closes #7936 from CodingCat/SPARK-9602 and squashes the following commits: e8296a3 [CodingCat] remove actor words from comments Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d668b73 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d668b73 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d668b73 Branch: refs/heads/master Commit: 9d668b73687e697cad2ef7fd3c3ba405e9795593 Parents: ab8ee1a Author: CodingCat Authored: Tue Aug 4 14:54:11 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 14:54:11 2015 -0700 -- .../main/scala/org/apache/spark/api/python/PythonRDD.scala| 2 +- .../scala/org/apache/spark/deploy/LocalSparkCluster.scala | 4 .../main/scala/org/apache/spark/deploy/client/AppClient.scala | 2 +- .../org/apache/spark/deploy/master/LeaderElectionAgent.scala | 6 +++--- .../scala/org/apache/spark/deploy/master/MasterMessages.scala | 2 +- .../spark/deploy/master/ZooKeeperLeaderElectionAgent.scala| 6 +++--- .../main/scala/org/apache/spark/deploy/worker/Worker.scala| 7 --- .../scala/org/apache/spark/deploy/worker/WorkerWatcher.scala | 4 ++-- core/src/main/scala/org/apache/spark/rpc/RpcEndpointRef.scala | 2 +- .../org/apache/spark/scheduler/OutputCommitCoordinator.scala | 2 +- .../org/apache/spark/scheduler/cluster/ExecutorData.scala | 2 +- core/src/main/scala/org/apache/spark/util/IdGenerator.scala | 6 +++--- .../spark/deploy/master/CustomRecoveryModeFactory.scala | 4 ++-- .../org/apache/spark/deploy/worker/WorkerWatcherSuite.scala | 5 ++--- project/MimaExcludes.scala| 2 +- .../src/main/scala/org/apache/spark/repl/SparkILoop.scala | 2 +- 16 files changed, 27 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9d668b73/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala -- diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala index 55e563e..2a56bf2 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala @@ -794,7 +794,7 @@ private class PythonAccumulatorParam(@transient serverHost: String, serverPort: /** * We try to reuse a single Socket to transfer accumulator updates, as they are all added - * by the DAGScheduler's single-threaded actor anyway. + * by the DAGScheduler's single-threaded RpcEndpoint anyway. 
*/ @transient var socket: Socket = _ http://git-wip-us.apache.org/repos/asf/spark/blob/9d668b73/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala b/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala index 53356ad..83ccaad 100644 --- a/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala +++ b/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala @@ -73,12 +73,8 @@ class LocalSparkCluster( def stop() { logInfo("Shutting down local Spark cluster.") // Stop the workers before the master so they don't get upset that it disconnected -// TODO: In Akka 2.1.x, ActorSystem.awaitTermination hangs when you have remote actors! -// This is unfortunate, but for now we just comment it out. workerRpcEnvs.foreach(_.shutdown()) -// workerActorSystems.foreach(_.awaitTermination()) masterRpcEnvs.foreach(_.shutdown()) -// masterActorSystems.foreach(_.awaitTermination()) masterRpcEnvs.clear() workerRpcEnvs.clear() } http://git-wip-us.apache.org/repos/asf/spark/blob/9d668b73/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala index 7576a29..25ea692 100644 --- a/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala +++ b/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala @@ -257,7 +257,7 @@ private[spark] class AppClient( } def start() { -
spark git commit: [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier
Repository: spark Updated Branches: refs/heads/master 9d668b736 -> e37545606 [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier, plus doc tests for those columns. CC: holdenk yanboliang Author: Joseph K. Bradley Closes #7903 from jkbradley/rf-prob-python and squashes the following commits: c62a83f [Joseph K. Bradley] made unit test more robust 14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier in PySpark Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e3754560 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e3754560 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e3754560 Branch: refs/heads/master Commit: e375456063617cd7000d796024f41e5927f21edd Parents: 9d668b7 Author: Joseph K. Bradley Authored: Tue Aug 4 14:54:26 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 14:54:26 2015 -0700 -- python/pyspark/ml/classification.py | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e3754560/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 291320f..5978d8f 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -347,6 +347,7 @@ class DecisionTreeClassificationModel(DecisionTreeModel): @inherit_doc class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasSeed, + HasRawPredictionCol, HasProbabilityCol, DecisionTreeParams, HasCheckpointInterval): """ `http://en.wikipedia.org/wiki/Random_forest Random Forest` @@ -354,6 +355,7 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred It supports both binary and multiclass labels, as well as both continuous and categorical features. 
+>>> import numpy >>> from numpy import allclose >>> from pyspark.mllib.linalg import Vectors >>> from pyspark.ml.feature import StringIndexer @@ -368,8 +370,13 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred >>> allclose(model.treeWeights, [1.0, 1.0, 1.0]) True >>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"]) ->>> model.transform(test0).head().prediction +>>> result = model.transform(test0).head() +>>> result.prediction 0.0 +>>> numpy.argmax(result.probability) +0 +>>> numpy.argmax(result.rawPrediction) +0 >>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"]) >>> model.transform(test1).head().prediction 1.0 @@ -390,11 +397,13 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", + probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", numTrees=20, featureSubsetStrategy="auto", seed=None): """ __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ + probabilityCol="probability", rawPredictionCol="rawPrediction", \ maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, \ maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", \ numTrees=20, featureSubsetStrategy="auto", seed=None) @@ -427,11 +436,13 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred @keyword_only def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", + probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, seed=None, impurity="gini", numTrees=20, featureSubsetStrategy="auto"): """ setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ + probabilityCol="probability", rawPredictionCol="rawPrediction", \ maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, \
spark git commit: [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier
Repository: spark Updated Branches: refs/heads/branch-1.5 560b2da78 -> e682ee254 [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier, plus doc tests for those columns. CC: holdenk yanboliang Author: Joseph K. Bradley Closes #7903 from jkbradley/rf-prob-python and squashes the following commits: c62a83f [Joseph K. Bradley] made unit test more robust 14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier in PySpark (cherry picked from commit e375456063617cd7000d796024f41e5927f21edd) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e682ee25 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e682ee25 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e682ee25 Branch: refs/heads/branch-1.5 Commit: e682ee25477374737f3b1dfc08c98829564b26d4 Parents: 560b2da Author: Joseph K. Bradley Authored: Tue Aug 4 14:54:26 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 14:54:34 2015 -0700 -- python/pyspark/ml/classification.py | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e682ee25/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 291320f..5978d8f 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -347,6 +347,7 @@ class DecisionTreeClassificationModel(DecisionTreeModel): @inherit_doc class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasSeed, + HasRawPredictionCol, HasProbabilityCol, DecisionTreeParams, HasCheckpointInterval): """ `http://en.wikipedia.org/wiki/Random_forest Random Forest` @@ -354,6 +355,7 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred It supports both binary and multiclass labels, as well as both continuous and categorical features. 
+>>> import numpy >>> from numpy import allclose >>> from pyspark.mllib.linalg import Vectors >>> from pyspark.ml.feature import StringIndexer @@ -368,8 +370,13 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred >>> allclose(model.treeWeights, [1.0, 1.0, 1.0]) True >>> test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"]) ->>> model.transform(test0).head().prediction +>>> result = model.transform(test0).head() +>>> result.prediction 0.0 +>>> numpy.argmax(result.probability) +0 +>>> numpy.argmax(result.rawPrediction) +0 >>> test1 = sqlContext.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"]) >>> model.transform(test1).head().prediction 1.0 @@ -390,11 +397,13 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", + probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", numTrees=20, featureSubsetStrategy="auto", seed=None): """ __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ + probabilityCol="probability", rawPredictionCol="rawPrediction", \ maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, \ maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", \ numTrees=20, featureSubsetStrategy="auto", seed=None) @@ -427,11 +436,13 @@ class RandomForestClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPred @keyword_only def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", + probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, seed=None, impurity="gini", numTrees=20, featureSubsetStrategy="auto"): """ setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ + probabilityCol="probability", rawPredictionCol="rawPredicti
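The doctest added in this diff, collected into one standalone sketch for convenience. It assumes a live `sc` and `sqlContext`; the setup lines not visible in the hunk follow the surrounding doctest, and the output columns use the default names ("probability", "rawPrediction") that this change wires into RandomForestClassifier.

    # Reading the new probability/rawPrediction columns (pyspark.ml, Spark 1.5).
    import numpy
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer

    df = sqlContext.createDataFrame(
        [(1.0, Vectors.dense(1.0)), (0.0, Vectors.sparse(1, [], []))],
        ["label", "features"])
    si_model = StringIndexer(inputCol="label", outputCol="indexed").fit(df)
    td = si_model.transform(df)

    rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
    model = rf.fit(td)

    test0 = sqlContext.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
    result = model.transform(test0).head()
    print(result.prediction)                   # 0.0
    print(numpy.argmax(result.probability))    # index of the most probable class
    print(numpy.argmax(result.rawPrediction))  # same class, by raw tree votes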
spark git commit: [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter
Repository: spark Updated Branches: refs/heads/branch-1.5 43f6b021e -> f771a83f4 [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages. Author: Josh Rosen Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits: 967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 948c344 [Josh Rosen] Add large records tests for KV sorter. 3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method 380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite a49baef [Josh Rosen] Address initial round of review comments 3edb931 [Josh Rosen] Remove accidentally-committed debug statements. 2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter. (cherry picked from commit ab8ee1a3b93286a62949569615086ef5030e9fae) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f771a83f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f771a83f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f771a83f Branch: refs/heads/branch-1.5 Commit: f771a83f4090e979f72d01989e6693d7fbc05c05 Parents: 43f6b02 Author: Josh Rosen Authored: Tue Aug 4 14:42:11 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 14:42:20 2015 -0700 -- .../unsafe/sort/UnsafeExternalSorter.java | 173 ++- .../unsafe/PackedRecordPointerSuite.java| 8 +- .../unsafe/sort/UnsafeExternalSorterSuite.java | 129 +--- .../sql/execution/UnsafeExternalRowSorter.java | 2 +- .../sql/execution/UnsafeKVExternalSorter.java | 11 +- .../execution/UnsafeKVExternalSorterSuite.scala | 210 --- .../unsafe/memory/HeapMemoryAllocator.java | 3 + .../unsafe/memory/UnsafeMemoryAllocator.java| 3 + 8 files changed, 372 insertions(+), 167 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f771a83f/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java index dec7fcf..e6ddd08 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java @@ -34,6 +34,7 @@ import org.apache.spark.TaskContext; import org.apache.spark.executor.ShuffleWriteMetrics; import org.apache.spark.shuffle.ShuffleMemoryManager; import org.apache.spark.storage.BlockManager; +import org.apache.spark.unsafe.array.ByteArrayMethods; import org.apache.spark.unsafe.PlatformDependent; import org.apache.spark.unsafe.memory.MemoryBlock; import org.apache.spark.unsafe.memory.TaskMemoryManager; @@ -143,8 +144,7 @@ public final class UnsafeExternalSorter { taskContext.addOnCompleteCallback(new AbstractFunction0() { @Override public BoxedUnit apply() { -deleteSpillFiles(); -freeMemory(); +cleanupResources(); return null; } }); @@ -249,7 +249,7 @@ public final class UnsafeExternalSorter { * * @return the number of bytes freed. 
*/ - public long freeMemory() { + private long freeMemory() { updatePeakMemoryUsed(); long memoryFreed = 0; for (MemoryBlock block : allocatedPages) { @@ -275,44 +275,32 @@ public final class UnsafeExternalSorter { /** * Deletes any spill files created by this sorter. */ - public void deleteSpillFiles() { + private void deleteSpillFiles() { for (UnsafeSorterSpillWriter spill : spillWriters) { File file = spill.getFile(); if (file != null && file.exists()) { if (!file.delete()) { logger.error("Was unable to delete spill file {}", file.getAbsolutePath()); -}; +} } } } /** - * Checks whether there is enough space to insert a new record into the sorter. - * - * @param requiredSpace the required space in the data page, in bytes, including space for storing - * the record size. - - * @return true if the record can be inserted without requiring more allocations, false otherwise. + * Frees this sorter's in-memory data structures and cleans up its spill files. */ - private boolean haveSpaceForRecord(in
spark git commit: [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter
Repository: spark Updated Branches: refs/heads/master f4b1ac08a -> ab8ee1a3b [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages. Author: Josh Rosen Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits: 967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 948c344 [Josh Rosen] Add large records tests for KV sorter. 3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method 380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite a49baef [Josh Rosen] Address initial round of review comments 3edb931 [Josh Rosen] Remove accidentally-committed debug statements. 2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab8ee1a3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab8ee1a3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab8ee1a3 Branch: refs/heads/master Commit: ab8ee1a3b93286a62949569615086ef5030e9fae Parents: f4b1ac0 Author: Josh Rosen Authored: Tue Aug 4 14:42:11 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 14:42:11 2015 -0700 -- .../unsafe/sort/UnsafeExternalSorter.java | 173 ++- .../unsafe/PackedRecordPointerSuite.java| 8 +- .../unsafe/sort/UnsafeExternalSorterSuite.java | 129 +--- .../sql/execution/UnsafeExternalRowSorter.java | 2 +- .../sql/execution/UnsafeKVExternalSorter.java | 11 +- .../execution/UnsafeKVExternalSorterSuite.scala | 210 --- .../unsafe/memory/HeapMemoryAllocator.java | 3 + .../unsafe/memory/UnsafeMemoryAllocator.java| 3 + 8 files changed, 372 insertions(+), 167 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ab8ee1a3/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java index dec7fcf..e6ddd08 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java @@ -34,6 +34,7 @@ import org.apache.spark.TaskContext; import org.apache.spark.executor.ShuffleWriteMetrics; import org.apache.spark.shuffle.ShuffleMemoryManager; import org.apache.spark.storage.BlockManager; +import org.apache.spark.unsafe.array.ByteArrayMethods; import org.apache.spark.unsafe.PlatformDependent; import org.apache.spark.unsafe.memory.MemoryBlock; import org.apache.spark.unsafe.memory.TaskMemoryManager; @@ -143,8 +144,7 @@ public final class UnsafeExternalSorter { taskContext.addOnCompleteCallback(new AbstractFunction0() { @Override public BoxedUnit apply() { -deleteSpillFiles(); -freeMemory(); +cleanupResources(); return null; } }); @@ -249,7 +249,7 @@ public final class UnsafeExternalSorter { * * @return the number of bytes freed. 
*/ - public long freeMemory() { + private long freeMemory() { updatePeakMemoryUsed(); long memoryFreed = 0; for (MemoryBlock block : allocatedPages) { @@ -275,44 +275,32 @@ public final class UnsafeExternalSorter { /** * Deletes any spill files created by this sorter. */ - public void deleteSpillFiles() { + private void deleteSpillFiles() { for (UnsafeSorterSpillWriter spill : spillWriters) { File file = spill.getFile(); if (file != null && file.exists()) { if (!file.delete()) { logger.error("Was unable to delete spill file {}", file.getAbsolutePath()); -}; +} } } } /** - * Checks whether there is enough space to insert a new record into the sorter. - * - * @param requiredSpace the required space in the data page, in bytes, including space for storing - * the record size. - - * @return true if the record can be inserted without requiring more allocations, false otherwise. + * Frees this sorter's in-memory data structures and cleans up its spill files. */ - private boolean haveSpaceForRecord(int requiredSpace) { -assert(requiredSpace > 0); -assert(inMemSorter != null); -return (inMemSor
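The overflow-page strategy itself lives in the Java UnsafeExternalSorter touched above; the toy Python sketch below only illustrates the idea (a record larger than the fixed page size gets a dedicated page of exactly its size instead of failing the insert). None of the names in it come from Spark.

    # Toy illustration of the overflow-page idea (not Spark code).
    PAGE_SIZE = 64  # bytes; illustrative only

    class PagedBuffer(object):
        def __init__(self):
            self.pages = []      # every allocated page, in insertion order
            self.current = None  # the normal page currently being filled
            self.cursor = 0      # write offset into `current`

        def insert(self, record):
            n = len(record)
            if n > PAGE_SIZE:
                # Large record: store it in its own overflow page sized to the record.
                self.pages.append(bytearray(record))
                return
            if self.current is None or self.cursor + n > PAGE_SIZE:
                # Regular record that no longer fits: start a fresh normal page.
                self.current = bytearray(PAGE_SIZE)
                self.pages.append(self.current)
                self.cursor = 0
            self.current[self.cursor:self.cursor + n] = record
            self.cursor += n

    buf = PagedBuffer()
    buf.insert(b"tiny record")
    buf.insert(b"x" * 1000)   # larger than a page: stored in its own overflow page
    print(len(buf.pages))     # 2: one normal page plus one overflow page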
spark git commit: [SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage
Repository: spark Updated Branches: refs/heads/master a0cc01759 -> f4b1ac08a [SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage Author: Wenchen Fan Closes #7890 from cloud-fan/minor and squashes the following commits: c3b1be3 [Wenchen Fan] fix style b0cbe2e [Wenchen Fan] remove the createCode and createStructCode, and replace the usage of them by createStructCode Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4b1ac08 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4b1ac08 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4b1ac08 Branch: refs/heads/master Commit: f4b1ac08a1327e6d0ddc317cdf3997a0f68dec72 Parents: a0cc017 Author: Wenchen Fan Authored: Tue Aug 4 14:40:46 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 14:40:46 2015 -0700 -- .../codegen/GenerateUnsafeProjection.scala | 161 ++- .../expressions/complexTypeCreator.scala| 10 +- 2 files changed, 17 insertions(+), 154 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f4b1ac08/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala index fc3ecf5..71f8ea0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala @@ -117,161 +117,13 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro } /** - * Generates the code to create an [[UnsafeRow]] object based on the input expressions. 
- * @param ctx context for code generation - * @param ev specifies the name of the variable for the output [[UnsafeRow]] object - * @param expressions input expressions - * @return generated code to put the expression output into an [[UnsafeRow]] - */ - def createCode(ctx: CodeGenContext, ev: GeneratedExpressionCode, expressions: Seq[Expression]) -: String = { - -val ret = ev.primitive -ctx.addMutableState("UnsafeRow", ret, s"$ret = new UnsafeRow();") -val buffer = ctx.freshName("buffer") -ctx.addMutableState("byte[]", buffer, s"$buffer = new byte[64];") -val cursor = ctx.freshName("cursor") -val numBytes = ctx.freshName("numBytes") - -val exprs = expressions.map { e => e.dataType match { - case st: StructType => createCodeForStruct(ctx, e.gen(ctx), st) - case _ => e.gen(ctx) -}} -val allExprs = exprs.map(_.code).mkString("\n") - -val fixedSize = 8 * exprs.length + UnsafeRow.calculateBitSetWidthInBytes(exprs.length) -val additionalSize = expressions.zipWithIndex.map { - case (e, i) => genAdditionalSize(e.dataType, exprs(i)) -}.mkString("") - -val writers = expressions.zipWithIndex.map { case (e, i) => - val update = genFieldWriter(ctx, e.dataType, exprs(i), ret, i, cursor) - s"""if (${exprs(i).isNull}) { -$ret.setNullAt($i); - } else { -$update; - }""" -}.mkString("\n ") - -s""" - $allExprs - int $numBytes = $fixedSize $additionalSize; - if ($numBytes > $buffer.length) { -$buffer = new byte[$numBytes]; - } - - $ret.pointTo( -$buffer, -$PlatformDependent.BYTE_ARRAY_OFFSET, -${expressions.size}, -$numBytes); - int $cursor = $fixedSize; - - $writers - boolean ${ev.isNull} = false; - """ - } - - /** - * Generates the Java code to convert a struct (backed by InternalRow) to UnsafeRow. - * - * This function also handles nested structs by recursively generating the code to do conversion. - * - * @param ctx code generation context - * @param input the input struct, identified by a [[GeneratedExpressionCode]] - * @param schema schema of the struct field - */ - // TODO: refactor createCode and this function to reduce code duplication. - private def createCodeForStruct( - ctx: CodeGenContext, - input: GeneratedExpressionCode, - schema: StructType): GeneratedExpressionCode = { - -val isNull = input.isNull -val primitive = ctx.freshName("structConvert") -ctx.addMutableState("UnsafeRow", primitive, s"$primitive = new UnsafeRow();") -val buffer = ctx.freshName("buffer") -ctx.addMutableState("byte[]", buffer, s"$buffer = new byte[64];") -val cursor = ctx.freshName("cursor") - -va
spark git commit: [SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage
Repository: spark Updated Branches: refs/heads/branch-1.5 be37b1bd3 -> 43f6b021e [SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage Author: Wenchen Fan Closes #7890 from cloud-fan/minor and squashes the following commits: c3b1be3 [Wenchen Fan] fix style b0cbe2e [Wenchen Fan] remove the createCode and createStructCode, and replace the usage of them by createStructCode (cherry picked from commit f4b1ac08a1327e6d0ddc317cdf3997a0f68dec72) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/43f6b021 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/43f6b021 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/43f6b021 Branch: refs/heads/branch-1.5 Commit: 43f6b021e5f14b9126e4291f989a076085367c2c Parents: be37b1b Author: Wenchen Fan Authored: Tue Aug 4 14:40:46 2015 -0700 Committer: Reynold Xin Committed: Tue Aug 4 14:40:53 2015 -0700 -- .../codegen/GenerateUnsafeProjection.scala | 161 ++- .../expressions/complexTypeCreator.scala| 10 +- 2 files changed, 17 insertions(+), 154 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/43f6b021/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala index fc3ecf5..71f8ea0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala @@ -117,161 +117,13 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro } /** - * Generates the code to create an [[UnsafeRow]] object based on the input expressions. 
- * @param ctx context for code generation - * @param ev specifies the name of the variable for the output [[UnsafeRow]] object - * @param expressions input expressions - * @return generated code to put the expression output into an [[UnsafeRow]] - */ - def createCode(ctx: CodeGenContext, ev: GeneratedExpressionCode, expressions: Seq[Expression]) -: String = { - -val ret = ev.primitive -ctx.addMutableState("UnsafeRow", ret, s"$ret = new UnsafeRow();") -val buffer = ctx.freshName("buffer") -ctx.addMutableState("byte[]", buffer, s"$buffer = new byte[64];") -val cursor = ctx.freshName("cursor") -val numBytes = ctx.freshName("numBytes") - -val exprs = expressions.map { e => e.dataType match { - case st: StructType => createCodeForStruct(ctx, e.gen(ctx), st) - case _ => e.gen(ctx) -}} -val allExprs = exprs.map(_.code).mkString("\n") - -val fixedSize = 8 * exprs.length + UnsafeRow.calculateBitSetWidthInBytes(exprs.length) -val additionalSize = expressions.zipWithIndex.map { - case (e, i) => genAdditionalSize(e.dataType, exprs(i)) -}.mkString("") - -val writers = expressions.zipWithIndex.map { case (e, i) => - val update = genFieldWriter(ctx, e.dataType, exprs(i), ret, i, cursor) - s"""if (${exprs(i).isNull}) { -$ret.setNullAt($i); - } else { -$update; - }""" -}.mkString("\n ") - -s""" - $allExprs - int $numBytes = $fixedSize $additionalSize; - if ($numBytes > $buffer.length) { -$buffer = new byte[$numBytes]; - } - - $ret.pointTo( -$buffer, -$PlatformDependent.BYTE_ARRAY_OFFSET, -${expressions.size}, -$numBytes); - int $cursor = $fixedSize; - - $writers - boolean ${ev.isNull} = false; - """ - } - - /** - * Generates the Java code to convert a struct (backed by InternalRow) to UnsafeRow. - * - * This function also handles nested structs by recursively generating the code to do conversion. - * - * @param ctx code generation context - * @param input the input struct, identified by a [[GeneratedExpressionCode]] - * @param schema schema of the struct field - */ - // TODO: refactor createCode and this function to reduce code duplication. - private def createCodeForStruct( - ctx: CodeGenContext, - input: GeneratedExpressionCode, - schema: StructType): GeneratedExpressionCode = { - -val isNull = input.isNull -val primitive = ctx.freshName("structConvert") -ctx.addMutableState("UnsafeRow", primitive, s"$primitive = new UnsafeRow();") -val buffer = ctx.freshName("buffer") -ctx.addMuta
spark git commit: [SPARK-9606] [SQL] Ignore flaky thrift server tests
Repository: spark Updated Branches: refs/heads/branch-1.5 c5250ddc5 -> be37b1bd3 [SPARK-9606] [SQL] Ignore flaky thrift server tests Author: Michael Armbrust Closes #7939 from marmbrus/turnOffThriftTests and squashes the following commits: 80d618e [Michael Armbrust] [SPARK-9606][SQL] Ignore flaky thrift server tests (cherry picked from commit a0cc01759b0c2cecf340c885d391976eb4e3fad6) Signed-off-by: Michael Armbrust Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/be37b1bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/be37b1bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/be37b1bd Branch: refs/heads/branch-1.5 Commit: be37b1bd3edd8583180dc1a41ecf4d80990216c7 Parents: c5250dd Author: Michael Armbrust Authored: Tue Aug 4 12:19:52 2015 -0700 Committer: Michael Armbrust Committed: Tue Aug 4 12:20:14 2015 -0700 -- .../spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala| 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/be37b1bd/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala -- diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala index 8374629..17e7044 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala @@ -38,7 +38,7 @@ import org.apache.hive.service.cli.thrift.TCLIService.Client import org.apache.hive.service.cli.thrift.ThriftCLIServiceClient import org.apache.thrift.protocol.TBinaryProtocol import org.apache.thrift.transport.TSocket -import org.scalatest.BeforeAndAfterAll +import org.scalatest.{Ignore, BeforeAndAfterAll} import org.apache.spark.{Logging, SparkFunSuite} import org.apache.spark.sql.hive.HiveContext @@ -53,6 +53,7 @@ object TestData { val smallKvWithNull = getTestDataFilePath("small_kv_with_null.txt") } +@Ignore // SPARK-9606 class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { override def mode: ServerMode.Value = ServerMode.binary @@ -379,6 +380,7 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { } } +@Ignore // SPARK-9606 class HiveThriftHttpServerSuite extends HiveThriftJdbcTest { override def mode: ServerMode.Value = ServerMode.http
spark git commit: [SPARK-9606] [SQL] Ignore flaky thrift server tests
Repository: spark Updated Branches: refs/heads/master 5a23213c1 -> a0cc01759 [SPARK-9606] [SQL] Ignore flaky thrift server tests Author: Michael Armbrust Closes #7939 from marmbrus/turnOffThriftTests and squashes the following commits: 80d618e [Michael Armbrust] [SPARK-9606][SQL] Ignore flaky thrift server tests Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0cc0175 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0cc0175 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0cc0175 Branch: refs/heads/master Commit: a0cc01759b0c2cecf340c885d391976eb4e3fad6 Parents: 5a23213 Author: Michael Armbrust Authored: Tue Aug 4 12:19:52 2015 -0700 Committer: Michael Armbrust Committed: Tue Aug 4 12:19:52 2015 -0700 -- .../spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala| 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a0cc0175/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala -- diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala index 8374629..17e7044 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala @@ -38,7 +38,7 @@ import org.apache.hive.service.cli.thrift.TCLIService.Client import org.apache.hive.service.cli.thrift.ThriftCLIServiceClient import org.apache.thrift.protocol.TBinaryProtocol import org.apache.thrift.transport.TSocket -import org.scalatest.BeforeAndAfterAll +import org.scalatest.{Ignore, BeforeAndAfterAll} import org.apache.spark.{Logging, SparkFunSuite} import org.apache.spark.sql.hive.HiveContext @@ -53,6 +53,7 @@ object TestData { val smallKvWithNull = getTestDataFilePath("small_kv_with_null.txt") } +@Ignore // SPARK-9606 class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { override def mode: ServerMode.Value = ServerMode.binary @@ -379,6 +380,7 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { } } +@Ignore // SPARK-9606 class HiveThriftHttpServerSuite extends HiveThriftJdbcTest { override def mode: ServerMode.Value = ServerMode.http
spark git commit: [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier
Repository: spark Updated Branches: refs/heads/master 34a0eb2e8 -> 5a23213c1 [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier This PR replaces the old "threshold" with a generalized "thresholds" Param. We keep getThreshold,setThreshold for backwards compatibility for binary classification. Note that the primary author of this PR is holdenk Author: Holden Karau Author: Joseph K. Bradley Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits: 3952977 [Joseph K. Bradley] fixed pyspark doc test 85febc8 [Joseph K. Bradley] made python unit tests a little more robust 7eb1d86 [Joseph K. Bradley] small cleanups 6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues. 0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests 7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc. 6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests 25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression c02d6c0 [Holden Karau] No default for thresholds 5e43628 [Holden Karau] CR feedback and fixed the renamed test f3fbbd1 [Holden Karau] revert the changes to random forest :( 51f581c [Holden Karau] Add explicit types to public methods, fix long line f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic 398078a [Holden Karau] move the thresholding around a bunch based on the design doc 4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok) 638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test e09919c [Holden Karau] Fix return type, I need more coffee 8d92cac [Holden Karau] Use ClassifierParams as the head 3456ed3 [Holden Karau] Add explicit return types even though just test a0f3b0c [Holden Karau] scala style fixes 6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now ffc8dab [Holden Karau] Update the sharedParams 0420290 [Holden Karau] Allow us to override the get methods selectively 978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions 1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there" 1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there efb9084 [Holden Karau] move setThresholds only to where its used 6b34809 [Holden Karau] Add a test with thresholding for the RFCS 74f54c3 [Holden Karau] Fix creation of vote array 1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down. 2f44b18 [Holden Karau] Add a global default of null for thresholds param f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds" 634b06f [Holden Karau] Some progress towards unifying threshold and thresholds 85c9e01 [Holden Karau] Test passes again... 
little fnur 099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer) 0f46836 [Holden Karau] Start adding a classifiersuite f70eb5e [Holden Karau] Fix test compile issues a7d59c8 [Holden Karau] Move thresholding into Classifier trait 5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test) 1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation 31d6bf2 [Holden Karau] Start threading the threshold info through 0ef228c [Holden Karau] Add hasthresholds Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5a23213c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5a23213c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5a23213c Branch: refs/heads/master Commit: 5a23213c148bfe362514f9c71f5273ebda0a848a Parents: 34a0eb2 Author: Holden Karau Authored: Tue Aug 4 10:12:22 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 10:12:22 2015 -0700 -- .../examples/ml/JavaSimpleParamsExample.java| 3 +- .../src/main/python/ml/simple_params_example.py | 2 +- .../spark/examples/ml/SimpleParamsExample.scala | 2 +- .../spark/ml/classification/Classifier.scala| 3 +- .../ml/classification/LogisticRegression.scala | 47 +++-- .../ProbabilisticCla
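A minimal sketch of the thresholding rule this change generalizes, assuming the convention that each class probability is divided by its threshold and the largest scaled value wins (the helper name and plain arrays here are illustrative, not the committed API):

    // Sketch only: a class with a lower threshold needs less probability mass to win.
    def predictWithThresholds(probabilities: Array[Double], thresholds: Array[Double]): Int = {
      require(probabilities.length == thresholds.length, "one threshold per class")
      val scaled = probabilities.zip(thresholds).map { case (p, t) => p / t }
      scaled.indexOf(scaled.max)
    }

    // Class 2 is predicted even though class 0 has the highest raw probability.
    predictWithThresholds(Array(0.5, 0.3, 0.2), Array(0.6, 0.6, 0.2))  // returns 2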
spark git commit: [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier
Repository: spark Updated Branches: refs/heads/branch-1.5 a9277cd5a -> c5250ddc5 [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier This PR replaces the old "threshold" with a generalized "thresholds" Param. We keep getThreshold,setThreshold for backwards compatibility for binary classification. Note that the primary author of this PR is holdenk Author: Holden Karau Author: Joseph K. Bradley Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits: 3952977 [Joseph K. Bradley] fixed pyspark doc test 85febc8 [Joseph K. Bradley] made python unit tests a little more robust 7eb1d86 [Joseph K. Bradley] small cleanups 6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues. 0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests 7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc. 6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests 25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression c02d6c0 [Holden Karau] No default for thresholds 5e43628 [Holden Karau] CR feedback and fixed the renamed test f3fbbd1 [Holden Karau] revert the changes to random forest :( 51f581c [Holden Karau] Add explicit types to public methods, fix long line f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic 398078a [Holden Karau] move the thresholding around a bunch based on the design doc 4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok) 638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test e09919c [Holden Karau] Fix return type, I need more coffee 8d92cac [Holden Karau] Use ClassifierParams as the head 3456ed3 [Holden Karau] Add explicit return types even though just test a0f3b0c [Holden Karau] scala style fixes 6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now ffc8dab [Holden Karau] Update the sharedParams 0420290 [Holden Karau] Allow us to override the get methods selectively 978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions 1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there" 1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there efb9084 [Holden Karau] move setThresholds only to where its used 6b34809 [Holden Karau] Add a test with thresholding for the RFCS 74f54c3 [Holden Karau] Fix creation of vote array 1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down. 2f44b18 [Holden Karau] Add a global default of null for thresholds param f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds" 634b06f [Holden Karau] Some progress towards unifying threshold and thresholds 85c9e01 [Holden Karau] Test passes again... 
little fnur 099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer) 0f46836 [Holden Karau] Start adding a classifiersuite f70eb5e [Holden Karau] Fix test compile issues a7d59c8 [Holden Karau] Move thresholding into Classifier trait 5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test) 1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation 31d6bf2 [Holden Karau] Start threading the threshold info through 0ef228c [Holden Karau] Add hasthresholds (cherry picked from commit 5a23213c148bfe362514f9c71f5273ebda0a848a) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c5250ddc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c5250ddc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c5250ddc Branch: refs/heads/branch-1.5 Commit: c5250ddc5242a071549e980f69fa8bd785168979 Parents: a9277cd Author: Holden Karau Authored: Tue Aug 4 10:12:22 2015 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 4 10:12:33 2015 -0700 -- .../examples/ml/JavaSimpleParamsExample.java| 3 +- .../src/main/python/ml/simple_params_example.py | 2 +- .../spark/examples/ml/SimpleParamsExample.scala | 2 +- .../spark/ml/classification/Class
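For the binary backwards-compatibility path mentioned above, a single threshold t can be read as the two-element vector Array(1 - t, t); the sketch below assumes that mapping and only sets Params, it does not fit a model:

    import org.apache.spark.ml.classification.LogisticRegression

    // Old binary-only API, kept for compatibility.
    val lrOld = new LogisticRegression().setThreshold(0.6)

    // Generalized Param expressing the same intent: predict the positive class
    // only when its probability clears 0.6 (assumed equivalence, see note above).
    val lrNew = new LogisticRegression().setThresholds(Array(0.4, 0.6))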
spark git commit: [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting
Repository: spark Updated Branches: refs/heads/branch-1.5 aa8390dfc -> a9277cd5a [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting The analysis rule has a bug and we ended up making the sorter still capable of doing evaluation, so lets revert this for now. Author: Michael Armbrust Closes #7906 from marmbrus/revertSortProjection and squashes the following commits: 2da6972 [Michael Armbrust] unrevert unrelated changes 4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by expressions which still need evaluation" (cherry picked from commit 34a0eb2e89d59b0823efc035ddf2dc93f19540c1) Signed-off-by: Michael Armbrust Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a9277cd5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a9277cd5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a9277cd5 Branch: refs/heads/branch-1.5 Commit: a9277cd5aedd570f550e2a807768c8ffada9576f Parents: aa8390d Author: Michael Armbrust Authored: Tue Aug 4 10:07:53 2015 -0700 Committer: Michael Armbrust Committed: Tue Aug 4 10:08:03 2015 -0700 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 58 .../catalyst/plans/logical/basicOperators.scala | 13 ++--- .../sql/catalyst/analysis/AnalysisSuite.scala | 36 ++-- 3 files changed, 8 insertions(+), 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a9277cd5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index f5daba1..ca17f3e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -79,7 +79,6 @@ class Analyzer( ExtractWindowExpressions :: GlobalAggregates :: UnresolvedHavingClauseAttributes :: - RemoveEvaluationFromSort :: HiveTypeCoercion.typeCoercionRules ++ extendedResolutionRules : _*), Batch("Nondeterministic", Once, @@ -955,63 +954,6 @@ class Analyzer( Project(p.output, newPlan.withNewChildren(newChild :: Nil)) } } - - /** - * Removes all still-need-evaluate ordering expressions from sort and use an inner project to - * materialize them, finally use a outer project to project them away to keep the result same. - * Then we can make sure we only sort by [[AttributeReference]]s. - * - * As an example, - * {{{ - * Sort('a, 'b + 1, - * Relation('a, 'b)) - * }}} - * will be turned into: - * {{{ - * Project('a, 'b, - * Sort('a, '_sortCondition, - * Project('a, 'b, ('b + 1).as("_sortCondition"), - * Relation('a, 'b - * }}} - */ - object RemoveEvaluationFromSort extends Rule[LogicalPlan] { -private def hasAlias(expr: Expression) = { - expr.find { -case a: Alias => true -case _ => false - }.isDefined -} - -override def apply(plan: LogicalPlan): LogicalPlan = plan transform { - // The ordering expressions have no effect to the output schema of `Sort`, - // so `Alias`s in ordering expressions are unnecessary and we should remove them. 
- case s @ Sort(ordering, _, _) if ordering.exists(hasAlias) => -val newOrdering = ordering.map(_.transformUp { - case Alias(child, _) => child -}.asInstanceOf[SortOrder]) -s.copy(order = newOrdering) - - case s @ Sort(ordering, global, child) -if s.expressions.forall(_.resolved) && s.childrenResolved && !s.hasNoEvaluation => - -val (ref, needEval) = ordering.partition(_.child.isInstanceOf[AttributeReference]) - -val namedExpr = needEval.map(_.child match { - case n: NamedExpression => n - case e => Alias(e, "_sortCondition")() -}) - -val newOrdering = ref ++ needEval.zip(namedExpr).map { case (order, ne) => - order.copy(child = ne.toAttribute) -} - -// Add still-need-evaluate ordering expressions into inner project and then project -// them away after the sort. -Project(child.output, - Sort(newOrdering, global, -Project(child.output ++ namedExpr, child))) -} - } } /** http://git-wip-us.apache.org/repos/asf/spark/blob/a9277cd5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scal
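To see the kind of query the reverted rule targeted, the hedged spark-shell sketch below (hypothetical data, sqlContext assumed in scope) orders by an expression that still needs evaluation rather than by a bare column; with the rule removed, the Sort operator evaluates b + 1 itself instead of reading it from an inner Project:

    // Hypothetical DataFrame; the point is the ordering expression b + 1.
    val df = sqlContext.createDataFrame(Seq((1, 10), (2, 5))).toDF("a", "b")

    // Before this revert the analyzer rewrote the plan to
    //   Project(a, b, Sort(_sortCondition, Project(a, b, (b + 1) AS _sortCondition, ...)))
    // as described in the removed scaladoc above.
    df.sort(df("b") + 1).explain(true)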
spark git commit: [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting
Repository: spark Updated Branches: refs/heads/master 6a0f8b994 -> 34a0eb2e8 [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting The analysis rule has a bug and we ended up making the sorter still capable of doing evaluation, so lets revert this for now. Author: Michael Armbrust Closes #7906 from marmbrus/revertSortProjection and squashes the following commits: 2da6972 [Michael Armbrust] unrevert unrelated changes 4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by expressions which still need evaluation" Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34a0eb2e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34a0eb2e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34a0eb2e Branch: refs/heads/master Commit: 34a0eb2e89d59b0823efc035ddf2dc93f19540c1 Parents: 6a0f8b9 Author: Michael Armbrust Authored: Tue Aug 4 10:07:53 2015 -0700 Committer: Michael Armbrust Committed: Tue Aug 4 10:07:53 2015 -0700 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 58 .../catalyst/plans/logical/basicOperators.scala | 13 ++--- .../sql/catalyst/analysis/AnalysisSuite.scala | 36 ++-- 3 files changed, 8 insertions(+), 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34a0eb2e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index f5daba1..ca17f3e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -79,7 +79,6 @@ class Analyzer( ExtractWindowExpressions :: GlobalAggregates :: UnresolvedHavingClauseAttributes :: - RemoveEvaluationFromSort :: HiveTypeCoercion.typeCoercionRules ++ extendedResolutionRules : _*), Batch("Nondeterministic", Once, @@ -955,63 +954,6 @@ class Analyzer( Project(p.output, newPlan.withNewChildren(newChild :: Nil)) } } - - /** - * Removes all still-need-evaluate ordering expressions from sort and use an inner project to - * materialize them, finally use a outer project to project them away to keep the result same. - * Then we can make sure we only sort by [[AttributeReference]]s. - * - * As an example, - * {{{ - * Sort('a, 'b + 1, - * Relation('a, 'b)) - * }}} - * will be turned into: - * {{{ - * Project('a, 'b, - * Sort('a, '_sortCondition, - * Project('a, 'b, ('b + 1).as("_sortCondition"), - * Relation('a, 'b - * }}} - */ - object RemoveEvaluationFromSort extends Rule[LogicalPlan] { -private def hasAlias(expr: Expression) = { - expr.find { -case a: Alias => true -case _ => false - }.isDefined -} - -override def apply(plan: LogicalPlan): LogicalPlan = plan transform { - // The ordering expressions have no effect to the output schema of `Sort`, - // so `Alias`s in ordering expressions are unnecessary and we should remove them. 
- case s @ Sort(ordering, _, _) if ordering.exists(hasAlias) => -val newOrdering = ordering.map(_.transformUp { - case Alias(child, _) => child -}.asInstanceOf[SortOrder]) -s.copy(order = newOrdering) - - case s @ Sort(ordering, global, child) -if s.expressions.forall(_.resolved) && s.childrenResolved && !s.hasNoEvaluation => - -val (ref, needEval) = ordering.partition(_.child.isInstanceOf[AttributeReference]) - -val namedExpr = needEval.map(_.child match { - case n: NamedExpression => n - case e => Alias(e, "_sortCondition")() -}) - -val newOrdering = ref ++ needEval.zip(namedExpr).map { case (order, ne) => - order.copy(child = ne.toAttribute) -} - -// Add still-need-evaluate ordering expressions into inner project and then project -// them away after the sort. -Project(child.output, - Sort(newOrdering, global, -Project(child.output ++ namedExpr, child))) -} - } } /** http://git-wip-us.apache.org/repos/asf/spark/blob/34a0eb2e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala index aacfc8
spark git commit: [SPARK-9562] Change reference to amplab/spark-ec2 from mesos/
Repository: spark Updated Branches: refs/heads/master b5034c9c5 -> 6a0f8b994 [SPARK-9562] Change reference to amplab/spark-ec2 from mesos/ cc srowen pwendell nchammas Author: Shivaram Venkataraman Closes #7899 from shivaram/spark-ec2-move and squashes the following commits: 7cc22c9 [Shivaram Venkataraman] Change reference to amplab/spark-ec2 from mesos/ Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a0f8b99 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a0f8b99 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a0f8b99 Branch: refs/heads/master Commit: 6a0f8b994de36b7a7bdfb9958d39dbd011776107 Parents: b5034c9 Author: Shivaram Venkataraman Authored: Tue Aug 4 09:40:07 2015 -0700 Committer: Shivaram Venkataraman Committed: Tue Aug 4 09:40:07 2015 -0700 -- ec2/spark_ec2.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6a0f8b99/ec2/spark_ec2.py -- diff --git a/ec2/spark_ec2.py b/ec2/spark_ec2.py index ccf922d..11fd7ee 100755 --- a/ec2/spark_ec2.py +++ b/ec2/spark_ec2.py @@ -90,7 +90,7 @@ DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION DEFAULT_SPARK_GITHUB_REPO = "https://github.com/apache/spark"; # Default location to get the spark-ec2 scripts (and ami-list) from -DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/mesos/spark-ec2"; +DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/amplab/spark-ec2"; DEFAULT_SPARK_EC2_BRANCH = "branch-1.4"
spark git commit: [SPARK-9562] Change reference to amplab/spark-ec2 from mesos/
Repository: spark Updated Branches: refs/heads/branch-1.5 d875368ed -> aa8390dfc [SPARK-9562] Change reference to amplab/spark-ec2 from mesos/ cc srowen pwendell nchammas Author: Shivaram Venkataraman Closes #7899 from shivaram/spark-ec2-move and squashes the following commits: 7cc22c9 [Shivaram Venkataraman] Change reference to amplab/spark-ec2 from mesos/ (cherry picked from commit 6a0f8b994de36b7a7bdfb9958d39dbd011776107) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa8390df Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa8390df Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa8390df Branch: refs/heads/branch-1.5 Commit: aa8390dfcbb45eeff3d5894cf9b2edbd245b7320 Parents: d875368 Author: Shivaram Venkataraman Authored: Tue Aug 4 09:40:07 2015 -0700 Committer: Shivaram Venkataraman Committed: Tue Aug 4 09:40:24 2015 -0700 -- ec2/spark_ec2.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aa8390df/ec2/spark_ec2.py -- diff --git a/ec2/spark_ec2.py b/ec2/spark_ec2.py index ccf922d..11fd7ee 100755 --- a/ec2/spark_ec2.py +++ b/ec2/spark_ec2.py @@ -90,7 +90,7 @@ DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION DEFAULT_SPARK_GITHUB_REPO = "https://github.com/apache/spark"; # Default location to get the spark-ec2 scripts (and ami-list) from -DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/mesos/spark-ec2"; +DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/amplab/spark-ec2"; DEFAULT_SPARK_EC2_BRANCH = "branch-1.4"
spark git commit: [SPARK-9541] [SQL] DateTimeUtils cleanup
Repository: spark Updated Branches: refs/heads/branch-1.5 b42e13dca -> d875368ed [SPARK-9541] [SQL] DataTimeUtils cleanup JIRA: https://issues.apache.org/jira/browse/SPARK-9541 Author: Yijie Shen Closes #7870 from yjshen/datetime_cleanup and squashes the following commits: 9203e33 [Yijie Shen] revert getMonth & getDayOfMonth 5cad119 [Yijie Shen] rebase code 7d62a74 [Yijie Shen] remove tmp tuple inside split date e98aaac [Yijie Shen] DataTimeUtils cleanup (cherry picked from commit b5034c9c59947f20423faa46bc6606aad56836b0) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d875368e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d875368e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d875368e Branch: refs/heads/branch-1.5 Commit: d875368edd7265cedf808c921c0af0deb4895a67 Parents: b42e13d Author: Yijie Shen Authored: Tue Aug 4 09:09:52 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 09:10:02 2015 -0700 -- .../spark/sql/catalyst/util/DateTimeUtils.scala | 142 --- 1 file changed, 57 insertions(+), 85 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d875368e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index f645eb5..063940c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -494,6 +494,50 @@ object DateTimeUtils { } /** + * Split date (expressed in days since 1.1.1970) into four fields: + * year, month (Jan is Month 1), dayInMonth, daysToMonthEnd (0 if it's last day of month). + */ + def splitDate(date: Int): (Int, Int, Int, Int) = { +var (year, dayInYear) = getYearAndDayInYear(date) +val isLeap = isLeapYear(year) +if (isLeap && dayInYear == 60) { + (year, 2, 29, 0) +} else { + if (isLeap && dayInYear > 60) dayInYear -= 1 + + if (dayInYear <= 181) { +if (dayInYear <= 31) { + (year, 1, dayInYear, 31 - dayInYear) +} else if (dayInYear <= 59) { + (year, 2, dayInYear - 31, if (isLeap) 60 - dayInYear else 59 - dayInYear) +} else if (dayInYear <= 90) { + (year, 3, dayInYear - 59, 90 - dayInYear) +} else if (dayInYear <= 120) { + (year, 4, dayInYear - 90, 120 - dayInYear) +} else if (dayInYear <= 151) { + (year, 5, dayInYear - 120, 151 - dayInYear) +} else { + (year, 6, dayInYear - 151, 181 - dayInYear) +} + } else { +if (dayInYear <= 212) { + (year, 7, dayInYear - 181, 212 - dayInYear) +} else if (dayInYear <= 243) { + (year, 8, dayInYear - 212, 243 - dayInYear) +} else if (dayInYear <= 273) { + (year, 9, dayInYear - 243, 273 - dayInYear) +} else if (dayInYear <= 304) { + (year, 10, dayInYear - 273, 304 - dayInYear) +} else if (dayInYear <= 334) { + (year, 11, dayInYear - 304, 334 - dayInYear) +} else { + (year, 12, dayInYear - 334, 365 - dayInYear) +} + } +} + } + + /** * Returns the month value for the given date. The date is expressed in days * since 1.1.1970. January is month 1. */ @@ -613,15 +657,16 @@ object DateTimeUtils { * Returns a date value, expressed in days since 1.1.1970. 
*/ def dateAddMonths(days: Int, months: Int): Int = { -val absoluteMonth = (getYear(days) - YearZero) * 12 + getMonth(days) - 1 + months +val (year, monthInYear, dayOfMonth, daysToMonthEnd) = splitDate(days) +val absoluteMonth = (year - YearZero) * 12 + monthInYear - 1 + months val nonNegativeMonth = if (absoluteMonth >= 0) absoluteMonth else 0 val currentMonthInYear = nonNegativeMonth % 12 val currentYear = nonNegativeMonth / 12 + val leapDay = if (currentMonthInYear == 1 && isLeapYear(currentYear + YearZero)) 1 else 0 val lastDayOfMonth = monthDays(currentMonthInYear) + leapDay -val dayOfMonth = getDayOfMonth(days) -val currentDayInMonth = if (getDayOfMonth(days + 1) == 1 || dayOfMonth >= lastDayOfMonth) { +val currentDayInMonth = if (daysToMonthEnd == 0 || dayOfMonth >= lastDayOfMonth) { // last day of the month lastDayOfMonth } else { @@ -641,46 +686,6 @@ object DateTimeUtils { } /** - * Returns the last dayInMonth in the month it belongs to. The date is expressed - * in days since 1.1.1970. the return value starts from
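The new splitDate helper packs four fields into one tuple. As a self-contained cross-check of what those fields mean (using java.time rather than the catalyst code, so the function name here is illustrative):

    import java.time.LocalDate

    // Mirrors the (year, month, dayInMonth, daysToMonthEnd) quadruple described above.
    def splitDateReference(daysSinceEpoch: Int): (Int, Int, Int, Int) = {
      val d = LocalDate.ofEpochDay(daysSinceEpoch.toLong)
      (d.getYear, d.getMonthValue, d.getDayOfMonth, d.lengthOfMonth - d.getDayOfMonth)
    }

    splitDateReference(0)   // (1970, 1, 1, 30): 1970-01-01, 30 days left in January
    splitDateReference(58)  // (1970, 2, 28, 0): last day of a non-leap February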
spark git commit: [SPARK-9541] [SQL] DateTimeUtils cleanup
Repository: spark Updated Branches: refs/heads/master 73dedb589 -> b5034c9c5 [SPARK-9541] [SQL] DataTimeUtils cleanup JIRA: https://issues.apache.org/jira/browse/SPARK-9541 Author: Yijie Shen Closes #7870 from yjshen/datetime_cleanup and squashes the following commits: 9203e33 [Yijie Shen] revert getMonth & getDayOfMonth 5cad119 [Yijie Shen] rebase code 7d62a74 [Yijie Shen] remove tmp tuple inside split date e98aaac [Yijie Shen] DataTimeUtils cleanup Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5034c9c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5034c9c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5034c9c Branch: refs/heads/master Commit: b5034c9c59947f20423faa46bc6606aad56836b0 Parents: 73dedb5 Author: Yijie Shen Authored: Tue Aug 4 09:09:52 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 09:09:52 2015 -0700 -- .../spark/sql/catalyst/util/DateTimeUtils.scala | 142 --- 1 file changed, 57 insertions(+), 85 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b5034c9c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index f645eb5..063940c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -494,6 +494,50 @@ object DateTimeUtils { } /** + * Split date (expressed in days since 1.1.1970) into four fields: + * year, month (Jan is Month 1), dayInMonth, daysToMonthEnd (0 if it's last day of month). + */ + def splitDate(date: Int): (Int, Int, Int, Int) = { +var (year, dayInYear) = getYearAndDayInYear(date) +val isLeap = isLeapYear(year) +if (isLeap && dayInYear == 60) { + (year, 2, 29, 0) +} else { + if (isLeap && dayInYear > 60) dayInYear -= 1 + + if (dayInYear <= 181) { +if (dayInYear <= 31) { + (year, 1, dayInYear, 31 - dayInYear) +} else if (dayInYear <= 59) { + (year, 2, dayInYear - 31, if (isLeap) 60 - dayInYear else 59 - dayInYear) +} else if (dayInYear <= 90) { + (year, 3, dayInYear - 59, 90 - dayInYear) +} else if (dayInYear <= 120) { + (year, 4, dayInYear - 90, 120 - dayInYear) +} else if (dayInYear <= 151) { + (year, 5, dayInYear - 120, 151 - dayInYear) +} else { + (year, 6, dayInYear - 151, 181 - dayInYear) +} + } else { +if (dayInYear <= 212) { + (year, 7, dayInYear - 181, 212 - dayInYear) +} else if (dayInYear <= 243) { + (year, 8, dayInYear - 212, 243 - dayInYear) +} else if (dayInYear <= 273) { + (year, 9, dayInYear - 243, 273 - dayInYear) +} else if (dayInYear <= 304) { + (year, 10, dayInYear - 273, 304 - dayInYear) +} else if (dayInYear <= 334) { + (year, 11, dayInYear - 304, 334 - dayInYear) +} else { + (year, 12, dayInYear - 334, 365 - dayInYear) +} + } +} + } + + /** * Returns the month value for the given date. The date is expressed in days * since 1.1.1970. January is month 1. */ @@ -613,15 +657,16 @@ object DateTimeUtils { * Returns a date value, expressed in days since 1.1.1970. 
*/ def dateAddMonths(days: Int, months: Int): Int = { -val absoluteMonth = (getYear(days) - YearZero) * 12 + getMonth(days) - 1 + months +val (year, monthInYear, dayOfMonth, daysToMonthEnd) = splitDate(days) +val absoluteMonth = (year - YearZero) * 12 + monthInYear - 1 + months val nonNegativeMonth = if (absoluteMonth >= 0) absoluteMonth else 0 val currentMonthInYear = nonNegativeMonth % 12 val currentYear = nonNegativeMonth / 12 + val leapDay = if (currentMonthInYear == 1 && isLeapYear(currentYear + YearZero)) 1 else 0 val lastDayOfMonth = monthDays(currentMonthInYear) + leapDay -val dayOfMonth = getDayOfMonth(days) -val currentDayInMonth = if (getDayOfMonth(days + 1) == 1 || dayOfMonth >= lastDayOfMonth) { +val currentDayInMonth = if (daysToMonthEnd == 0 || dayOfMonth >= lastDayOfMonth) { // last day of the month lastDayOfMonth } else { @@ -641,46 +686,6 @@ object DateTimeUtils { } /** - * Returns the last dayInMonth in the month it belongs to. The date is expressed - * in days since 1.1.1970. the return value starts from 1. - */ - private def getLastDayInMonthOfMonth(date: Int): Int = { -var (year, dayInYear) = getYe
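The reworked dateAddMonths keeps the last-day-of-month rule visible in the diff: when the start date is the last day of its month, or its day number does not exist in the target month, the result is clamped to the last day of the target month. A hedged reference sketch of that rule, built on java.time rather than the catalyst utilities:

    import java.time.LocalDate

    // plusMonths already clamps overflowing day numbers; the extra branch handles
    // the "last day maps to last day" case described above.
    def addMonthsSketch(start: LocalDate, months: Int): LocalDate = {
      val shifted = start.plusMonths(months.toLong)
      if (start.getDayOfMonth == start.lengthOfMonth) shifted.withDayOfMonth(shifted.lengthOfMonth)
      else shifted
    }

    addMonthsSketch(LocalDate.of(2015, 1, 31), 1)  // 2015-02-28 (day 31 clamped)
    addMonthsSketch(LocalDate.of(2015, 2, 28), 1)  // 2015-03-31 (last day maps to last day)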
spark git commit: [SPARK-8246] [SQL] Implement get_json_object
Repository: spark Updated Branches: refs/heads/master b1f88a38d -> 73dedb589 [SPARK-8246] [SQL] Implement get_json_object This is based on #7485 , thanks to NathanHowell Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead. Author: Davies Liu Author: Yin Huai Author: Nathan Howell Closes #7901 from davies/get_json_object and squashes the following commits: 3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object 98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala c818519 [Yin Huai] new results. 18ce26b [Davies Liu] fix tests 6ac29fb [Yin Huai] Golden files. 25eebef [Davies Liu] use HiveQuerySuite e0ac6ec [Yin Huai] Golden answer files. 940c060 [Davies Liu] tweat code style 44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object 9192d09 [Nathan Howell] Match Hiveâs behavior for unwrapping arrays of one element 8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/73dedb58 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/73dedb58 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/73dedb58 Branch: refs/heads/master Commit: 73dedb589d06f7c7a525cc4f07721a77f480c434 Parents: b1f88a3 Author: Davies Liu Authored: Tue Aug 4 09:07:09 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 09:07:09 2015 -0700 -- .../catalyst/analysis/FunctionRegistry.scala| 1 + .../catalyst/expressions/jsonFunctions.scala| 309 +++ .../expressions/JsonExpressionsSuite.scala | 202 .../apache/spark/sql/JsonFunctionsSuite.scala | 32 ++ .../hive/execution/HiveCompatibilitySuite.scala | 3 + .../apache/spark/sql/hive/test/TestHive.scala | 6 +- ...object #1-0-f01b340b5662c45bb5f1e3b7c6900e1f | 1 + ...bject #10-0-f3f47d06d7c51d493d68112b0bd6c1fc | 1 + ..._object #2-0-e84c2f8136919830fd665a278e4158a | 1 + ...object #3-0-bf140c65c31f8d892ec23e41e16e58bb | 1 + ...object #4-0-f0bd902edc1990c9a6c65a6bb672c4d5 | 1 + ..._object #5-0-3c09f4316a1533049aee8af749cdcab | 1 + ...object #6-0-8334d1ddbe0f41fc7b80d4e6b45409da | 1 + ...object #7-0-40d7dff94b26a2e3f4ab71baee3d3ce0 | 1 + ...object #8-0-180b4b6fdb26011fec05a7ca99fd9844 | 1 + ...object #9-0-47c451a969d856f008f4d6b3d378d94b | 1 + .../sql/hive/execution/HiveQuerySuite.scala | 51 +++ 17 files changed, 613 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/73dedb58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index 6140d1b..43e3e9b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -179,6 +179,7 @@ object FunctionRegistry { expression[Decode]("decode"), expression[FindInSet]("find_in_set"), expression[FormatNumber]("format_number"), 
+expression[GetJsonObject]("get_json_object"), expression[InitCap]("initcap"), expression[Lower]("lcase"), expression[Lower]("lower"), http://git-wip-us.apache.org/repos/asf/spark/blob/73dedb58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala new file mode 100644 index 000..23bfa18 --- /dev/null +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala @@ -0,0 +1,309 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compl
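Since the expression is registered under the SQL name get_json_object, a minimal spark-shell sketch (sqlContext assumed in scope; the JSON literal and path are illustrative) looks like this:

    // Extracts the value at a JSONPath-like path, returned as a string,
    // or NULL when the path does not match.
    sqlContext.sql(
      """SELECT get_json_object('{"store": {"fruit": ["apple", "pear"]}}', '$.store.fruit[0]')"""
    ).collect()
    // expected single-row result containing the string "apple"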
spark git commit: [SPARK-8246] [SQL] Implement get_json_object
Repository: spark Updated Branches: refs/heads/branch-1.5 945da3534 -> b42e13dca [SPARK-8246] [SQL] Implement get_json_object This is based on #7485 , thanks to NathanHowell Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead. Author: Davies Liu Author: Yin Huai Author: Nathan Howell Closes #7901 from davies/get_json_object and squashes the following commits: 3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object 98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala c818519 [Yin Huai] new results. 18ce26b [Davies Liu] fix tests 6ac29fb [Yin Huai] Golden files. 25eebef [Davies Liu] use HiveQuerySuite e0ac6ec [Yin Huai] Golden answer files. 940c060 [Davies Liu] tweat code style 44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object 9192d09 [Nathan Howell] Match Hiveâs behavior for unwrapping arrays of one element 8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object (cherry picked from commit 73dedb589d06f7c7a525cc4f07721a77f480c434) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b42e13dc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b42e13dc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b42e13dc Branch: refs/heads/branch-1.5 Commit: b42e13dca38c6e9ff9cf879bcb52efa681437120 Parents: 945da35 Author: Davies Liu Authored: Tue Aug 4 09:07:09 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 09:07:19 2015 -0700 -- .../catalyst/analysis/FunctionRegistry.scala| 1 + .../catalyst/expressions/jsonFunctions.scala| 309 +++ .../expressions/JsonExpressionsSuite.scala | 202 .../apache/spark/sql/JsonFunctionsSuite.scala | 32 ++ .../hive/execution/HiveCompatibilitySuite.scala | 3 + .../apache/spark/sql/hive/test/TestHive.scala | 6 +- ...object #1-0-f01b340b5662c45bb5f1e3b7c6900e1f | 1 + ...bject #10-0-f3f47d06d7c51d493d68112b0bd6c1fc | 1 + ..._object #2-0-e84c2f8136919830fd665a278e4158a | 1 + ...object #3-0-bf140c65c31f8d892ec23e41e16e58bb | 1 + ...object #4-0-f0bd902edc1990c9a6c65a6bb672c4d5 | 1 + ..._object #5-0-3c09f4316a1533049aee8af749cdcab | 1 + ...object #6-0-8334d1ddbe0f41fc7b80d4e6b45409da | 1 + ...object #7-0-40d7dff94b26a2e3f4ab71baee3d3ce0 | 1 + ...object #8-0-180b4b6fdb26011fec05a7ca99fd9844 | 1 + ...object #9-0-47c451a969d856f008f4d6b3d378d94b | 1 + .../sql/hive/execution/HiveQuerySuite.scala | 51 +++ 17 files changed, 613 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b42e13dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index 6140d1b..43e3e9b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -179,6 +179,7 @@ object FunctionRegistry { 
expression[Decode]("decode"), expression[FindInSet]("find_in_set"), expression[FormatNumber]("format_number"), +expression[GetJsonObject]("get_json_object"), expression[InitCap]("initcap"), expression[Lower]("lcase"), expression[Lower]("lower"), http://git-wip-us.apache.org/repos/asf/spark/blob/b42e13dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala new file mode 100644 index 000..23bfa18 --- /dev/null +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala @@ -0,0 +1,309 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to
spark git commit: [SPARK-8244] [SQL] string function: find in set
Repository: spark Updated Branches: refs/heads/branch-1.5 f44b27a2b -> 945da3534 [SPARK-8244] [SQL] string function: find in set This PR is based on #7186 (just fix the conflict), thanks to tarekauel . find_in_set(string str, string strList): int Returns the first occurance of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. Closes #7186 Author: Tarek Auel Author: Davies Liu Closes #7900 from davies/find_in_set and squashes the following commits: 4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility 1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8 0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244 208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix 71b2e69 [Tarek Auel] [SPARK-8244] find_in_set 66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244 61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison 4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244 e3b20c8 [Tarek Auel] [SPARK-8244] added type check 1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/945da353 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/945da353 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/945da353 Branch: refs/heads/branch-1.5 Commit: 945da3534762a73fe7ffc52c868ff07a0783502b Parents: f44b27a Author: Tarek Auel Authored: Tue Aug 4 08:59:42 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 09:01:58 2015 -0700 -- .../catalyst/analysis/FunctionRegistry.scala| 1 + .../catalyst/expressions/stringOperations.scala | 25 -- .../expressions/StringExpressionsSuite.scala| 10 ++ .../spark/sql/DataFrameFunctionsSuite.scala | 8 + .../apache/spark/unsafe/types/UTF8String.java | 35 ++-- .../spark/unsafe/types/UTF8StringSuite.java | 12 +++ 6 files changed, 87 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/945da353/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index bc08466..6140d1b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -177,6 +177,7 @@ object FunctionRegistry { expression[ConcatWs]("concat_ws"), expression[Encode]("encode"), expression[Decode]("decode"), +expression[FindInSet]("find_in_set"), expression[FormatNumber]("format_number"), expression[InitCap]("initcap"), expression[Lower]("lcase"), http://git-wip-us.apache.org/repos/asf/spark/blob/945da353/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala index 5622529..0cc785d 
100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala @@ -18,8 +18,7 @@ package org.apache.spark.sql.catalyst.expressions import java.text.DecimalFormat -import java.util.Arrays -import java.util.Locale +import java.util.{Arrays, Locale} import java.util.regex.{MatchResult, Pattern} import org.apache.commons.lang3.StringEscapeUtils @@ -351,6 +350,28 @@ case class EndsWith(left: Expression, right: Expression) } /** + * A function that returns the index (1-based) of the given string (left) in the comma- + * delimited list (right). Returns 0, if the string wasn't found or if the given + * string (left) contains a comma. + */ +case class FindInSet(left: Expression, right: Expression) extends BinaryExpression +with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType) + + override protected def nullSafeEva
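The semantics spelled out above (1-based index, NULL when either argument is NULL, 0 when the needle contains a comma or is not found) can be exercised straight from SQL; a short spark-shell sketch with sqlContext assumed in scope:

    sqlContext.sql("SELECT find_in_set('ab', 'abc,b,ab,c,def')").collect()   // [3]
    sqlContext.sql("SELECT find_in_set('ab,c', 'abc,b,ab,c,def')").collect() // [0]: needle has a comma
    sqlContext.sql("SELECT find_in_set('zz', 'abc,b,ab,c,def')").collect()   // [0]: not found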
spark git commit: [SPARK-8244] [SQL] string function: find in set
Repository: spark Updated Branches: refs/heads/master d702d5373 -> b1f88a38d [SPARK-8244] [SQL] string function: find in set This PR is based on #7186 (just fix the conflict), thanks to tarekauel . find_in_set(string str, string strList): int Returns the first occurance of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. Closes #7186 Author: Tarek Auel Author: Davies Liu Closes #7900 from davies/find_in_set and squashes the following commits: 4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility 1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8 0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244 208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix 71b2e69 [Tarek Auel] [SPARK-8244] find_in_set 66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244 61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison 4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244 e3b20c8 [Tarek Auel] [SPARK-8244] added type check 1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b1f88a38 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b1f88a38 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b1f88a38 Branch: refs/heads/master Commit: b1f88a38d53aebe7cabb762cdd2f1cc64726b0b4 Parents: d702d53 Author: Tarek Auel Authored: Tue Aug 4 08:59:42 2015 -0700 Committer: Davies Liu Committed: Tue Aug 4 08:59:42 2015 -0700 -- .../catalyst/analysis/FunctionRegistry.scala| 1 + .../catalyst/expressions/stringOperations.scala | 25 -- .../expressions/StringExpressionsSuite.scala| 10 ++ .../spark/sql/DataFrameFunctionsSuite.scala | 8 + .../apache/spark/unsafe/types/UTF8String.java | 35 ++-- .../spark/unsafe/types/UTF8StringSuite.java | 12 +++ 6 files changed, 87 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b1f88a38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index bc08466..6140d1b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -177,6 +177,7 @@ object FunctionRegistry { expression[ConcatWs]("concat_ws"), expression[Encode]("encode"), expression[Decode]("decode"), +expression[FindInSet]("find_in_set"), expression[FormatNumber]("format_number"), expression[InitCap]("initcap"), expression[Lower]("lcase"), http://git-wip-us.apache.org/repos/asf/spark/blob/b1f88a38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala index 5622529..0cc785d 100644 --- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala @@ -18,8 +18,7 @@ package org.apache.spark.sql.catalyst.expressions import java.text.DecimalFormat -import java.util.Arrays -import java.util.Locale +import java.util.{Arrays, Locale} import java.util.regex.{MatchResult, Pattern} import org.apache.commons.lang3.StringEscapeUtils @@ -351,6 +350,28 @@ case class EndsWith(left: Expression, right: Expression) } /** + * A function that returns the index (1-based) of the given string (left) in the comma- + * delimited list (right). Returns 0, if the string wasn't found or if the given + * string (left) contains a comma. + */ +case class FindInSet(left: Expression, right: Expression) extends BinaryExpression +with ImplicitCastInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType) + + override protected def nullSafeEval(word:
spark git commit: [SPARK-9583] [BUILD] Do not print mvn debug messages to stdout.
Repository: spark Updated Branches: refs/heads/branch-1.5 45c8d2bb8 -> f44b27a2b [SPARK-9583] [BUILD] Do not print mvn debug messages to stdout. This allows build/mvn to be used by make-distribution.sh. Author: Marcelo Vanzin Closes #7915 from vanzin/SPARK-9583 and squashes the following commits: 6469e60 [Marcelo Vanzin] [SPARK-9583] [build] Do not print mvn debug messages to stdout. (cherry picked from commit d702d53732b44e8242448ce5302738bd130717d8) Signed-off-by: Kousuke Saruta Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f44b27a2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f44b27a2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f44b27a2 Branch: refs/heads/branch-1.5 Commit: f44b27a2b92da2325ed9389cd27b6e2cfd9ec486 Parents: 45c8d2b Author: Marcelo Vanzin Authored: Tue Aug 4 22:19:11 2015 +0900 Committer: Kousuke Saruta Committed: Tue Aug 4 22:19:31 2015 +0900 -- build/mvn | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f44b27a2/build/mvn -- diff --git a/build/mvn b/build/mvn index f62f61e..4a1664b 100755 --- a/build/mvn +++ b/build/mvn @@ -51,11 +51,11 @@ install_app() { # check if we have curl installed # download application [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \ - echo "exec: curl ${curl_opts} ${remote_tarball}" && \ + echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \ curl ${curl_opts} "${remote_tarball}" > "${local_tarball}" # if the file still doesn't exist, lets try `wget` and cross our fingers [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \ - echo "exec: wget ${wget_opts} ${remote_tarball}" && \ + echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \ wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}" # if both were unsuccessful, exit [ ! -f "${local_tarball}" ] && \ @@ -146,7 +146,7 @@ fi # Set any `mvn` options if not already present export MAVEN_OPTS=${MAVEN_OPTS:-"$_COMPILE_JVM_OPTS"} -echo "Using \`mvn\` from path: $MVN_BIN" +echo "Using \`mvn\` from path: $MVN_BIN" 1>&2 # Last, call the `mvn` command as usual ${MVN_BIN} "$@"
spark git commit: [SPARK-9583] [BUILD] Do not print mvn debug messages to stdout.
Repository: spark Updated Branches: refs/heads/master cb7fa0aa9 -> d702d5373 [SPARK-9583] [BUILD] Do not print mvn debug messages to stdout. This allows build/mvn to be used by make-distribution.sh. Author: Marcelo Vanzin Closes #7915 from vanzin/SPARK-9583 and squashes the following commits: 6469e60 [Marcelo Vanzin] [SPARK-9583] [build] Do not print mvn debug messages to stdout. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d702d537 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d702d537 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d702d537 Branch: refs/heads/master Commit: d702d53732b44e8242448ce5302738bd130717d8 Parents: cb7fa0a Author: Marcelo Vanzin Authored: Tue Aug 4 22:19:11 2015 +0900 Committer: Kousuke Saruta Committed: Tue Aug 4 22:19:11 2015 +0900 -- build/mvn | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d702d537/build/mvn -- diff --git a/build/mvn b/build/mvn index f62f61e..4a1664b 100755 --- a/build/mvn +++ b/build/mvn @@ -51,11 +51,11 @@ install_app() { # check if we have curl installed # download application [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \ - echo "exec: curl ${curl_opts} ${remote_tarball}" && \ + echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \ curl ${curl_opts} "${remote_tarball}" > "${local_tarball}" # if the file still doesn't exist, lets try `wget` and cross our fingers [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \ - echo "exec: wget ${wget_opts} ${remote_tarball}" && \ + echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \ wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}" # if both were unsuccessful, exit [ ! -f "${local_tarball}" ] && \ @@ -146,7 +146,7 @@ fi # Set any `mvn` options if not already present export MAVEN_OPTS=${MAVEN_OPTS:-"$_COMPILE_JVM_OPTS"} -echo "Using \`mvn\` from path: $MVN_BIN" +echo "Using \`mvn\` from path: $MVN_BIN" 1>&2 # Last, call the `mvn` command as usual ${MVN_BIN} "$@"
spark git commit: [SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page
Repository: spark Updated Branches: refs/heads/branch-1.5 bd9b75213 -> 45c8d2bb8 [SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page Add pagination for the RDD page to avoid unresponsive UI when the number of the RDD partitions is large. Before: ![rddpagebefore](https://cloud.githubusercontent.com/assets/9278199/8951533/3d9add54-3601-11e5-99d0-5653b473c49b.png) After: ![rddpageafter](https://cloud.githubusercontent.com/assets/9278199/8951536/439d66e0-3601-11e5-9cee-1b380fe6620d.png) Author: Carson Wang Closes #7692 from carsonwang/SPARK-2016 and squashes the following commits: 03c7168 [Carson Wang] Fix style issues 612c18c [Carson Wang] RDD partition table pagination for the RDD Page (cherry picked from commit cb7fa0aa93dae5a25a8e7e387dbd6b55a5a23fb0) Signed-off-by: Kousuke Saruta Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45c8d2bb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45c8d2bb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45c8d2bb Branch: refs/heads/branch-1.5 Commit: 45c8d2bb872bb905a402cf3aa78b1c4efaac07cf Parents: bd9b752 Author: Carson Wang Authored: Tue Aug 4 22:12:30 2015 +0900 Committer: Kousuke Saruta Committed: Tue Aug 4 22:12:54 2015 +0900 -- .../scala/org/apache/spark/ui/PagedTable.scala | 16 +- .../org/apache/spark/ui/jobs/StagePage.scala| 10 +- .../org/apache/spark/ui/storage/RDDPage.scala | 228 --- 3 files changed, 209 insertions(+), 45 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45c8d2bb/core/src/main/scala/org/apache/spark/ui/PagedTable.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala index 17d7b39..6e23754 100644 --- a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala +++ b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala @@ -159,9 +159,9 @@ private[ui] trait PagedTable[T] { // "goButtonJsFuncName" val formJs = s"""$$(function(){ - | $$( "#form-task-page" ).submit(function(event) { - |var page = $$("#form-task-page-no").val() - |var pageSize = $$("#form-task-page-size").val() + | $$( "#form-$tableId-page" ).submit(function(event) { + |var page = $$("#form-$tableId-page-no").val() + |var pageSize = $$("#form-$tableId-page-size").val() |pageSize = pageSize ? pageSize: 100; |if (page != "") { | ${goButtonJsFuncName}(page, pageSize); @@ -173,12 +173,14 @@ private[ui] trait PagedTable[T] { - + {totalPages} Pages. Jump to - + . Show - -tasks in a page. + +items in a page. 
Go http://git-wip-us.apache.org/repos/asf/spark/blob/45c8d2bb/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala index 3954c3d..0c94204 100644 --- a/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala +++ b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala @@ -988,8 +988,7 @@ private[ui] class TaskDataSource( shuffleRead, shuffleWrite, bytesSpilled, - errorMessage.getOrElse("") -) + errorMessage.getOrElse("")) } /** @@ -1197,7 +1196,7 @@ private[ui] class TaskPagedTable( private val displayPeakExecutionMemory = conf.getOption("spark.sql.unsafe.enabled").exists(_.toBoolean) - override def tableId: String = "" + override def tableId: String = "task-table" override def tableCssClass: String = "table table-bordered table-condensed table-striped" @@ -1212,8 +1211,7 @@ private[ui] class TaskPagedTable( currentTime, pageSize, sortColumn, -desc - ) +desc) override def pageLink(page: Int): String = { val encodedSortColumn = URLEncoder.encode(sortColumn, "UTF-8") @@ -1277,7 +1275,7 @@ private[ui] class TaskPagedTable( Seq(("Errors", "")) if (!taskHeadersAndCssClasses.map(_._1).contains(sortColumn)) { - new IllegalArgumentException(s"Unknown column: $sortColumn") + throw new IllegalArgumentException(s"Unknown column: $sortColumn") } val headerRow: Seq[Node] = { http://git-wip-us.apache.org/repos/asf/spark/blob/45c8d2bb/core/src/main/scala/org/apache/spark/ui/storage/RDDPage.scala -- diff --git a/core/s
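For readers unfamiliar with the PagedTable machinery being reused here, the underlying idea is simply slicing the full row sequence by (page, pageSize) on the server so the browser never renders every RDD partition at once. A conceptual sketch only, not the actual PagedTable/PagedDataSource API:

    // 1-based pages; returns the rows for the requested page plus the page count.
    def slicePage[T](rows: Seq[T], page: Int, pageSize: Int): (Seq[T], Int) = {
      require(page >= 1 && pageSize >= 1, "page and pageSize must be positive")
      val totalPages = (rows.size + pageSize - 1) / pageSize
      (rows.slice((page - 1) * pageSize, page * pageSize), totalPages)
    }

    // 1000 partitions shown 100 at a time: page 3 covers partitions 200 to 299.
    slicePage((0 until 1000).map(i => s"partition-$i"), page = 3, pageSize = 100)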
spark git commit: [SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page
Repository: spark Updated Branches: refs/heads/master b211cbc73 -> cb7fa0aa9 [SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page Add pagination for the RDD page to avoid unresponsive UI when the number of the RDD partitions is large. Before: ![rddpagebefore](https://cloud.githubusercontent.com/assets/9278199/8951533/3d9add54-3601-11e5-99d0-5653b473c49b.png) After: ![rddpageafter](https://cloud.githubusercontent.com/assets/9278199/8951536/439d66e0-3601-11e5-9cee-1b380fe6620d.png) Author: Carson Wang Closes #7692 from carsonwang/SPARK-2016 and squashes the following commits: 03c7168 [Carson Wang] Fix style issues 612c18c [Carson Wang] RDD partition table pagination for the RDD Page Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb7fa0aa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb7fa0aa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb7fa0aa Branch: refs/heads/master Commit: cb7fa0aa93dae5a25a8e7e387dbd6b55a5a23fb0 Parents: b211cbc Author: Carson Wang Authored: Tue Aug 4 22:12:30 2015 +0900 Committer: Kousuke Saruta Committed: Tue Aug 4 22:12:30 2015 +0900 -- .../scala/org/apache/spark/ui/PagedTable.scala | 16 +- .../org/apache/spark/ui/jobs/StagePage.scala| 10 +- .../org/apache/spark/ui/storage/RDDPage.scala | 228 --- 3 files changed, 209 insertions(+), 45 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cb7fa0aa/core/src/main/scala/org/apache/spark/ui/PagedTable.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala index 17d7b39..6e23754 100644 --- a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala +++ b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala @@ -159,9 +159,9 @@ private[ui] trait PagedTable[T] { // "goButtonJsFuncName" val formJs = s"""$$(function(){ - | $$( "#form-task-page" ).submit(function(event) { - |var page = $$("#form-task-page-no").val() - |var pageSize = $$("#form-task-page-size").val() + | $$( "#form-$tableId-page" ).submit(function(event) { + |var page = $$("#form-$tableId-page-no").val() + |var pageSize = $$("#form-$tableId-page-size").val() |pageSize = pageSize ? pageSize: 100; |if (page != "") { | ${goButtonJsFuncName}(page, pageSize); @@ -173,12 +173,14 @@ private[ui] trait PagedTable[T] { - + {totalPages} Pages. Jump to - + . Show - -tasks in a page. + +items in a page. 
Go http://git-wip-us.apache.org/repos/asf/spark/blob/cb7fa0aa/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala index 3954c3d..0c94204 100644 --- a/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala +++ b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala @@ -988,8 +988,7 @@ private[ui] class TaskDataSource( shuffleRead, shuffleWrite, bytesSpilled, - errorMessage.getOrElse("") -) + errorMessage.getOrElse("")) } /** @@ -1197,7 +1196,7 @@ private[ui] class TaskPagedTable( private val displayPeakExecutionMemory = conf.getOption("spark.sql.unsafe.enabled").exists(_.toBoolean) - override def tableId: String = "" + override def tableId: String = "task-table" override def tableCssClass: String = "table table-bordered table-condensed table-striped" @@ -1212,8 +1211,7 @@ private[ui] class TaskPagedTable( currentTime, pageSize, sortColumn, -desc - ) +desc) override def pageLink(page: Int): String = { val encodedSortColumn = URLEncoder.encode(sortColumn, "UTF-8") @@ -1277,7 +1275,7 @@ private[ui] class TaskPagedTable( Seq(("Errors", "")) if (!taskHeadersAndCssClasses.map(_._1).contains(sortColumn)) { - new IllegalArgumentException(s"Unknown column: $sortColumn") + throw new IllegalArgumentException(s"Unknown column: $sortColumn") } val headerRow: Seq[Node] = { http://git-wip-us.apache.org/repos/asf/spark/blob/cb7fa0aa/core/src/main/scala/org/apache/spark/ui/storage/RDDPage.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/storage/RDDPage.scala b/core/src/main/scala/org/apache/spark/ui/storage/RD
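The PagedTable change above is what makes the new RDD partition table possible: the page-jump form's element ids and the generated jQuery selectors are now derived from tableId rather than the hard-coded "form-task-page", and TaskPagedTable accordingly returns "task-table" instead of an empty string, so several paged tables can sit on one page without their navigation forms colliding. The following is only a rough Java sketch of that id-parameterization idea; the real implementation is the Scala PagedTable trait in the diff, and all names below are assumptions.

    // Rough sketch of deriving the page-jump form ids from a per-table id (illustration only).
    abstract class PagedTableSketch {

        // Unique id for this table, e.g. "task-table" on the stage page.
        abstract String tableId();

        // Every DOM id in the page-jump form is derived from tableId(), so two paged tables
        // rendered on the same page each wire their own "Go" button instead of all of them
        // binding to a single hard-coded "form-task-page" form.
        String pageNavigationJs(String goButtonJsFuncName) {
            String formId = "form-" + tableId() + "-page";
            return "$(function() {\n"
                + "  $('#" + formId + "').submit(function(event) {\n"
                + "    var page = $('#" + formId + "-no').val();\n"
                + "    var pageSize = $('#" + formId + "-size').val();\n"
                + "    pageSize = pageSize ? pageSize : 100;\n"
                + "    if (page != '') { " + goButtonJsFuncName + "(page, pageSize); }\n"
                + "    event.preventDefault();\n"
                + "  });\n"
                + "});";
        }

        public static void main(String[] args) {
            PagedTableSketch taskTable = new PagedTableSketch() {
                @Override
                String tableId() { return "task-table"; }
            };
            System.out.println(taskTable.pageNavigationJs("goToTaskPage"));
        }
    }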
spark git commit: [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted
Repository: spark Updated Branches: refs/heads/branch-1.5 5ae675360 -> bd9b75213 [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted This PR removes the dependency reduced POM hack brought back by #7191 Author: tedyu Closes #7919 from tedyu/master and squashes the following commits: 1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack (cherry picked from commit b211cbc7369af5eb2cb65d93c4c57c4db7143f47) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bd9b7521 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bd9b7521 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bd9b7521 Branch: refs/heads/branch-1.5 Commit: bd9b7521343c34c42be40ee05a01c8a976ed2307 Parents: 5ae6753 Author: tedyu Authored: Tue Aug 4 12:22:53 2015 +0100 Committer: Sean Owen Committed: Tue Aug 4 12:24:25 2015 +0100 -- pom.xml | 3 --- 1 file changed, 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bd9b7521/pom.xml -- diff --git a/pom.xml b/pom.xml index b4ee3cc..2bcc55b 100644 --- a/pom.xml +++ b/pom.xml @@ -179,9 +179,6 @@ 1.3.9 0.9.2 - -false - ${java.home}
spark git commit: [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted
Repository: spark Updated Branches: refs/heads/master 76d74090d -> b211cbc73 [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted This PR removes the dependency reduced POM hack brought back by #7191 Author: tedyu Closes #7919 from tedyu/master and squashes the following commits: 1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b211cbc7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b211cbc7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b211cbc7 Branch: refs/heads/master Commit: b211cbc7369af5eb2cb65d93c4c57c4db7143f47 Parents: 76d7409 Author: tedyu Authored: Tue Aug 4 12:22:53 2015 +0100 Committer: Sean Owen Committed: Tue Aug 4 12:23:04 2015 +0100 -- pom.xml | 3 --- 1 file changed, 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b211cbc7/pom.xml -- diff --git a/pom.xml b/pom.xml index b4ee3cc..2bcc55b 100644 --- a/pom.xml +++ b/pom.xml @@ -179,9 +179,6 @@ 1.3.9 0.9.2 - -false - ${java.home}
spark git commit: [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Repository: spark Updated Branches: refs/heads/branch-1.5 29f2d5a06 -> 5ae675360 [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. I'll explain several of the changes inline in comments. Author: Sean Owen Closes #7862 from srowen/SPARK-9534 and squashes the following commits: ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. (cherry picked from commit 76d74090d60f74412bd45487e8db6aff2e8343a2) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ae67536 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ae67536 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ae67536 Branch: refs/heads/branch-1.5 Commit: 5ae675360d883483e509788b8867c1c98b4820fd Parents: 29f2d5a Author: Sean Owen Authored: Tue Aug 4 12:02:26 2015 +0100 Committer: Sean Owen Committed: Tue Aug 4 12:02:45 2015 +0100 -- .../java/JavaSparkContextVarargsWorkaround.java | 19 +++--- .../spark/storage/TachyonBlockManager.scala | 9 ++- .../deploy/master/PersistenceEngineSuite.scala | 13 ++-- .../mesos/MesosSchedulerUtilsSuite.scala| 3 + .../spark/examples/ml/JavaOneVsRestExample.java | 1 + .../streaming/JavaStatefulNetworkWordCount.java | 4 +- .../kafka/JavaDirectKafkaStreamSuite.java | 2 +- .../evaluation/JavaRankingMetricsSuite.java | 14 ++--- .../ml/classification/NaiveBayesSuite.scala | 4 +- .../network/protocol/ChunkFetchFailure.java | 5 ++ .../network/protocol/ChunkFetchRequest.java | 5 ++ .../network/protocol/ChunkFetchSuccess.java | 5 ++ .../spark/network/protocol/RpcFailure.java | 5 ++ .../spark/network/protocol/RpcRequest.java | 5 ++ .../spark/network/protocol/RpcResponse.java | 5 ++ .../apache/spark/network/TestManagedBuffer.java | 5 ++ .../spark/network/sasl/SparkSaslSuite.java | 16 ++--- .../ExternalShuffleBlockHandlerSuite.java | 6 +- .../shuffle/RetryingBlockFetcherSuite.java | 47 --- pom.xml | 4 ++ .../apache/spark/sql/JavaDataFrameSuite.java| 1 + .../spark/sql/sources/TableScanSuite.scala | 16 +++-- .../sql/hive/client/IsolatedClientLoader.scala | 2 +- .../spark/sql/hive/execution/commands.scala | 2 +- .../spark/sql/hive/JavaDataFrameSuite.java | 2 +- .../sql/hive/execution/SQLQuerySuite.scala | 4 +- .../streaming/scheduler/JobScheduler.scala | 4 +- .../spark/streaming/JavaWriteAheadLogSuite.java | 62 ++-- .../spark/streaming/UISeleniumSuite.scala | 4 +- 29 files changed, 167 insertions(+), 107 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5ae67536/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java -- diff --git a/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java b/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java index 2090efd..d4c42b3 100644 --- a/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java +++ b/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java @@ -23,11 +23,13 @@ import java.util.List; // See // http://scala-programming-language.1934581.n4.nabble.com/Workaround-for-implementing-java-varargs-in-2-7-2-final-tp1944767p1944772.html abstract class JavaSparkContextVarargsWorkaround { - public JavaRDD union(JavaRDD... 
rdds) { + + @SafeVarargs + public final JavaRDD union(JavaRDD... rdds) { if (rdds.length == 0) { throw new IllegalArgumentException("Union called on empty list"); } -ArrayList> rest = new ArrayList>(rdds.length - 1); +List> rest = new ArrayList<>(rdds.length - 1); for (int i = 1; i < rdds.length; i++) { rest.add(rdds[i]); } @@ -38,18 +40,19 @@ abstract class JavaSparkContextVarargsWorkaround { if (rdds.length == 0) { throw new IllegalArgumentException("Union called on empty list"); } -ArrayList rest = new ArrayList(rdds.length - 1); +List rest = new ArrayList<>(rdds.length - 1); for (int i = 1; i < rdds.length; i++) { rest.add(rdds[i]); } return union(rdds[0], rest); } - public JavaPairRDD union(JavaPairRDD... rdds) { + @SafeVarargs + public final JavaPairRDD union(JavaPairRDD... rdds) { if (rdds.length == 0) { throw new IllegalArgumentException("Union called on empty list"); } -ArrayList> rest = n
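The JavaSparkContextVarargsWorkaround hunk above is representative of many of the javac lint fixes in this commit (note that the archive rendering appears to have stripped the generic type arguments from the diff, which is why lines read "JavaRDD union(JavaRDD... rdds)"). A generic varargs method produces warnings about heap pollution at the declaration and unchecked generic array creation at each call site; @SafeVarargs suppresses both, and in Java 7/8 the annotation is only legal on static or final methods and constructors, which is why the union(...) methods above were also made final. A self-contained toy example, not a Spark API:

    // Toy example of the @SafeVarargs pattern applied above (not a Spark API).
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class VarargsDemo {

        // Without @SafeVarargs, javac (with lint enabled) warns about possible heap pollution
        // here and about unchecked generic array creation at every call site. In Java 7/8 the
        // annotation is only allowed on static or final methods and constructors.
        @SafeVarargs
        static <T> List<T> concat(List<T>... lists) {
            List<T> result = new ArrayList<>();
            for (List<T> list : lists) {
                result.addAll(list);
            }
            return result;
        }

        public static void main(String[] args) {
            List<String> merged = concat(Arrays.asList("a", "b"), Arrays.asList("c"));
            System.out.println(merged); // prints [a, b, c]
        }
    }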
spark git commit: [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Repository: spark Updated Branches: refs/heads/master 9e952ecbc -> 76d74090d [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. I'll explain several of the changes inline in comments. Author: Sean Owen Closes #7862 from srowen/SPARK-9534 and squashes the following commits: ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76d74090 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76d74090 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76d74090 Branch: refs/heads/master Commit: 76d74090d60f74412bd45487e8db6aff2e8343a2 Parents: 9e952ec Author: Sean Owen Authored: Tue Aug 4 12:02:26 2015 +0100 Committer: Sean Owen Committed: Tue Aug 4 12:02:26 2015 +0100 -- .../java/JavaSparkContextVarargsWorkaround.java | 19 +++--- .../spark/storage/TachyonBlockManager.scala | 9 ++- .../deploy/master/PersistenceEngineSuite.scala | 13 ++-- .../mesos/MesosSchedulerUtilsSuite.scala| 3 + .../spark/examples/ml/JavaOneVsRestExample.java | 1 + .../streaming/JavaStatefulNetworkWordCount.java | 4 +- .../kafka/JavaDirectKafkaStreamSuite.java | 2 +- .../evaluation/JavaRankingMetricsSuite.java | 14 ++--- .../ml/classification/NaiveBayesSuite.scala | 4 +- .../network/protocol/ChunkFetchFailure.java | 5 ++ .../network/protocol/ChunkFetchRequest.java | 5 ++ .../network/protocol/ChunkFetchSuccess.java | 5 ++ .../spark/network/protocol/RpcFailure.java | 5 ++ .../spark/network/protocol/RpcRequest.java | 5 ++ .../spark/network/protocol/RpcResponse.java | 5 ++ .../apache/spark/network/TestManagedBuffer.java | 5 ++ .../spark/network/sasl/SparkSaslSuite.java | 16 ++--- .../ExternalShuffleBlockHandlerSuite.java | 6 +- .../shuffle/RetryingBlockFetcherSuite.java | 47 --- pom.xml | 4 ++ .../apache/spark/sql/JavaDataFrameSuite.java| 1 + .../spark/sql/sources/TableScanSuite.scala | 16 +++-- .../sql/hive/client/IsolatedClientLoader.scala | 2 +- .../spark/sql/hive/execution/commands.scala | 2 +- .../spark/sql/hive/JavaDataFrameSuite.java | 2 +- .../sql/hive/execution/SQLQuerySuite.scala | 4 +- .../streaming/scheduler/JobScheduler.scala | 4 +- .../spark/streaming/JavaWriteAheadLogSuite.java | 62 ++-- .../spark/streaming/UISeleniumSuite.scala | 4 +- 29 files changed, 167 insertions(+), 107 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76d74090/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java -- diff --git a/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java b/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java index 2090efd..d4c42b3 100644 --- a/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java +++ b/core/src/main/java/org/apache/spark/api/java/JavaSparkContextVarargsWorkaround.java @@ -23,11 +23,13 @@ import java.util.List; // See // http://scala-programming-language.1934581.n4.nabble.com/Workaround-for-implementing-java-varargs-in-2-7-2-final-tp1944767p1944772.html abstract class JavaSparkContextVarargsWorkaround { - public JavaRDD union(JavaRDD... rdds) { + + @SafeVarargs + public final JavaRDD union(JavaRDD... 
rdds) { if (rdds.length == 0) { throw new IllegalArgumentException("Union called on empty list"); } -ArrayList> rest = new ArrayList>(rdds.length - 1); +List> rest = new ArrayList<>(rdds.length - 1); for (int i = 1; i < rdds.length; i++) { rest.add(rdds[i]); } @@ -38,18 +40,19 @@ abstract class JavaSparkContextVarargsWorkaround { if (rdds.length == 0) { throw new IllegalArgumentException("Union called on empty list"); } -ArrayList rest = new ArrayList(rdds.length - 1); +List rest = new ArrayList<>(rdds.length - 1); for (int i = 1; i < rdds.length; i++) { rest.add(rdds[i]); } return union(rdds[0], rest); } - public JavaPairRDD union(JavaPairRDD... rdds) { + @SafeVarargs + public final JavaPairRDD union(JavaPairRDD... rdds) { if (rdds.length == 0) { throw new IllegalArgumentException("Union called on empty list"); } -ArrayList> rest = new ArrayList>(rdds.length - 1); +List> rest = new ArrayList<>(rdds.length - 1); for (int i = 1;
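The other pattern repeated throughout this commit, visible in the tail of the hunk above, is declaring locals against the List interface and using the diamond operator instead of spelling out ArrayList's type arguments on both sides of the assignment; behaviour is unchanged, the repeated type arguments simply go away. A toy before/after sketch, not Spark code:

    // Toy before/after sketch of the "List + diamond" cleanup from the hunk above.
    import java.util.ArrayList;
    import java.util.List;

    class DiamondDemo {

        static List<String> copyTail(String[] items) {
            if (items.length == 0) {
                throw new IllegalArgumentException("copyTail called on empty array");
            }
            // Old style: ArrayList<String> rest = new ArrayList<String>(items.length - 1);
            // New style: declare against the List interface and let <> infer the type argument.
            List<String> rest = new ArrayList<>(items.length - 1);
            for (int i = 1; i < items.length; i++) {
                rest.add(items[i]);
            }
            return rest;
        }

        public static void main(String[] args) {
            System.out.println(copyTail(new String[] {"head", "a", "b"})); // prints [a, b]
        }
    }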