+1. Ran some regression tests on Spark on YARN (Hadoop 2.6 and 2.7).
Tom
On Wednesday, December 16, 2015 3:32 PM, Michael Armbrust
<[email protected]> wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=========================================
How can I help test this release?
=========================================
If you are a Spark user, you can help us test this release by taking an existing
Spark workload, running it on this release candidate, and reporting any
regressions.
==================================================
What justifies a -1 vote for this release?
==================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should
only occur for significant regressions from 1.5. Bugs already present in 1.5,
minor regressions, or bugs related to new features will not block this release.
=================================================================
What should happen to JIRA tickets still targeting 1.6.0?
=================================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the release.
2. New features for non-alpha modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.
====================================================
Major changes to help you focus your testing
====================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
execution.
   - SPARK-12258 Correctly pass null into ScalaUDF
Notable Features Since 1.5
Spark SQL
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when
using flat schemas.
   - SPARK-10810 Session Management - Isolated default database (i.e. USE mydb)
even on shared clusters.
   - SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs
many operations on serialized binary data and uses code generation (i.e. Project
Tungsten). A short sketch follows this list.
- SPARK-10000 Unified Memory Management - Shared memory for execution and
caching instead of exclusive division of the regions.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
over files of any supported format without registering a table.
- SPARK-11745 Reading non-standard JSON files - Added options to read
non-standard JSON files (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on
a per-operator basis for memory usage and spilled data size.
   - SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
and unnest arbitrary numbers of columns
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
Significant (up to 14x) speed up when caching data that contains complex types
in DataFrames or SQL.
   - SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
will now execute using SortMergeJoin instead of computing a Cartesian product.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
query execution to occur using off-heap memory to avoid GC overhead
   - SPARK-10978 Datasource API Avoid Double Filter - When implementing a
datasource with filter pushdown, developers can now tell Spark SQL to avoid
double evaluating a pushed-down filter.
   - SPARK-4849 Advanced Layout of Cached Data - Storing partitioning and
ordering schemes in the in-memory table scan, and adding distributeBy and
localSort to the DataFrame API.
   - SPARK-9858 Adaptive query execution - Initial support for automatically
selecting the number of reducers for joins and aggregations.
- SPARK-9241 Improved query planner for queries having distinct
aggregations - Query plans of distinct aggregations are more robust when
distinct columns have high cardinality.
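For anyone who wants a quick way to exercise the new Dataset API (SPARK-9999)
and the SQL-on-files syntax (SPARK-11197) while testing this RC, here is a
minimal Scala sketch; the object name, record type, sample data, file path, and
local master are made up for illustration and are not part of the release:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, purely for illustration.
case class Person(name: String, age: Long)

object DatasetSmokeTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rc3-dataset-smoke").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // SPARK-9999: type-safe Dataset operations with lambdas instead of Column expressions.
    val ds = Seq(Person("alice", 30), Person("bob", 15)).toDS()
    val adultNames = ds.filter(_.age >= 18).map(_.name)
    adultNames.collect().foreach(println)

    // SPARK-11197: SQL directly over files, no table registration needed
    // (assumes a Parquet file already exists at this hypothetical path):
    // sqlContext.sql("SELECT * FROM parquet.`/tmp/people.parquet`").show()

    sc.stop()
  }
}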
Spark Streaming
- API Updates
      - SPARK-2629 New improved state management - mapWithState - a DStream
transformation for stateful stream processing, supersedes updateStateByKey in
functionality and performance (a short sketch follows this section).
      - SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
upgraded to use KCL 1.4.0 and support transparent deaggregation of
KPL-aggregated records.
      - SPARK-10891 Kinesis message handler function - Allows an arbitrary
function to be applied to a Kinesis record in the Kinesis receiver to customize
what data is stored in memory.
      - SPARK-6328 Python Streaming Listener API - Get streaming statistics
(scheduling delays, batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch
list, and batch details page.
- Made output operations visible in the streaming tab as progress bars.
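If you want to smoke-test the new mapWithState API (SPARK-2629) on this RC, a
minimal Scala sketch follows; the socket source, port, checkpoint directory,
and object name are made up for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rc3-mapWithState-smoke").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/rc3-checkpoint")  // hypothetical checkpoint directory

    // Hypothetical input: words arriving on a local socket (e.g. `nc -lk 9999`).
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    // The mapping function receives the key, the new value (if any), and the running
    // state, and emits a mapped record; state.update keeps the running count.
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }
    val runningCounts = pairs.mapWithState(StateSpec.function(mappingFunc))
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}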
MLlib
New algorithms/models
- SPARK-8518 Survival analysis - Log-linear model for survival analysis
- SPARK-9834 Normal equation for least squares - Normal equation solver,
providing R-like model summary statistics
- SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming
framework
- SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer,
SQL transformer
- SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering
variant of K-Means
API improvements
- ML Pipelines
      - SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with
partial coverage of spark.ml algorithms
- SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation
in ML Pipelines
- R API
- SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for
ordinary least squares via summary(model)
- SPARK-9681 Feature interactions in R formula - Interaction operator
":" in R formula
- Python API - Many improvements to Python API to approach feature parity
Misc improvements
   - SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear
Regression can take instance weights
   - SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames
- Variance, stddev, correlations, etc. (a short sketch follows this list)
- SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
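For the DataFrame statistics (SPARK-10384, SPARK-10385) and the LIBSVM data
source (SPARK-10117), a minimal Scala sketch is below; the object name, sample
numbers, and file path are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{corr, stddev, variance}

object DataFrameStatsSmokeTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rc3-stats-smoke").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // SPARK-10384 / SPARK-10385: univariate and bivariate statistics as DataFrame aggregates.
    val df = Seq((1.0, 2.0), (2.0, 4.5), (3.0, 6.1)).toDF("x", "y")
    df.agg(variance($"x"), stddev($"x"), corr($"x", $"y")).show()

    // SPARK-10117: LIBSVM as a regular SQL data source
    // (assumes a LIBSVM-formatted file exists at this hypothetical path):
    // sqlContext.read.format("libsvm").load("/tmp/sample_libsvm_data.txt").show()

    sc.stop()
  }
}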
Documentation improvements
- SPARK-7751 @since versions - Documentation includes initial version when
classes and methods were added
- SPARK-11337 Testable example code - Automated testing for code in user
guide examples
Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
- In spark.ml.classification.LogisticRegressionModel and
spark.ml.regression.LinearRegressionModel, the "weights" field has been
deprecated, in favor of the new name "coefficients." This helps disambiguate
from instance (row) weights given to algorithms.
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics
in 1.6. Previously, it was a threshold for absolute change in error. Now, it
resembles the behavior of GradientDescent convergenceTol: For large errors, it
uses relative error (relative to the previous error); for small errors (<
0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to
lowercase before tokenizing. Now, it converts to lowercase by default, with an
option not to. This matches the behavior of the simpler Tokenizer transformer.
   - Spark SQL's partition discovery has been changed to only discover
partition directories that are children of the given path. (i.e. if
path="/my/data/x=1" then x=1 will no longer be considered a partition but only
children of x=1.) This behavior can be overridden by manually specifying the
basePath that partition discovery should start with (SPARK-11678). A combined
sketch of the SQL behavior changes follows this list.
- When casting a value of an integral type to timestamp (e.g. casting a long
value to timestamp), the value is treated as being in seconds instead of
milliseconds (SPARK-11724).
- With the improved query planner for queries having distinct aggregations
(SPARK-9241), the plan of a query having a single distinct aggregation has been
changed to a more robust version. To switch back to the plan generated by Spark
1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true
(SPARK-12077).
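To see the SQL behavior changes above in one place, here is a minimal Scala
sketch; the object name and paths are hypothetical, and the partition-discovery
call is commented out because it assumes pre-existing partitioned Parquet data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BehaviorChangeSmokeTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rc3-behavior-smoke").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // SPARK-11678: partition discovery now stops at the given path; pass basePath so
    // that x=1 is still treated as a partition column when reading a single directory.
    // sqlContext.read.option("basePath", "/my/data").parquet("/my/data/x=1").printSchema()

    // SPARK-11724: integral values cast to timestamp are now interpreted as seconds,
    // so this prints a date in 2015 rather than in 1970.
    sqlContext.range(1).selectExpr("CAST(1450000000 AS TIMESTAMP)").show()

    // SPARK-12077: fall back to the 1.5-style plan for single distinct aggregations.
    sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")

    sc.stop()
  }
}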