Please vote on releasing the following candidate as Apache Spark version
1.6.0!
The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc4
(4062cda3087ae42c6c3cb24508fc1d3a931accdf)
<https://github.com/apache/spark/tree/v1.6.0-rc4>
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1176/
The test repository (versioned as v1.6.0-rc4) for this release can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1175/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this release candidate, and
reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will
not block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC3
- SPARK-12404 - Fix serialization error for Datasets with
Timestamps/Arrays/Decimal
- SPARK-12218 - Fix incorrect pushdown of filters to parquet
- SPARK-12395 - Fix join columns of outer join for DataFrame using
- SPARK-12413 - Fix Mesos HA
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
* SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
|trackStateByKey| has been renamed to |mapWithState|
Spark SQL
* SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
bugs in eviction of storage memory by execution.
* SPARK-12258
<https://issues.apache.org/jira/browse/SPARK-12258> correct passing
null into ScalaUDF
Notable Features Since 1.5
Spark SQL
* SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
Parquet Performance - Improve Parquet scan performance when using
flat schemas.
* SPARK-10810
<https://issues.apache.org/jira/browse/SPARK-10810> Session
Management - Isolated default database (i.e. |USE mydb|) even on
shared clusters.
* SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
Dataset API - A type-safe API (similar to RDDs) that performs many
operations on serialized binary data and code generation (i.e.
Project Tungsten).
* SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
Unified Memory Management - Shared memory for execution and caching
instead of exclusive division of the regions.
* SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
Queries on Files - Concise syntax for running SQL queries over files
of any supported format without registering a table.
* SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
Reading non-standard JSON files - Added options to read non-standard
JSON files (e.g. single-quotes, unquoted attributes)
* SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
Per-operator Metrics for SQL Execution - Display statistics on a
per-operator basis for memory usage and spilled data size.
* SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
(*) expansion for StructTypes - Makes it easier to nest and unnest
arbitrary numbers of columns
* SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
In-memory Columnar Cache Performance - Significant (up to 14x) speed
up when caching data that contains complex types in DataFrames or
SQL.
* SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
null-safe joins - Joins using null-safe equality (|<=>|) will now
execute using SortMergeJoin instead of computing a Cartesian
product.
* SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
Execution Using Off-Heap Memory - Support for configuring query
execution to occur using off-heap memory to avoid GC overhead
* SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
Datasource API Avoid Double Filter - When implementing a datasource
with filter pushdown, developers can now tell Spark SQL to avoid
double evaluating a pushed-down filter.
* SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
Advanced Layout of Cached Data - storing partitioning and ordering
schemes in In-memory table scan, and adding distributeBy and
localSort to DF API
* SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
Adaptive query execution - Initial support for automatically
selecting the number of reducers for joins and aggregations.
* SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
Improved query planner for queries having distinct aggregations -
Query plans of distinct aggregations are more robust when distinct
columns have high cardinality.
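For testers exercising the null-safe join item above, the semantics of
the |<=>| operator can be modeled in plain Python, with None standing
in for SQL NULL. This is only a sketch of the comparison rules, not
Spark code; the function names are made up for illustration.

```python
from typing import Any, Optional


def sql_equal(a: Any, b: Any) -> Optional[bool]:
    """Model of SQL `a = b`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equal(a: Any, b: Any) -> bool:
    """Model of SQL `a <=> b`: always True or False, and NULL <=> NULL is True."""
    if a is None or b is None:
        return a is None and b is None
    return a == b
```

A join keyed on |<=>| therefore matches rows whose keys are both NULL,
which a join keyed on plain |=| would drop.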
Spark Streaming
* API Updates
o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
New improved state management - |mapWithState| - a DStream
transformation for stateful stream processing, supersedes
|updateStateByKey| in functionality and performance.
o SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
Kinesis record deaggregation - Kinesis streams have been
upgraded to use KCL 1.4.0 and support transparent deaggregation
of KPL-aggregated records.
o SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
Kinesis message handler function - Allows an arbitrary function to
be applied to a Kinesis record in the Kinesis receiver to
customize what data is to be stored in memory.
o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
Python Streaming Listener API - Get streaming statistics
(scheduling delays, batch processing times, etc.) in streaming.
* UI Improvements
o Made failures visible in the streaming tab, in the timelines,
batch list, and batch details page.
o Made output operations visible in the streaming tab as progress
bars.
MLlib
New algorithms/models
* SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
Survival analysis - Log-linear model for survival analysis
* SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
equation for least squares - Normal equation solver, providing
R-like model summary statistics
* SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
hypothesis testing - A/B testing in the Spark Streaming framework
* SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
transformer
* SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
Bisecting K-Means clustering - Fast top-down clustering variant of
K-Means
API improvements
* ML Pipelines
o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
Pipeline persistence - Save/load for ML Pipelines, with partial
coverage of spark.ml algorithms
o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
Pipelines
* R API
o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
R-like statistics for GLMs - (Partial) R-like stats for ordinary
least squares via summary(model)
o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
Feature interactions in R formula - Interaction operator ":" in
R formula
* Python API - Many improvements to Python API to approach feature
parity
Misc improvements
* SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
Instance weights for GLMs - Logistic and Linear Regression can take
instance weights
* SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
Univariate and bivariate statistics in DataFrames - Variance,
stddev, correlations, etc.
* SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
* SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
versions - Documentation includes initial version when classes and
methods were added
* SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
Testable example code - Automated testing for code in user guide
examples
Deprecations
* In spark.mllib.clustering.KMeans, the "runs" parameter has been
deprecated.
* In spark.ml.classification.LogisticRegressionModel and
spark.ml.regression.LinearRegressionModel, the "weights" field has
been deprecated, in favor of the new name "coefficients." This helps
disambiguate from instance (row) weights given to algorithms.
Changes of behavior
* spark.mllib.tree.GradientBoostedTrees validationTol has changed
semantics in 1.6. Previously, it was a threshold for absolute change
in error. Now, it resembles the behavior of GradientDescent
convergenceTol: For large errors, it uses relative error (relative
to the previous error); for small errors (< 0.01), it uses absolute
error.
* spark.ml.feature.RegexTokenizer: Previously, it did not convert
strings to lowercase before tokenizing. Now, it converts to
lowercase by default, with an option not to. This matches the
behavior of the simpler Tokenizer transformer.
* Spark SQL's partition discovery has been changed to only discover
partition directories that are children of the given path. (e.g. if
|path="/my/data/x=1"| then |x=1| will no longer be considered a
partition but only children of |x=1|.) This behavior can be
overridden by manually specifying the |basePath| that partitioning
discovery should start with (SPARK-11678
<https://issues.apache.org/jira/browse/SPARK-11678>).
* When casting a value of an integral type to timestamp (e.g. casting
a long value to timestamp), the value is treated as being in seconds
instead of milliseconds (SPARK-11724
<https://issues.apache.org/jira/browse/SPARK-11724>).
* With the improved query planner for queries having distinct
aggregations (SPARK-9241
<https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
query having a single distinct aggregation has been changed to a
more robust version. To switch back to the plan generated by Spark
1.5's planner, please set
|spark.sql.specializeSingleDistinctAggPlanning| to
|true| (SPARK-12077
<https://issues.apache.org/jira/browse/SPARK-12077>).
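Two of the behavior changes above, the integral-to-timestamp cast and
partition discovery, are easy to get wrong when porting a 1.5
workload, so here is a rough plain-Python model of each. Nothing below
is Spark code: the function names, the |unit| parameter, and the
example file path are made up for illustration, and the partition
model assumes simple |name=value| directory names.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Dict


def cast_long_to_timestamp(value: int, unit: str = "seconds") -> datetime:
    """Interpret an integral value as a UTC timestamp.

    unit="seconds" models the 1.6 behavior described above;
    unit="milliseconds" models the earlier interpretation.
    """
    if unit == "seconds":
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if unit == "milliseconds":
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    raise ValueError(f"unknown unit: {unit}")


def discover_partitions(file_path: str, base_path: str) -> Dict[str, str]:
    """Collect name=value directories strictly below base_path.

    Directories at or above base_path are ignored, which models why
    x=1 is not a partition when the given path is /my/data/x=1.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_path))
    partitions: Dict[str, str] = {}
    for part in relative.parts[:-1]:  # the last component is the file name
        if "=" in part:
            name, value = part.split("=", 1)
            partitions[name] = value
    return partitions


# The same long value lands in very different years under the two readings.
as_seconds = cast_long_to_timestamp(1450000000)
as_millis = cast_long_to_timestamp(1450000000, unit="milliseconds")

# Starting discovery at the given path drops x; starting at a parent keeps it.
data_file = "/my/data/x=1/y=2/part-0.parquet"
from_given_path = discover_partitions(data_file, "/my/data/x=1")
from_base_path = discover_partitions(data_file, "/my/data")
```

In this model, passing the parent directory as the second argument
plays the role that the |basePath| option plays in the note above.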