+1

On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee
<denny.g....@gmail.com> wrote:

+1

On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson
<ilike...@gmail.com> wrote:

+1

On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen
<joshro...@databricks.com> wrote:

+1

On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang
<zjf...@gmail.com> wrote:

+1

On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra
<m...@clearstorydata.com> wrote:

+1

On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust
<mich...@databricks.com> wrote:
Please vote on releasing the following candidate
as Apache Spark version 1.6.0!

The vote is open until Friday, December 25, 2015
at 18:00 UTC and passes if a majority of at least
3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see
http://spark.apache.org/
The tag to be voted on is v1.6.0-rc4
(4062cda3087ae42c6c3cb24508fc1d3a931accdf):
https://github.com/apache/spark/tree/v1.6.0-rc4
The release files, including signatures,
digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
Release artifacts are signed with the
following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1176/
The test repository (versioned as v1.6.0-rc4)
for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1175/
The documentation corresponding to this
release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test
this release by taking an existing Spark
workload and running it on this release
candidate, then reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the
1.6 QA period, so -1 votes should only occur
for significant regressions from 1.5. Bugs
already present in 1.5, minor regressions, or
bugs related to new features will not block
this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to
target 1.6.0 and still go into branch-1.6,
since the documentation is published
separately from the release.
2. New features for non-alpha modules should
target 1.7+.
3. Non-blocker bug fixes should target 1.6.1
or 1.7.0, or drop the target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC3
- SPARK-12404 - Fix serialization error for
Datasets with Timestamps/Arrays/Decimal
- SPARK-12218 - Fix incorrect pushdown of
filters to parquet
- SPARK-12395 - Fix join columns of outer joins
for DataFrame.join with usingColumns
- SPARK-12413 - Fix Mesos HA
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
* SPARK-2629
<https://issues.apache.org/jira/browse/SPARK-2629>
|trackStateByKey| has been renamed to
|mapWithState|
Spark SQL
* SPARK-12165
<https://issues.apache.org/jira/browse/SPARK-12165>
SPARK-12189
<https://issues.apache.org/jira/browse/SPARK-12189>
Fix bugs in eviction of storage memory
by execution.
* SPARK-12258
<https://issues.apache.org/jira/browse/SPARK-12258>
Correct passing null into ScalaUDF.
Notable Features Since 1.5
Spark SQL
* SPARK-11787
<https://issues.apache.org/jira/browse/SPARK-11787>
Parquet Performance - Improve Parquet scan
performance when using flat schemas.
* SPARK-10810
<https://issues.apache.org/jira/browse/SPARK-10810>
Session Management - Isolated default
database (i.e. |USE mydb|) even on shared
clusters.
* SPARK-9999
<https://issues.apache.org/jira/browse/SPARK-9999>
Dataset API - A type-safe API (similar to
RDDs) that performs many operations on
serialized binary data and uses code
generation (i.e. Project Tungsten); see
the sketch after this list.
* SPARK-10000
<https://issues.apache.org/jira/browse/SPARK-10000>
Unified Memory Management - Shared memory
for execution and caching instead of
exclusive division of the regions.
* SPARK-11197
<https://issues.apache.org/jira/browse/SPARK-11197>
SQL Queries on Files - Concise syntax for
running SQL queries over files of any
supported format without registering a
table; see the sketch after this list.
* SPARK-11745
<https://issues.apache.org/jira/browse/SPARK-11745>
Reading non-standard JSON files - Added
options to read non-standard JSON files
(e.g. single-quotes, unquoted attributes)
* SPARK-10412
<https://issues.apache.org/jira/browse/SPARK-10412>
Per-operator Metrics for SQL Execution -
Display statistics on a per-operator basis
for memory usage and spilled data size.
* SPARK-11329
<https://issues.apache.org/jira/browse/SPARK-11329>
Star (*) expansion for StructTypes - Makes
it easier to nest and unnest arbitrary
numbers of columns.
* SPARK-10917
<https://issues.apache.org/jira/browse/SPARK-10917>,
SPARK-11149
<https://issues.apache.org/jira/browse/SPARK-11149>
In-memory Columnar Cache Performance -
Significant (up to 14x) speed up when
caching data that contains complex types
in DataFrames or SQL.
* SPARK-11111
<https://issues.apache.org/jira/browse/SPARK-11111>
Fast null-safe joins - Joins using
null-safe equality (|<=>|) will now
execute using SortMergeJoin instead of
computing a Cartesian product.
* SPARK-11389
<https://issues.apache.org/jira/browse/SPARK-11389>
SQL Execution Using Off-Heap Memory -
Support for configuring query execution to
occur using off-heap memory to avoid GC
overhead.
* SPARK-10978
<https://issues.apache.org/jira/browse/SPARK-10978>
Datasource API Avoid Double Filter - When
implementing a datasource with filter
pushdown, developers can now tell Spark
SQL to avoid double evaluating a
pushed-down filter.
* SPARK-4849
<https://issues.apache.org/jira/browse/SPARK-4849>
Advanced Layout of Cached Data - Store
partitioning and ordering schemes in the
in-memory table scan, and add distributeBy
and localSort to the DataFrame API.
* SPARK-9858
<https://issues.apache.org/jira/browse/SPARK-9858>
Adaptive query execution - Initial support
for automatically selecting the number of
reducers for joins and aggregations.
* SPARK-9241
<https://issues.apache.org/jira/browse/SPARK-9241>
Improved query planner for queries having
distinct aggregations - Query plans of
distinct aggregations are more robust when
distinct columns have high cardinality.
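
Two of the Spark SQL features above are easiest to see in code. First, a minimal sketch of the Dataset API (SPARK-9999); the case class and values are illustrative, and |sqlContext| is assumed to be an existing SQLContext:

    // Type-safe operations on a Dataset[Person]
    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()
    // Lambdas are checked at compile time, unlike untyped
    // DataFrame expressions
    val adults = ds.filter(_.age >= 18).map(_.name)
    adults.show()

Second, a sketch of the SQL-on-files syntax (SPARK-11197); the path is a placeholder:

    // Query a Parquet file directly, without registering a table
    val df = sqlContext.sql(
      "SELECT * FROM parquet.`/path/to/events.parquet`")
    df.show()
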
Spark Streaming
* API Updates
o SPARK-2629
<https://issues.apache.org/jira/browse/SPARK-2629>
New improved state management -
|mapWithState| - a DStream
transformation for stateful stream
processing that supersedes
|updateStateByKey| in functionality
and performance; see the sketch at
the end of this section.
o SPARK-11198
<https://issues.apache.org/jira/browse/SPARK-11198>
Kinesis record deaggregation - Kinesis
streams have been upgraded to use KCL
1.4.0 and support transparent
deaggregation of KPL-aggregated records.
o SPARK-10891
<https://issues.apache.org/jira/browse/SPARK-10891>
Kinesis message handler function -
Allows an arbitrary function to be
applied to a Kinesis record in the
Kinesis receiver to customize what
data is stored in memory.
o SPARK-6328
<https://issues.apache.org/jira/browse/SPARK-6328>
Python Streaming Listener API - Get
streaming statistics (scheduling
delays, batch processing times, etc.)
from Python.
* UI Improvements
o Made failures visible in the streaming
tab, in the timelines, batch list, and
batch details page.
o Made output operations visible in the
streaming tab as progress bars.
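
A minimal sketch of the new |mapWithState| API; |wordCounts| is a placeholder for an existing DStream[(String, Int)]:

    import org.apache.spark.streaming.{State, StateSpec}

    // Keep a running sum per key in Spark-managed state
    val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
      val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val stateDStream = wordCounts.mapWithState(StateSpec.function(mappingFunc))
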
MLlib
New algorithms/models
* SPARK-8518
<https://issues.apache.org/jira/browse/SPARK-8518>
Survival analysis - Log-linear model for
survival analysis
* SPARK-9834
<https://issues.apache.org/jira/browse/SPARK-9834>
Normal equation for least squares - Normal
equation solver, providing R-like model
summary statistics
* SPARK-3147
<https://issues.apache.org/jira/browse/SPARK-3147>
Online hypothesis testing - A/B testing in
the Spark Streaming framework
* SPARK-9930
<https://issues.apache.org/jira/browse/SPARK-9930>
New feature transformers - ChiSqSelector,
QuantileDiscretizer, SQL transformer
* SPARK-6517
<https://issues.apache.org/jira/browse/SPARK-6517>
Bisecting K-Means clustering - Fast
top-down clustering variant of K-Means;
see the sketch after this list.
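
A minimal sketch of the new bisecting k-means API; |sc| is assumed to be an existing SparkContext and the vectors are illustrative:

    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Two well-separated groups of points
    val data = sc.parallelize(Seq(
      Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 8.8)))

    val model = new BisectingKMeans().setK(2).run(data)
    model.clusterCenters.foreach(println)
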
API improvements
* ML Pipelines
o SPARK-6725
<https://issues.apache.org/jira/browse/SPARK-6725>
Pipeline persistence - Save/load for
ML Pipelines, with partial coverage of
spark.ml algorithms; see the sketch
after this list.
o SPARK-5565
<https://issues.apache.org/jira/browse/SPARK-5565>
LDA in ML Pipelines - API for Latent
Dirichlet Allocation in ML Pipelines
* R API
o SPARK-9836
<https://issues.apache.org/jira/browse/SPARK-9836>
R-like statistics for GLMs - (Partial)
R-like stats for ordinary least
squares via summary(model)
o SPARK-9681
<https://issues.apache.org/jira/browse/SPARK-9681>
Feature interactions in R formula -
Interaction operator ":" in R formula
* Python API - Many improvements to Python
API to approach feature parity
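
A minimal sketch of pipeline persistence; the path is a placeholder, and |pipeline| is assumed to be built only from stages whose save/load support is implemented in 1.6:

    import org.apache.spark.ml.Pipeline

    // Persist the pipeline definition, then restore it
    pipeline.write.overwrite().save("/tmp/spark-1.6-pipeline")
    val restored = Pipeline.load("/tmp/spark-1.6-pipeline")
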
Misc improvements
* SPARK-7685
<https://issues.apache.org/jira/browse/SPARK-7685>,
SPARK-9642
<https://issues.apache.org/jira/browse/SPARK-9642>
Instance weights for GLMs - Logistic and
Linear Regression can take instance weights
* SPARK-10384
<https://issues.apache.org/jira/browse/SPARK-10384>,
SPARK-10385
<https://issues.apache.org/jira/browse/SPARK-10385>
Univariate and bivariate statistics in
DataFrames - Variance, stddev,
correlations, etc.
* SPARK-10117
<https://issues.apache.org/jira/browse/SPARK-10117>
LIBSVM data source - LIBSVM as a SQL
data source; see the sketch after
this list.
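
A minimal sketch of the LIBSVM data source; the path is a placeholder, and |sqlContext| is assumed to exist:

    // Produces a DataFrame with "label" and "features" columns
    val df = sqlContext.read.format("libsvm")
      .load("data/mllib/sample_libsvm_data.txt")
    df.show()
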
Documentation improvements
* SPARK-7751
<https://issues.apache.org/jira/browse/SPARK-7751>
@since versions - Documentation includes
the initial version in which classes and
methods were added
* SPARK-11337
<https://issues.apache.org/jira/browse/SPARK-11337>
Testable example code - Automated testing
for code in user guide examples
Deprecations
* In spark.mllib.clustering.KMeans, the
"runs" parameter has been deprecated.
* In
spark.ml.classification.LogisticRegressionModel
and
spark.ml.regression.LinearRegressionModel,
the "weights" field has been deprecated,
in favor of the new name "coefficients."
This helps disambiguate them from the
instance (row) weights given to
algorithms.
Changes of behavior
* spark.mllib.tree.GradientBoostedTrees
validationTol has changed semantics in
1.6. Previously, it was a threshold for
absolute change in error. Now, it
resembles the behavior of GradientDescent
convergenceTol: For large errors, it uses
relative error (relative to the previous
error); for small errors (< 0.01), it uses
absolute error.
* spark.ml.feature.RegexTokenizer:
Previously, it did not convert strings to
lowercase before tokenizing. Now, it
converts to lowercase by default, with an
option not to. This matches the behavior
of the simpler Tokenizer transformer.
* Spark SQL's partition discovery has been
changed to only discover partition
directories that are children of the given
path. (i.e. if |path="/my/data/x=1"| then
|x=1| will no longer be considered a
partition but only children of |x=1|.)
This behavior can be overridden by
manually specifying the |basePath| that
partition discovery should start with
(SPARK-11678
<https://issues.apache.org/jira/browse/SPARK-11678>);
see the sketch after this list.
* When casting a value of an integral type
to timestamp (e.g. casting a long value to
timestamp), the value is treated as being
in seconds instead of milliseconds
(SPARK-11724
<https://issues.apache.org/jira/browse/SPARK-11724>),
also illustrated after this list.
* With the improved query planner for
queries having distinct aggregations
(SPARK-9241
<https://issues.apache.org/jira/browse/SPARK-9241>),
the plan of a query having a single
distinct aggregation has been changed to a
more robust version. To switch back to the
plan generated by Spark 1.5's planner,
please set
|spark.sql.specializeSingleDistinctAggPlanning| to
|true| (SPARK-12077
<https://issues.apache.org/jira/browse/SPARK-12077>).
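
A minimal sketch of the two Spark SQL behavior changes above; paths and values are illustrative, and |sqlContext| is assumed to exist:

    // Partition discovery: pass basePath so that x=1 is still
    // treated as a partition column when reading a subdirectory
    val df = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")

    // Integral-to-timestamp casts now interpret the value
    // as seconds rather than milliseconds
    sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()
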
--
Best Regards
Jeff Zhang