+1 (non-binding, of course)
1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 29:32 min
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (IPython 4.0)
2.0 Spark version is 1.6.0
2.1. statistics (min, max, mean, Pearson, Spearman) OK (sketch below)
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
Center and scale OK
2.5. RDD operations OK
State of the Union texts - map/reduce, filter, sortByKey (word count)
2.6. Recommendation (MovieLens medium dataset, ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with itertools OK
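
For reference, a minimal PySpark sketch of the kind of checks run for 2.1 and 2.4; the tiny in-line dataset and app name are only illustrations, not the data actually used:

# Sketch of the statistics (2.1) and KMeans (2.4) smoke tests; illustrative data only.
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="rc3-mllib-smoke-test")

data = sc.parallelize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 41.0]])

# Column-wise min/max/mean
summary = Statistics.colStats(data)
print(summary.min(), summary.max(), summary.mean())

# Pearson and Spearman correlation between two series
x = sc.parallelize([1.0, 2.0, 3.0, 4.0])
y = sc.parallelize([10.0, 20.0, 30.0, 41.0])
print(Statistics.corr(x, y, method="pearson"))
print(Statistics.corr(x, y, method="spearman"))

# KMeans with k=2
model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)

sc.stop()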
3. Scala - MLlib
3.1. statistics (min, max, mean, Pearson, Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (MovieLens medium dataset, ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK (PySpark equivalent sketched below)
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
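
For anyone repeating 3.6-3.8 from Python rather than Scala, the same save/read/join flow looks roughly like this; run in the pyspark shell (sc and sqlContext predefined), with placeholder rows and /tmp paths standing in for the real Orders/OrderDetails data:

# Placeholder data standing in for the Orders/OrderDetails tables used in 3.8.
orders = sqlContext.createDataFrame(
    [(1, "Germany"), (2, "USA")], ["OrderID", "ShipCountry"])
details = sqlContext.createDataFrame(
    [(1, 14.0, 12, 0.0), (2, 9.8, 10, 0.1)],
    ["OrderID", "UnitPrice", "Qty", "Discount"])

# 3.6: save as Parquet
orders.write.mode("overwrite").parquet("/tmp/orders.parquet")
details.write.mode("overwrite").parquet("/tmp/order_details.parquet")

# 3.7: read back and register temp tables
sqlContext.read.parquet("/tmp/orders.parquet").registerTempTable("Orders")
sqlContext.read.parquet("/tmp/order_details.parquet").registerTempTable("OrderDetails")

# 3.8: SQL join
result = sqlContext.sql(
    "SELECT OrderDetails.OrderID, ShipCountry, UnitPrice, Qty, Discount "
    "FROM Orders INNER JOIN OrderDetails "
    "ON Orders.OrderID = OrderDetails.OrderID")
result.show()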
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
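
The 4.1 query needs a registered table first; a minimal version of that check might look like this (the people rows here are invented):

# In the pyspark shell; sc and sqlContext are predefined. Invented sample rows.
people = sqlContext.createDataFrame(
    [("Alice", "WA"), ("Bob", "CA")], ["name", "State"])
people.registerTempTable("people")
result = sqlContext.sql("SELECT * FROM people WHERE State = 'WA'")
result.show()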
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.3.0)
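
A minimal read/write round trip with the spark-csv package, after starting pyspark with the --packages flag above (paths are placeholders):

# bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/tmp/input.csv"))          # placeholder path
df.printSchema()
(df.write
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/tmp/output_csv"))            # placeholder path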
6.0. DataFrames
6.1. cast, dtypes OK
6.2. groupBy, avg, crosstab, corr, isNull, na.drop OK
6.3. All joins, sql, set operations, udf OK
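
And a rough sketch of the 6.x DataFrame checks, again in the pyspark shell; column names and rows are invented for illustration:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame(
    [("a", "1", 10), ("a", "3", 30), ("b", "2", None)], ["k", "v", "x"])

# 6.1 cast / dtypes
df2 = df.withColumn("v_int", df["v"].cast(IntegerType()))
print(df2.dtypes)

# 6.2 groupBy/avg, crosstab, corr, isNull, na.drop
df2.groupBy("k").avg("x").show()
df2.crosstab("k", "v").show()
print(df2.na.drop().stat.corr("v_int", "x"))
df2.filter(df2["x"].isNull()).show()
df2.na.drop().show()

# 6.3 joins, sql, set operations, udf
other = sqlContext.createDataFrame([("a", 1), ("c", 3)], ["k", "y"])
df2.join(other, "k", "outer").show()
df2.registerTempTable("t")
sqlContext.sql("SELECT k, COUNT(*) FROM t GROUP BY k").show()
df2.select("k").unionAll(other.select("k")).show()
double_it = F.udf(lambda n: None if n is None else n * 2, IntegerType())
df2.select(double_it(df2["v_int"])).show()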
Cheers & Good work guys
<k/>
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <[email protected]>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated default database (i.e. USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> directly on serialized binary data and uses code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
> Per-operator
> Metrics for SQL Execution - Display statistics on a per-operator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unnest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartesian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
> Datasource
> API Avoid Double Filter - When implementing a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - storing partitioning and ordering schemes in
> In-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Initial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supersedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and support transparent deaggregation of KPL-aggregated
> records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
> Kinesis
> message handler function - Allows an arbitrary function to be applied
> to a Kinesis record in the Kinesis receiver to customize what data
> is to be stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streaming Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
> MLlib
>
> New algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.ml algorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data source
>
> Documentation improvements
>
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
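
For the partition-discovery change above, a quick way to see the new behavior from pyspark (the paths here are just placeholders for a partitioned directory layout):

# /my/data/x=1 is a placeholder partitioned directory.
df1 = sqlContext.read.parquet("/my/data/x=1")
# 1.6 default: x is no longer discovered as a partition column.
df2 = (sqlContext.read.option("basePath", "/my/data")
       .parquet("/my/data/x=1"))
# With basePath set, x=1 is treated as a partition again (SPARK-11678).
print(df1.columns, df2.columns)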
>
>