Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

chester Tue, 01 Sep 2015 07:14:39 -0700

Thanks for the explanation. Since 1.5.0 rc3 is not yet released, I assume it 
would be cut from the 1.5 branch. Doesn't that bring in 1.5.1 snapshot code?


The reason I am asking is that I would like to know: if I want to 
build 1.5.0 myself, which commit should I use?

> On Sep 1, 2015, at 6:57 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
> released. In practice I think it's simpler to follow the defaults of
> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
> any 1.5.0-rc is released. It doesn't affect later RCs. This has
> nothing to do with what commits go into 1.5.0; it's an ignorable
> detail of the version in POMs in the source tree, which don't mean
> much anyway as the source tree itself is not a released version.
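
To illustrate the versioning described above, here is a minimal Python sketch. It assumes the release plugin default mentioned above (bump the last numeric component and append -SNAPSHOT) and the v<version>-rc<N> tag naming used in the RC2 vote email. The function names next_development_version and rc_tag are hypothetical and are not part of Maven or Spark.

```python
def next_development_version(release_version: str) -> str:
    """Return the SNAPSHOT version the branch head moves to after a release
    candidate is cut, assuming the last numeric component is bumped."""
    parts = release_version.split(".")
    if not parts or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a dotted numeric version, got {release_version!r}")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts) + "-SNAPSHOT"


def rc_tag(release_version: str, rc_number: int) -> str:
    """Return the tag name for a release candidate (naming assumed from v1.5.0-rc2)."""
    return f"v{release_version}-rc{rc_number}"


# The branch head carries the next development version...
print(next_development_version("1.5.0"))  # 1.5.1-SNAPSHOT
# ...while each release candidate is still tagged with the release version.
print(rc_tag("1.5.0", 2))  # v1.5.0-rc2
```

The point of the sketch is that the two values are computed independently: the SNAPSHOT string on the branch says nothing about which commits end up under a given RC tag.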
> 
>> On Tue, Sep 1, 2015 at 2:48 PM,  <ches...@alpinenow.com> wrote:
>> Sorry, I still don't follow. I assume the release would be built from 1.5.0 
>> before moving to 1.5.1. Are you saying 1.5.0 rc3 could be built from the 1.5.1 
>> snapshot during the release? Or would 1.5.0 rc3 be built from the last commit of 
>> 1.5.0 (before the change to the 1.5.1 snapshot)?
>> 
>>> On Sep 1, 2015, at 1:52 AM, Sean Owen <so...@cloudera.com> wrote:
>>> 
>>> That's correct for the 1.5 branch, right? This doesn't mean that the
>>> next RC would have this value. You choose the release version during
>>> the release process.
>>> 
>>>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <ches...@alpinenow.com> wrote:
>>>> It seems that the GitHub branch-1.5 has already changed the version to 
>>>> 1.5.1-SNAPSHOT.
>>>> 
>>>> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
>>>> 
>>>> Chester
>>>> 
>>>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>> 
>>>>> I'm going to -1 the release myself since the issue @yhuai identified is
>>>>> pretty serious. It basically OOMs the driver for reading any files with a
>>>>> large number of partitions. Looks like the patch for that has already been
>>>>> merged.
>>>>> 
>>>>> I'm going to cut rc3 momentarily.
>>>>> 
>>>>> 
>>>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sandy.r...@cloudera.com>
>>>>> wrote:
>>>>>> 
>>>>>> +1 (non-binding)
>>>>>> built from source and ran some jobs against YARN
>>>>>> 
>>>>>> -Sandy
>>>>>> 
>>>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <vaquar.k...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> +1 (1.5.0 RC2) Compiled on Windows with YARN.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Vaquar khan
>>>>>>> 
>>>>>>> +1 (non-binding, of course)
>>>>>>> 
>>>>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>>>>>    mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>>>>> 2. Tested pyspark, mllib
>>>>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>>>>> 2.3. Decision Tree, Naive Bayes OK
>>>>>>> 2.4. KMeans OK
>>>>>>>      Center And Scale OK
>>>>>>> 2.5. RDD operations OK
>>>>>>>     State of the Union Texts - MapReduce, Filter, sortByKey (word count)
>>>>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>>      Model evaluation/optimization (rank, numIter, lambda) with itertools OK
>>>>>>> 3. Scala - MLlib
>>>>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>> 3.2. LinearRegressionWithSGD OK
>>>>>>> 3.3. Decision Tree OK
>>>>>>> 3.4. KMeans OK
>>>>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>> 3.6. saveAsParquetFile OK
>>>>>>> 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile, registerTempTable, sql OK
>>>>>>> 3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>>>>> 4.0. Spark SQL from Python OK
>>>>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
>>>>>>> 5.0. Packages
>>>>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>>>>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>>>>>> 6.0. DataFrames
>>>>>>> 6.1. cast,dtypes OK
>>>>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>>>>> 6.3. joins,sql,set operations,udf OK
>>>>>>> 
>>>>>>> Cheers
>>>>>>> <k/>
>>>>>>> 
>>>>>>> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin <r...@databricks.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC
>>>>>>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>>> 
>>>>>>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>> 
>>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The tag to be voted on is v1.5.0-rc2:
>>>>>>>> 
>>>>>>>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>>>>>>> 
>>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>>>>>>> 
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>> 
>>>>>>>> The staging repository for this release (published as 1.5.0-rc2) can be
>>>>>>>> found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>>>>>>> 
>>>>>>>> The staging repository for this release (published as 1.5.0) can be
>>>>>>>> found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>>>>>>> 
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> =======================================
>>>>>>>> How can I help test this release?
>>>>>>>> =======================================
>>>>>>>> If you are a Spark user, you can help us test this release by taking an
>>>>>>>> existing Spark workload and running on this release candidate, then
>>>>>>>> reporting any regressions.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ================================================
>>>>>>>> What justifies a -1 vote for this release?
>>>>>>>> ================================================
>>>>>>>> This vote is happening towards the end of the 1.5 QA period, so -1
>>>>>>>> votes should only occur for significant regressions from 1.4. Bugs
>>>>>>>> already present in 1.4, minor regressions, or bugs related to new
>>>>>>>> features will not block this release.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ===============================================================
>>>>>>>> What should happen to JIRA tickets still targeting 1.5.0?
>>>>>>>> ===============================================================
>>>>>>>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>>>>>>>> branch-1.5, since documentation will be packaged separately from the
>>>>>>>> release.
>>>>>>>> 2. New features for non-alpha-modules should target 1.6+.
>>>>>>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the
>>>>>>>> target version.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ==================================================
>>>>>>>> Major changes to help you focus your testing
>>>>>>>> ==================================================
>>>>>>>> 
>>>>>>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>>>>>>>> contributors. I've curated a list of important changes for 1.5. For the
>>>>>>>> complete list, please refer to Apache JIRA changelog.
>>>>>>>> 
>>>>>>>> RDD/DataFrame/SQL APIs
>>>>>>>> 
>>>>>>>> - New UDAF interface
>>>>>>>> - DataFrame hints for broadcast join
>>>>>>>> - expr function for turning a SQL expression into DataFrame column
>>>>>>>> - Improved support for NaN values
>>>>>>>> - StructType now supports ordering
>>>>>>>> - TimestampType precision is reduced to 1us
>>>>>>>> - 100 new built-in expressions, including date/time, string, math
>>>>>>>> - memory and local disk only checkpointing
>>>>>>>> 
>>>>>>>> DataFrame/SQL Backend Execution
>>>>>>>> 
>>>>>>>> - Code generation on by default
>>>>>>>> - Improved join, aggregation, shuffle, sorting with cache friendly
>>>>>>>> algorithms and external algorithms
>>>>>>>> - Improved window function performance
>>>>>>>> - Better metrics instrumentation and reporting for DF/SQL execution
>>>>>>>> plans
>>>>>>>> 
>>>>>>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>>>>>>> 
>>>>>>>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>>>>>>>> Standalone)
>>>>>>>> - Improved Mesos support (framework authentication, roles, dynamic
>>>>>>>> allocation, constraints)
>>>>>>>> - Improved YARN support (dynamic allocation with preferred locations)
>>>>>>>> - Improved Hive support (metastore partition pruning, metastore
>>>>>>>> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
>>>>>>>> - Support persisting data in Hive compatible format in metastore
>>>>>>>> - Support data partitioning for JSON data sources
>>>>>>>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>>>>>>>> metadata discovery and schema merging, support reading non-standard
>>>>>>>> legacy Parquet files generated by other libraries)
>>>>>>>> - Faster and more robust dynamic partition insert
>>>>>>>> - DataSourceRegister interface for external data sources to specify
>>>>>>>> short names
>>>>>>>> 
>>>>>>>> SparkR
>>>>>>>> 
>>>>>>>> - YARN cluster mode in R
>>>>>>>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>>>>>>>> regularization
>>>>>>>> - Improved error messages
>>>>>>>> - Aliases to make DataFrame functions more R-like
>>>>>>>> 
>>>>>>>> Streaming
>>>>>>>> 
>>>>>>>> - Backpressure for handling bursty input streams.
>>>>>>>> - Improved Python support for streaming sources (Kafka offsets,
>>>>>>>> Kinesis, MQTT, Flume)
>>>>>>>> - Improved Python streaming machine learning algorithms (K-Means,
>>>>>>>> linear regression, logistic regression)
>>>>>>>> - Native reliable Kinesis stream support
>>>>>>>> - Input metadata like Kafka offsets made visible in the batch details
>>>>>>>> UI
>>>>>>>> - Better load balancing and scheduling of receivers across cluster
>>>>>>>> - Include streaming storage in web UI
>>>>>>>> 
>>>>>>>> Machine Learning and Advanced Analytics
>>>>>>>> 
>>>>>>>> - Feature transformers: CountVectorizer, Discrete Cosine
>>>>>>>> transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover,
>>>>>>>> and VectorSlicer.
>>>>>>>> - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
>>>>>>>> regression.
>>>>>>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
>>>>>>>> sequential pattern mining, association rule generation, 1-sample
>>>>>>>> Kolmogorov-Smirnov test.
>>>>>>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>>>>>>>> - More efficient Pregel API implementation for GraphX
>>>>>>>> - Model summary for linear and logistic regression.
>>>>>>>> - Python API: distributed matrices, streaming k-means and linear
>>>>>>>> models, LDA, power iteration clustering, etc.
>>>>>>>> - Tuning and evaluation: train-validation split and multiclass
>>>>>>>> classification evaluator.
>>>>>>>> - Documentation: document the release version of public API methods
>>>>>>>> 

