Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Tom Graves
+1. Tested on Yarn with Hadoop 2.6. 
A few of the things tested: pyspark, hive integration, aux shuffle handler, 
history server, basic submit cli behavior, distributed cache behavior, cluster 
and client mode...
Tom 


On Tuesday, September 1, 2015 3:42 PM, Reynold Xin wrote:

Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.5.0-rc3:
https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (published as 1.5.0-rc3) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1143/

The staging repository for this release (published as 1.5.0) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1142/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/

===========================================
How can I help test this release?
===========================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then reporting
any regressions.

-------------------------------------------
What justifies a -1 vote for this release?
-------------------------------------------
This vote is happening towards the end of the 1.5 QA period, so -1 votes should
only occur for significant regressions from 1.4. Bugs already present in 1.4,
minor regressions, or bugs related to new features will not block this release.

===========================================
What should happen to JIRA tickets still targeting 1.5.0?
===========================================
1. It is OK for documentation patches to target 1.5.0 and still go into
branch-1.5, since documentation will be packaged separately from the release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
version.

==========================================
Major changes to help you focus your testing
==========================================

As of today, Spark 1.5 contains more than 1000 commits from 220+ contributors.
I've curated a list of important changes for 1.5. For the complete list, please
refer to Apache JIRA changelog.
RDD/DataFrame/SQL APIs

- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing
DataFrame/SQL Backend Execution

- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans
Data Sources, Hive, Hadoop, Mesos and Cluster Management

- Dynamic allocation support in all resource managers (Mesos, YARN, Standalone)
- Improved Mesos support (framework authentication, roles, dynamic allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata discovery and schema merging, support reading non-standard legacy Parquet files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short names
SparkR

- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like
Streaming

- Backpressure for handling bursty input streams.
- Improved Python support for streaming sources (Kafka offsets, Kinesis, MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made 

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Burak Yavuz
+1. Tested complex R package support (Scala + R code); the BLAS and DataFrame
fixes look good.

Burak

On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman  wrote:

> Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
> Standalone without any problems. Re-tested dynamic allocation specifically.
>
> "Lost executor" messages are still an annoyance since they're expected to
> occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
> however there's already a JIRA ticket for it:
> https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
> filter these messages out in log4j properties for this release!
>
> Mark.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread saurfang
+1. Compiled on Windows with YARN and Hive. Tested Tungsten aggregation and
observed similar (good) performance compared to 1.4 with unsafe on. Ran a
few workloads and tested the SparkSQL thrift server.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13953.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



EOFException on History server reading in progress lz4

2015-09-03 Thread andrew.rowson
I'm trying to solve a problem of the history server spamming my logs with
EOFExceptions when it tries to read a history file from HDFS that is both
lz4 compressed and incomplete. The actual exception is:

java.io.EOFException: Stream ended prematurely
    at net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:218)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:192)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:117)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
    at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
    at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:443)
    at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:278)
    at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:275)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:275)
    at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:209)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The bit I'm struggling with is handling this in ReplayListenerBus.scala - I
tried adding the following to the try/catch:

    case eof: java.io.EOFException =>
      logWarning(s"EOFException (probably due to incomplete lz4) at $sourceName", eof)

but this never seems to get triggered - it still dumps the whole exception
out to the log.
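
One thing worth double-checking (just a guess on my part, not something I've
confirmed against the 1.5 sources): java.io.EOFException is a subclass of
java.io.IOException, so in a Scala try/catch an EOFException case only matches
if it appears before any broader IOException case. A minimal, self-contained
sketch of that ordering behaviour:

import java.io.{EOFException, IOException}

object CatchOrderDemo {
  def main(args: Array[String]): Unit = {
    try {
      throw new EOFException("Stream ended prematurely")
    } catch {
      // EOFException extends IOException, so this case only fires because it
      // is listed before the broader IOException case; swap the two cases and
      // this branch is never reached.
      case eof: EOFException => println(s"caught as EOFException: ${eof.getMessage}")
      case ioe: IOException  => println(s"caught as IOException: ${ioe.getMessage}")
    }
  }
}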

I feel like there's something basic I'm missing for the exception not to be
caught by the try/catch in ReplayListenerBus. Can anyone point me in the
right direction?

Thanks,

Andrew




Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Michael Armbrust
+1 Ran TPC-DS and ported several jobs over to 1.5

On Thu, Sep 3, 2015 at 9:57 AM, Burak Yavuz  wrote:

> +1. Tested complex R package support (Scala + R code), BLAS and DataFrame
> fixes good.
>
> Burak
>
> On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman 
> wrote:
>
>> Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
>> Standalone without any problems. Re-tested dynamic allocation
>> specifically.
>>
>> "Lost executor" messages are still an annoyance since they're expected to
>> occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
>> however there's already a JIRA ticket for it:
>> https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
>> filter these messages out in log4j properties for this release!
>>
>> Mark.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Davies Liu
+1, built 1.5 from source and ran TPC-DS locally and on clusters, and ran
performance benchmarks for aggregation and join at different scales; all
worked well.

On Thu, Sep 3, 2015 at 10:05 AM, Michael Armbrust
 wrote:
> +1 Ran TPC-DS and ported several jobs over to 1.5
>
> On Thu, Sep 3, 2015 at 9:57 AM, Burak Yavuz  wrote:
>>
>> +1. Tested complex R package support (Scala + R code), BLAS and DataFrame
>> fixes good.
>>
>> Burak
>>
>> On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman 
>> wrote:
>>>
>>> Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
>>> Standalone without any problems. Re-tested dynamic allocation
>>> specifically.
>>>
>>> "Lost executor" messages are still an annoyance since they're expected to
>>> occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
>>> however there's already a JIRA ticket for it:
>>> https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
>>> filter these messages out in log4j properties for this release!
>>>
>>> Mark.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread mkhaitman
Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
Standalone without any problems. Re-tested dynamic allocation specifically. 

"Lost executor" messages are still an annoyance since they're expected to
occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
however there's already a JIRA ticket for it:
https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
filter these messages out in log4j properties for this release!

Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL sort by and collect by in multiple partitions

2015-09-03 Thread Vishnu Kumar
Hi,

Yes, this is intended behavior: "ORDER BY" guarantees a total order across the
whole output, while "SORT BY" only guarantees the order within each partition.
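
For example, here is a minimal, self-contained sketch of the difference (the
table name, column name, and partition count are made up for illustration; the
exact interleaving of the SORT BY result depends on how the rows land in the
partitions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SortByVsOrderBy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("sortby-vs-orderby"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // 1..12 spread over 4 partitions, registered as a temp table "nums".
    sc.parallelize(1 to 12, 4).map(Tuple1(_)).toDF("id").registerTempTable("nums")

    // ORDER BY: one global ordering across all partitions.
    sqlContext.sql("SELECT id FROM nums ORDER BY id DESC").collect()
    // -> 12, 11, 10, ..., 1

    // SORT BY: rows are ordered only within each partition, so collect()
    // returns each partition's sorted run one after another rather than a
    // single globally sorted sequence.
    sqlContext.sql("SELECT id FROM nums SORT BY id DESC").collect()

    sc.stop()
  }
}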


Vishnu

On Thu, Sep 3, 2015 at 10:49 AM, Niranda Perera 
wrote:

> Hi all,
>
> I have been using sort by and order by in spark sql and I observed the
> following
>
> when using SORT BY and collect results, the results are getting sorted
> partition by partition.
> example:
> if we have 1, 2, ... , 12 and 4 partitions and I want to sort it in
> descending order,
> partition 0 (p0) would have 12, 8, 4
> p1 = 11, 7, 3
> p2 = 10, 6, 2
> p3 = 9, 5, 1
>
> so collect() would return 12, 8, 4, 11, 7, 3, 10, 6, 2, 9, 5, 1
>
> BUT when I use ORDER BY and collect results
> p0 = 12, 11, 10
> p1 =  9, 8, 7
> .
> so collect() would return 12, 11, .., 1 which is the desirable result.
>
> is this the intended behavior of SORT BY and ORDER BY or is there
> something I'm missing?
>
> cheers
>
> --
> Niranda
> @n1r44 
> https://pythagoreanscript.wordpress.com/
>


Re: [HELP] Spark 1.4.1 tasks take ridiculously long time to complete

2015-09-03 Thread robineast
I would suggest you move this to the Spark user list; this is the development
list, for discussing the development of Spark itself. It would also help if you
could give some more information about what you are trying to do, e.g. what
code you are running, how you submitted the job (spark-shell, spark-submit),
and what sort of cluster you are using (standalone, YARN, Mesos).





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HELP-Spark-1-4-1-tasks-take-ridiculously-long-time-to-complete-tp13942p13946.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Code generation for GPU

2015-09-03 Thread Reynold Xin
See responses inline.

On Thu, Sep 3, 2015 at 1:58 AM, kiran lonikar  wrote:

> Hi,
>
>    1. I found where the code generation happens in the Spark code from the
>    blogs
>    https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html,
>    https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
>    and
>    https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.
>    However, I could not find where the generated code is executed. A major
>    part of my changes will be there, since this executor will now have to send
>    vectors of columns to GPU RAM, invoke execution, and get the results back
>    to CPU RAM. Thus, the existing executor will change significantly.
>
> The code generation generates Java classes that have an apply method, and
the apply method is called in the operators.

E.g. GenerateUnsafeProjection returns a Projection class (which is just a
class with an apply method), and TungstenProject calls that class.
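
To make the shape of that contract concrete, here is a deliberately simplified,
self-contained sketch - these are not Spark's actual internal classes or
signatures (the real GenerateUnsafeProjection works on Catalyst expressions and
InternalRow), just the "codegen produces an apply method, the operator calls
it" split:

// Stand-in for the generated class: an object with an apply method.
trait Projection extends (Seq[Any] => Seq[Any])

object ExampleCodegen {
  // Plays the role of GenerateUnsafeProjection: "compile" a column selection
  // down to a Projection whose apply method does the per-row work.
  def generate(columnsToKeep: Seq[Int]): Projection = new Projection {
    override def apply(row: Seq[Any]): Seq[Any] = columnsToKeep.map(row)
  }
}

object ExampleOperator {
  // Plays the role of TungstenProject: just calls the generated apply per row.
  def execute(rows: Iterator[Seq[Any]], projection: Projection): Iterator[Seq[Any]] =
    rows.map(projection)
}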



>
>    2. On the Project Tungsten blog, in the third Code Generation section, it
>    is mentioned that you plan to increase the level of code generation from
>    record-at-a-time expression evaluation to vectorized expression evaluation.
>    Has this been implemented? If not, how do I implement this? I will need
>    access to columnar ByteBuffer objects in DataFrame to do this. Having
>    row-by-row access to data will defeat this exercise. In particular, I need
>    access to
>    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
>    in the executor of the generated code.
>
>
This is future work. You'd need to create batches of rows or columns. This
is a pretty major refactoring though.
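
Purely as an illustration of that difference (again, not Spark code - just the
row-at-a-time vs. batch-at-a-time contract):

// Row-at-a-time contract: one call (and one boxed result) per record.
trait RowExprEval { def eval(row: Array[Any]): Any }

// Batch-at-a-time contract: one call per batch of column vectors, which is
// what would let a GPU (or SIMD) backend amortize per-call overhead and copy
// whole columns to device memory at once.
trait BatchExprEval { def eval(columns: Array[Array[Double]], numRows: Int): Array[Double] }

// Example batched evaluator: col0 + col1 computed over the whole batch in one call.
object AddColumns extends BatchExprEval {
  def eval(columns: Array[Array[Double]], numRows: Int): Array[Double] =
    Array.tabulate(numRows)(i => columns(0)(i) + columns(1)(i))
}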


>
>    3. One thing that confuses me is the changes from 1.4 to 1.5, possibly
>    due to JIRA https://issues.apache.org/jira/browse/SPARK-7956 and pull
>    request https://github.com/apache/spark/pull/6479/files. This changed the
>    code generation from quasiquotes (q) to the string s operator, which makes
>    it simpler for me to generate OpenCL code, which is string based. The
>    question is: is this branch stable now? Should I make my changes on Spark
>    1.4, Spark 1.5, or the master branch?
>
> In general Spark development velocity is pretty high, as we make a lot of
changes to internals every release. If I were you, I'd use either master or
branch-1.5 for your prototyping.


>
>    4. How do I tune the batch size (number of rows in the ByteBuffer)? Is
>    it through the property spark.sql.inMemoryColumnarStorage.batchSize?
>
>
> Thanks in anticipation,
>
> Kiran
> PS:
>
> Other things I found useful were:
>
> *Spark DataFrames*: https://www.brighttalk.com/webcast/12891/166495
> *Apache Spark 1.5*: https://www.brighttalk.com/webcast/12891/168177
>
> The links to JavaCL/ScalaCL:
>
> *Library to execute OpenCL code through Java*:
> https://github.com/nativelibs4java/JavaCL
> *Library to convert Scala code to OpenCL and execute on GPUs*:
> https://github.com/nativelibs4java/ScalaCL
>
>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Denny Lee
+1

Distinct count test is blazing fast - awesome!

On Thu, Sep 3, 2015 at 8:21 PM Krishna Sankar  wrote:

> +?
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Laso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK
> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
> com.databricks:spark-csv_2.11:1.2.0 worked)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
>
> Two Problems:
>
> 1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’;
> previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)').
> So programs that depend on the case of the synthetic column names would
> fail.
> 2. orders_3.groupBy("Year","Month").sum('Total').show()
> fails with the error ‘java.io.IOException: Unable to acquire 4194304
> bytes of memory’
> orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with
> the same error
> Is this a known bug ?
> Cheers
> 
> P.S: Sorry for the spam, forgot Reply All
>
> On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>
>> The tag to be voted on is v1.5.0-rc3:
>>
>> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (published as 1.5.0-rc3) can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1143/
>>
>> The staging repository for this release (published as 1.5.0) can be found
>> at:
>> https://repository.apache.org/content/repositories/orgapachespark-1142/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>> should only occur for significant regressions from 1.4. Bugs already
>> present in 1.4, minor regressions, or bugs related to new features will not
>> block this release.
>>
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.0?
>> ===
>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>> branch-1.5, since documentations will be packaged separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.6+.
>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
>> version.
>>
>>
>> ==
>> Major changes to help you focus your testing
>> ==
>>
>> As of today, Spark 1.5 contains 

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Krishna Sankar
+?

1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Laso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK

Two Problems:

1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’;
previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)').
So programs that depend on the case of the synthetic column names would
fail.
2. orders_3.groupBy("Year","Month").sum('Total').show()
fails with the error ‘java.io.IOException: Unable to acquire 4194304
bytes of memory’
orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with
the same error
Is this a known bug ?
Cheers

P.S: Sorry for the spam, forgot Reply All

On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc3:
>
> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc3) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1143/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1142/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> 
> What justifies a -1 vote for this release?
> 
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already
> present in 1.4, minor regressions, or bugs related to new features will not
> block this release.
>
>
> ===
> What should happen to JIRA tickets still targeting 1.5.0?
> ===
> 1. It is OK for documentation patches to target 1.5.0 and still go into
> branch-1.5, since documentations will be packaged separately from the
> release.
> 2. New features for non-alpha-modules should target 1.6+.
> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
> version.
>
>
> ==
> Major changes to help you focus your testing
> ==
>
> As of today, Spark 1.5 contains more than 1000 commits from 220+
> contributors. I've curated a list of important changes for 1.5. For the
> complete list, please refer to Apache JIRA changelog.
>
> RDD/DataFrame/SQL APIs
>
> - New UDAF interface
> - DataFrame hints for broadcast join
> - expr function for turning a SQL expression