Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread james
I saw a new "spark.shuffle.manager=tungsten-sort" option implemented in
https://issues.apache.org/jira/browse/SPARK-7081, but I can't find its
corresponding description in
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html
(currently only the 'sort' and 'hash' options are documented there).
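
For anyone else testing this, a minimal sketch of how the option would be set,
assuming a local PySpark session (the master URL and app name are placeholders,
not from the release docs):

    from pyspark import SparkConf, SparkContext

    # Enable the Tungsten sort-based shuffle added in SPARK-7081.
    # "local[*]" and the app name are illustrative placeholders.
    conf = (SparkConf()
            .setAppName("tungsten-sort-check")
            .setMaster("local[*]")
            .set("spark.shuffle.manager", "tungsten-sort"))
    sc = SparkContext(conf=conf)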






RE: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread Cheng, Hao
Not sure if it’s too late, but we found a critical bug at 
https://issues.apache.org/jira/browse/SPARK-10466
UnsafeRow ser/de causes an assertion error, particularly for sort-based shuffle
with data spill. This is not acceptable, as data spill is very common in large
table joins.
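
For context, the failing case is a join large enough that the sort-based
shuffle spills to disk; a rough sketch of that kind of query (the table and
column names are hypothetical, not the actual reproduction from the JIRA):

    # Hypothetical illustration only: a large equi-join whose sort-based
    # shuffle spills to disk exercises UnsafeRow ser/de on the spill path.
    # Assumes a SQLContext (e.g. the pyspark shell's sqlContext) and two
    # registered tables.
    orders = sqlContext.table("orders")
    details = sqlContext.table("order_details")
    joined = orders.join(details, orders.OrderID == details.OrderID)
    joined.count()   # forces the shuffle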

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Saturday, September 5, 2015 3:30 PM
To: Krishna Sankar
Cc: Davies Liu; Yin Huai; Tom Graves; dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Thanks, Krishna, for the report. We should also fix the problem you hit with the
Python UDFs in 1.6.

I'm going to close this vote now. Thanks everybody for voting. This vote passes 
with 8 +1 votes (3 binding) and no 0 or -1 votes.

+1:
Reynold Xin*
Tom Graves*
Burak Yavuz
Michael Armbrust*
Davies Liu
Forest Fang
Krishna Sankar
Denny Lee

0:

-1:


I will work on packaging this release in the next few days.



On Fri, Sep 4, 2015 at 8:08 PM, Krishna Sankar wrote:
Excellent & thanks, Davies. Yep, it now runs fine and takes half the time!
This was exactly why I had put in the elapsed time calculations.
And thanks for the new pyspark.sql.functions.

+1 from my side for 1.5.0 RC3.
Cheers


On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu wrote:
Could you update the notebook to use the built-in SQL functions month and year
instead of the Python UDFs? (They were introduced in 1.5.)

Once those two UDFs are removed, it runs successfully and is also much faster.
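
For reference, a sketch of the swap in PySpark 1.5; the DataFrame and column
names (orders, OrderDate) are placeholders for whatever the notebook uses:

    from pyspark.sql import functions as F

    # Before: a Python UDF evaluated row-by-row in the Python worker
    # month_udf = udf(lambda d: d.month, IntegerType())
    # orders = orders.withColumn("Month", month_udf(orders["OrderDate"]))

    # After (1.5+): built-in expressions, no Python round-trip per row
    orders = (orders
              .withColumn("Year", F.year("OrderDate"))
              .withColumn("Month", F.month("OrderDate")))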

On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar wrote:
> Yin,
> It is the
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
> 
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar wrote:
>>>
>>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>> 
>>>
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves wrote:

 The upper/lower case thing is known.
 https://issues.apache.org/jira/browse/SPARK-9550
 I assume it was decided to be OK and it's going to be in the release
 notes, but Reynold or Josh can probably speak to it more.

 Tom



 On Thursday, September 3, 2015 10:21 PM, Krishna Sankar wrote:


 +?

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Lasso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word
 count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
 OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK (see the sketch after this list)
 (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
 com.databricks:spark-csv_2.11:1.2.0 worked)
 6.0. DataFrames
 6.1. cast,dtypes OK
 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
 6.3. All joins,sql,set operations,udf OK
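
 For item 5.1, a minimal sketch of the read/write check with spark-csv 1.2.0;
 the file paths are placeholders:

     # Launched with: bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
     df = (sqlContext.read
           .format("com.databricks.spark.csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("orders.csv"))          # placeholder input path
     (df.write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("orders_out"))             # placeholder output path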

 Two Problems:

 1. The synthetic column names are now lowercase (e.g. ‘sum(OrderPrice)’ where
 it was previously ‘SUM(OrderPrice)’, and ‘avg(Total)’ where it was previously
 ‘AVG(Total)’). So programs that depend on the case of the synthetic column
 names would fail. (A possible alias workaround is sketched after problem 2.)
 2. orders_3.groupBy("Year","Month").sum('Total').show()
 fails with the error ‘java.io.IOException: Unable to acquire 4194304
 bytes of memory’
 orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
 with the same error
 Is this a known bug?
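
 For problem 1, a possible workaround sketch: aliasing the aggregate explicitly
 so downstream code no longer depends on the case of the synthetic column name
 (the alias name here is arbitrary):

     from pyspark.sql import functions as F
     # With an explicit alias it no longer matters whether Spark generates
     # 'sum(Total)' or 'SUM(Total)'.
     orders_3.groupBy("Year", "Month").agg(F.sum("Total").alias("SumTotal")).show()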
 Cheers