Not sure if it’s too late, but we found a critical bug at
https://issues.apache.org/jira/browse/SPARK-10466
UnsafeRow ser/de causes an assertion error, particularly for sort-based
shuffle with data spill. This is not acceptable, as that path is very common
in large table joins.
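A minimal sketch of the kind of job that hits this path (a hedged repro, not
the exact JIRA test case; the config value and table sizes are assumptions
chosen only to force spilling):

    # Shrink the shuffle memory budget so sort-based shuffle spills,
    # then run a shuffle-heavy equi-join.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .setAppName("spark-10466-sketch")
            .set("spark.shuffle.memoryFraction", "0.0001"))  # tiny budget -> spill
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Two tables large enough to exceed the tiny shuffle budget.
    left = sqlContext.range(0, 10 * 1000 * 1000)
    right = sqlContext.range(0, 10 * 1000 * 1000)

    # The join shuffles both sides; once the sort spills, the UnsafeRow
    # ser/de assertion fires on builds without the fix.
    left.join(right, "id").count()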
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Saturday, September 5, 2015 3:30 PM
To: Krishna Sankar
Cc: Davies Liu; Yin Huai; Tom Graves; dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.5.0 (RC3)
Thanks, Krishna, for the report. We should fix the problem you hit with the
Python UDFs in 1.6 too.
I'm going to close this vote now. Thanks everybody for voting. This vote passes
with 8 +1 votes (3 binding) and no 0 or -1 votes.
+1:
Reynold Xin*
Tom Graves*
Burak Yavuz
Michael Armbrust*
Davies Liu
Forest Fang
Krishna Sankar
Denny Lee
0:
-1:
I will work on packaging this release in the next few days.
On Fri, Sep 4, 2015 at 8:08 PM, Krishna Sankar wrote:
Excellent & thanks, Davies. Yep, it now runs fine and takes half the time!
This was exactly why I had put in the elapsed time calculations.
And thanks for the new pyspark.sql.functions.
+1 from my side for 1.5.0 RC3.
Cheers
On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu wrote:
Could you update the notebook to use the built-in SQL functions month and
year instead of the Python UDFs? (They were introduced in 1.5.)
Once those two UDFs are removed, it runs successfully, and much faster too.
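A sketch of the suggested change (the orders DataFrame and OrderDate column
are assumed names, not taken from the notebook):

    from pyspark.sql.functions import month, year

    # Before: Python UDFs, which ship every row through a Python worker.
    # year_udf  = udf(lambda d: d.year,  IntegerType())
    # month_udf = udf(lambda d: d.month, IntegerType())

    # After: built-in expressions (new in 1.5), evaluated inside the JVM.
    orders = orders.withColumn("Year", year(orders["OrderDate"])) \
                   .withColumn("Month", month(orders["OrderDate"]))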
On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar wrote:
> Yin,
> It is the notebook at
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
>
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar wrote:
>>>
>>> Thanks Tom. Interestingly it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>>
>>>
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves wrote:
The upper/lower case thing is known:
https://issues.apache.org/jira/browse/SPARK-9550
I assume it was decided to be OK and it's going to be in the release
notes, but Reynold or Josh can probably speak to it more.
Tom
On Thursday, September 3, 2015 10:21 PM, Krishna Sankar wrote:
+?
1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 26:09 min
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK (see the sketch after 2.6 below)
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
Center And Scale OK
2.5. RDD operations OK
State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
itertools OK
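For reference, a minimal sketch of the kind of check 2.1 runs, with made-up
data (assumes a pyspark shell where sc already exists; this is not the
actual test code):

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    rows = sc.parallelize([Vectors.dense([1.0, 10.0]),
                           Vectors.dense([2.0, 20.0]),
                           Vectors.dense([3.0, 30.0])])

    summary = Statistics.colStats(rows)   # per-column summary statistics
    print(summary.min())
    print(summary.max())
    print(summary.mean())

    print(Statistics.corr(rows, method="pearson"))
    print(Statistics.corr(rows, method="spearman"))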
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work, but
com.databricks:spark-csv_2.11:1.2.0 worked; see the read/write sketch below)
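A sketch of the 5.1 read/write round trip (the file paths are assumptions;
assumes a pyspark shell launched with the coordinates that worked):

    # bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")        # first line holds column names
          .option("inferSchema", "true")   # sample the data to pick types
          .load("orders.csv"))

    (df.write
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .save("orders_out"))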
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
Two Problems:
1. The synthetic column names are now lowercase (e.g. ‘sum(OrderPrice)’
where it was previously ‘SUM(OrderPrice)’, and ‘avg(Total)’ where it was
previously ‘AVG(Total)’), so programs that depend on the case of the
synthetic column names will fail. (A workaround sketch follows below.)
2. orders_3.groupBy("Year","Month").sum('Total').show() fails with the
error ‘java.io.IOException: Unable to acquire 4194304 bytes of memory’.
orders_3.groupBy("CustomerID","Year").sum('Total').show() fails with
the same error.
Is this a known bug?
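For the first problem, one way to make a program robust to the case change
(a sketch reusing orders_3 from the tests above) is to alias each aggregate
explicitly, so nothing depends on the generated column name:

    from pyspark.sql import functions as F

    summed = orders_3.groupBy("Year", "Month") \
                     .agg(F.sum("Total").alias("SumTotal"))
    summed.select("SumTotal").show()   # stable name across releases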
Cheers