Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-20 Thread ๏̯͡๏
It is also a little more evidence for Jonathan's suggestion that there is a null / 0 record that is getting grouped together. To fix this, do I need to run a filter? val viEventsRaw = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) } val viEvents = viEventsRaw.filter { case
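A minimal sketch of the filter being asked about, assuming the goal is to drop records whose item id is null or 0 before the join. Plain Scala collections stand in for the RDDs here, and `ViEvent`, `SkewFilter` and `keyedWithoutCatchAll` are hypothetical names, not taken from the thread:

```scala
// Sketch only: a Seq stands in for the RDD so the example runs without Spark.
object SkewFilter {
  // Hypothetical record type; the thread reads the id with vi.get(14).
  final case class ViEvent(itemId: java.lang.Long, payload: String)

  // Key each record by item id, dropping null and 0 ids so that they
  // cannot all land in one partition during the join.
  def keyedWithoutCatchAll(events: Seq[ViEvent]): Seq[(Long, ViEvent)] =
    events.collect {
      case e if e.itemId != null && e.itemId.longValue != 0L =>
        (e.itemId.longValue, e)
    }
}
```

On a pair RDD the same predicate could be passed to `filter`, for example `viEventsRaw.filter { case (itemId, _) => itemId != 0L }`. Whether 0 really is the catch-all value is an assumption that should be checked against the data first.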

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-20 Thread ๏̯͡๏
After the above changes 1) filter shows this
Tasks
Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors
0 | 1 | 0 | SUCCESS | ANY | 1 / phxaishdc9dn1571.stratus.phx.ebay.com | 2015/04/20 20:55:31 | 7.4 min | 21 s | 129.7

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-14 Thread Imran Rashid
Shuffle write could be a good indication of skew, but it looks like the task in question hasn't generated any shuffle write yet, because it's still working on the shuffle-read side. So I wouldn't read too much into the fact that the shuffle write is 0 for a task that is still running. The

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
I'm not 100% sure of Spark's implementation, but in the MR frameworks it would have a much larger shuffle write size, because that node is dealing with a lot more data and as a result has a lot more to shuffle. 2015-04-13 13:20 GMT-04:00 java8964 java8...@hotmail.com: If it is really due to data

RE: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread java8964
If it is really due to data skew, wouldn't the hanging task have a much bigger Shuffle Write Size? In this case, the shuffle write size for that task is 0, and the rest of its I/O is not much larger than that of the tasks that finished quickly. Is that normal? I am also interested in this case, as

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread ๏̯͡๏
You mean there is a tuple in either RDD that has itemID = 0 or null? And what is a catch-all? Does that imply it is a good idea to run a filter on each RDD first? We do not do this using Pig on M/R. Is it required in the Spark world? On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
I can promise you that this is also a problem in the Pig world :) Not sure why it's not a problem for this data set, though... Are you sure that the two are running the exact same code? You should inspect your source data. Make a histogram for each and see what the data distribution looks like. If
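One way to build the histogram suggested above is to count records per join key and look at the heaviest keys. This is a sketch on plain Scala collections; `KeyHistogram` and `topKeys` are hypothetical names, not taken from the thread:

```scala
// Sketch only: counts how often each join key occurs, heaviest keys first.
object KeyHistogram {
  def topKeys[K](keys: Seq[K], n: Int): Seq[(K, Long)] =
    keys.groupBy(identity)
      .map { case (k, vs) => (k, vs.size.toLong) } // key -> number of records
      .toSeq
      .sortBy { case (_, count) => -count }        // largest counts first
      .take(n)
}
```

On a pair RDD the counts could be computed in a distributed way instead, for example with `rdd.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)`. If one key holds far more records than the rest, it is a likely cause of the single long-running task.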

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
My guess would be data skew. Do you know if there is some item id that is a catch-all? Can it be null? Item id 0? Lots of data sets have this sort of value and it always kills joins. 2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: Code: val viEventsWithListings: RDD[(Long,