It also lends a little more evidence to Jonathan's suggestion that there is a
null / 0 record that is getting grouped together.
To fix this, do I need to run a filter?
val viEventsRaw = details.map { vi => (vi.get(14).asInstanceOf[Long],
vi) }
val viEvents = viEventsRaw.filter { case
After the above changes
1) filter shows this
Tasks
Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors
0 | 1 | 0 | SUCCESS | ANY | 1 / phxaishdc9dn1571.stratus.phx.ebay.com | 2015/04/20 20:55:31 | 7.4 min | 21 s | 129.7
Shuffle write could be a good indication of skew, but it looks like the
task in question hasn't generated any shuffle write yet, because it's still
working on the shuffle-read side. So I wouldn't read too much into the
fact that the shuffle write is 0 for a task that is still running.
The
I'm not 100% sure of Spark's implementation, but in the MR frameworks it
would have a much larger shuffle write size because that node is dealing
with a lot more data and as a result has a lot more to shuffle.
2015-04-13 13:20 GMT-04:00 java8964 java8...@hotmail.com:
If it is really due to data skew, will the hanging task have a much bigger
Shuffle Write Size in this case?
In this case, the shuffle write size for that task is 0, and the rest of the
I/O for this task is not much larger than that of the tasks that finished
quickly. Is that normal?
I am also interested in this case, as
You mean there is a tuple in either RDD that has itemID = 0 or null?
And what is a catch-all?
Does that imply it is a good idea to run a filter on each RDD first? We do
not do this using Pig on M/R. Is it required in the Spark world?
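For illustration, here is a minimal sketch of what "run a filter first" could look like. It uses plain Scala collections rather than RDDs so it runs without a cluster; the same `filter` call is available on an RDD of key/value pairs. The names `KeyFilter` and `dropCatchAll`, and modelling the item ID as `Option[Long]`, are assumptions made for this example and do not come from the code in this thread.

```scala
object KeyFilter {
  // Hypothetical rule for this sketch: a missing item ID (None) or an
  // item ID of 0 is treated as a "catch-all" key and dropped before the join.
  def dropCatchAll[V](records: Seq[(Option[Long], V)]): Seq[(Option[Long], V)] =
    records.filter { case (itemId, _) => itemId.exists(_ != 0L) }
}
```

Whether dropping those records is acceptable depends on the job: if rows with a null or 0 item ID still need to appear in the output, they would have to be handled separately rather than discarded.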
On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney
I can promise you that this is also a problem in the Pig world :) Not sure
why it's not a problem for this data set, though... are you sure that the
two are running the exact same code?
You should inspect your source data. Make a histogram for each and see what
the data distribution looks like. If
My guess would be data skew. Do you know if there is some item ID that is a
catch-all? Can it be null? Item ID 0? Lots of data sets have this sort of
value, and it always kills joins.
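One way to check for such a key is to count records per item ID and look at the heaviest ones. The sketch below does this on a plain Scala `Seq` so it can be run locally; on an RDD of pairs the equivalent would likely be a `map` to `(itemId, 1L)` followed by `reduceByKey(_ + _)`. The names `SkewCheck` and `topKeys` are hypothetical and exist only for this example.

```scala
object SkewCheck {
  // Count how many records share each key and return the n heaviest keys,
  // largest count first. A single key with a count far above the rest
  // suggests skew on that key.
  def topKeys[K](keys: Seq[K], n: Int): Seq[(K, Int)] =
    keys
      .groupBy(identity)
      .map { case (key, group) => (key, group.size) }
      .toSeq
      .sortBy { case (_, count) => -count }
      .take(n)
}
```

Running this separately on the join key of each input gives a rough histogram for each side, which is usually enough to see whether one value such as 0 dominates.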
2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
Code:
val viEventsWithListings: RDD[(Long,