Clearly we should be thinking about exec time. And having to load one
less bag into memory should greatly reduce exec time, at least in the
case where we can't fit that bag into memory and have to spill. I have
no idea how to compare the two and say which is the better performance gain.
A few thoughts:
1) We're in the boat of using tuples any time a user groups, cogroups,
or sorts on more than one column, and for all distincts, correct? So we
have this problem in at least some cases, no matter what.
2) In the previous code, we had switched from using the tuple object
comparator to a binary comparator provided by Hadoop. This gave us a
large speedup. Are we still using that binary comparator?
3) We need to take a look at the tuple and see what is taking so long.
Are we spending the time constructing the tuples (vs. Hadoop's
WritableComparable types), comparing them, etc.?
Alan.
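The binary comparator in point 2 compares the serialized bytes of two keys directly, skipping deserialization entirely. Hadoop's real hook for this is org.apache.hadoop.io.RawComparator (usually via WritableComparator); the following is only a dependency-free sketch of the idea, assuming big-endian ints as the key encoding:

```java
import java.nio.ByteBuffer;

// Sketch of the binary-comparator idea: compare serialized key bytes
// directly instead of deserializing into objects first. Hadoop's real
// interface is org.apache.hadoop.io.RawComparator; this stand-alone
// version assumes keys are big-endian-encoded ints.
public class RawIntComparator {

    // Byte-for-byte unsigned compare. Big-endian encoding means the
    // byte order matches the numeric order, except for the sign bit,
    // which we flip on the first byte to handle negative values.
    public static int compare(byte[] a, int aOff, byte[] b, int bOff) {
        for (int i = 0; i < 4; i++) {
            int x = a[aOff + i] & 0xff;
            int y = b[bOff + i] & 0xff;
            if (i == 0) { x ^= 0x80; y ^= 0x80; } // undo two's-complement sign
            if (x != y) return x - y;
        }
        return 0;
    }

    // Serialize an int as 4 big-endian bytes.
    public static byte[] serialize(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }
}
```

The point is that a sort can order records without ever materializing key objects, which is where the earlier speedup came from.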
Shravan Narayanamurthy wrote:
I completely messed up the calculation of the speed reduction. Sorry. The 30 to 40
times slowdown in comparison time leads to the same overall slowdown
even when we do n log n comparisons :)
Still, don't you think it's a high price to pay just to go from n to n-1 bags? I agree
that the memory savings can be huge, but shouldn't we also be thinking about exec time?
Thanks,
--Shravan
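The correction above amounts to one line of arithmetic: if a single comparison slows from c to k*c, a sort doing n log n comparisons slows from c*n*log n to k*c*n*log n, so the log n factor cancels and the overall comparison-cost slowdown is just k. A tiny sketch of that ratio:

```java
public class SlowdownRatio {
    // If one comparison costs 1 unit and a Tuple comparison costs k
    // units, a sort's total comparison cost goes from n*log2(n) to
    // k*n*log2(n). The n*log2(n) factor appears in both totals and
    // cancels, so the overall slowdown is k (30-40x), not k*log n.
    public static double overallSlowdown(double k, long n) {
        double comparisons = n * (Math.log(n) / Math.log(2));
        double fast = comparisons;       // 1 unit per comparison
        double slow = k * comparisons;   // k units per comparison
        return slow / fast;              // equals k for any n
    }
}
```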
________________________________
From: Shravan Narayanamurthy
Sent: Thu 5/22/2008 11:35 PM
To: Alan Gates
Subject: Comparison between Tuple compare & WritableComparable compare
Hi Alan,
I compared the time to compare two WritableComparables a million times
with the time to compare the same objects when embedded in a Tuple. The
Tuple has two elements: the first is the index and the second is the
actual object:
BOOLEAN : Tuple :: 14.16 : 602.76
BYTEARRAY : Tuple :: 53.94 : 414.06
CHARARRAY : Tuple :: 50.9 : 417.86
FLOAT : Tuple :: 20.2 : 655.4
INTEGER : Tuple :: 14.24 : 539.3
LONG : Tuple :: 16.08 : 578.6
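The original benchmark code is not shown, but the shape of the measurement is presumably something like the following hypothetical harness: time a million direct compares of a boxed value against a million compares of the same value wrapped in a two-element (index, value) tuple, with a plain List standing in for Pig's Tuple here:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reconstruction of the kind of harness behind the numbers
// above (names and structure are assumptions, not the original code):
// direct compares vs. compares through a generic two-element "tuple".
public class CompareBench {

    static int compareDirect(Integer a, Integer b) {
        return a.compareTo(b);
    }

    // Element-by-element compare of (index, value) tuples, mimicking
    // the extra dispatch and casting a generic Tuple comparator pays.
    @SuppressWarnings("unchecked")
    static int compareTuple(List<Object> a, List<Object> b) {
        for (int i = 0; i < a.size(); i++) {
            int c = ((Comparable<Object>) a.get(i)).compareTo(b.get(i));
            if (c != 0) return c;
        }
        return 0;
    }

    static long timeMillis(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        final int N = 1_000_000;
        Integer x = 42, y = 43;
        List<Object> tx = Arrays.asList((Object) 0, (Object) x);
        List<Object> ty = Arrays.asList((Object) 0, (Object) y);
        long direct = timeMillis(() -> { for (int i = 0; i < N; i++) compareDirect(x, y); });
        long tuple  = timeMillis(() -> { for (int i = 0; i < N; i++) compareTuple(tx, ty); });
        System.out.println("direct=" + direct + "ms tuple=" + tuple + "ms");
    }
}
```

Note that absolute numbers from a loop like this are sensitive to JIT warmup and dead-code elimination; the original figures should be read as relative, not absolute, costs.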
The numbers surely look depressing. I was wondering if it's a good idea
to do the (n-1) bag optimization at all. Just adding two inputs to the
cogroup would make us send tuples as keys, incurring a nearly 30 to 40
times slowdown just for comparing. Since we are sorting, we will do
n log n comparisons, thus incurring a 150 to 200 times reduction in
speed. Joins being pretty commonly used, I feel we should avoid this
optimization.
Thanks,
--Shravan