Clearly we should be thinking about exec time. And having to load one
less bag into memory should greatly reduce exec time, at least in the
case where we can't fit that bag into memory and have to spill. I have
no idea how to compare the two and say which is the better performance gain.
A few thoughts:
1) We're in the boat of using tuples any time a user groups, cogroups,
or sorts on more than one column, and for all distincts, correct? So we
have this problem in at least some cases, no matter what.
2) In the previous code, we had switched from using the tuple object
comparator to a binary comparator provided by Hadoop. This gave us a
large speedup. Are we still using that binary comparator?
3) We need to take a look at the tuple and see what is taking so long.
Are we spending the time constructing the tuples (vs. Hadoop's
WritableComparable types), comparing them, etc.?
Alan.
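The binary comparator in point 2 compares the serialized bytes of two keys directly, skipping deserialization entirely. Hadoop's real hook for this is org.apache.hadoop.io.RawComparator (usually via WritableComparator); the following is only a dependency-free sketch of the idea, assuming big-endian ints as the key encoding:

```java
import java.nio.ByteBuffer;

// Sketch of the binary-comparator idea: compare serialized key bytes
// directly instead of deserializing into objects first. Hadoop's real
// interface is org.apache.hadoop.io.RawComparator; this stand-alone
// version assumes keys are big-endian-encoded ints.
public class RawIntComparator {

    // Byte-for-byte unsigned compare. Big-endian encoding means the
    // byte order matches the numeric order, except for the sign bit,
    // which we flip on the first byte to handle negative values.
    public static int compare(byte[] a, int aOff, byte[] b, int bOff) {
        for (int i = 0; i < 4; i++) {
            int x = a[aOff + i] & 0xff;
            int y = b[bOff + i] & 0xff;
            if (i == 0) { x ^= 0x80; y ^= 0x80; } // undo two's-complement sign
            if (x != y) return x - y;
        }
        return 0;
    }

    // Serialize an int as 4 big-endian bytes.
    public static byte[] serialize(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }
}
```

The point is that a sort can order records without ever materializing key objects, which is where the earlier speedup came from.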
Shravan Narayanamurthy wrote:
I completely messed up the calculation of the speed reduction. Sorry. The 30 to 40
times slowdown in comparison time leads to the same overall slowdown
even when we do n log n comparisons :)
Still, don't you think it's a high price to pay just to go from n to n-1 bags? I agree
that the memory savings can be huge, but shouldn't we also be thinking about exec time?
Thanks,
--Shravan
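The correction above amounts to one line of arithmetic: if a single comparison slows from c to k*c, a sort doing n log n comparisons slows from c*n*log n to k*c*n*log n, so the log n factor cancels and the overall comparison-cost slowdown is just k. A tiny sketch of that ratio:

```java
public class SlowdownRatio {
    // If one comparison costs 1 unit and a Tuple comparison costs k
    // units, a sort's total comparison cost goes from n*log2(n) to
    // k*n*log2(n). The n*log2(n) factor appears in both totals and
    // cancels, so the overall slowdown is k (30-40x), not k*log n.
    public static double overallSlowdown(double k, long n) {
        double comparisons = n * (Math.log(n) / Math.log(2));
        double fast = comparisons;       // 1 unit per comparison
        double slow = k * comparisons;   // k units per comparison
        return slow / fast;              // equals k for any n
    }
}
```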
________________________________
From: Shravan Narayanamurthy
Sent: Thu 5/22/2008 11:35 PM
To: Alan Gates
Subject: Comparison between Tuple compare & WritableComparable compare
Hi Alan,
I compared the time to compare two WritableComparables a million times
with the time to compare the same objects when embedded in a Tuple. The
Tuple has two elements: the first is the index and the second is the
actual object:
BOOLEAN : Tuple :: 14.16 : 602.76
BYTEARRAY : Tuple :: 53.94 : 414.06
CHARARRAY : Tuple :: 50.9 : 417.86
FLOAT : Tuple :: 20.2 : 655.4
INTEGER : Tuple :: 14.24 : 539.3
LONG : Tuple :: 16.08 : 578.6
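The original benchmark code is not shown, but the shape of the measurement is presumably something like the following hypothetical harness: time a million direct compares of a boxed value against a million compares of the same value wrapped in a two-element (index, value) tuple, with a plain List standing in for Pig's Tuple here:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reconstruction of the kind of harness behind the numbers
// above (names and structure are assumptions, not the original code):
// direct compares vs. compares through a generic two-element "tuple".
public class CompareBench {

    static int compareDirect(Integer a, Integer b) {
        return a.compareTo(b);
    }

    // Element-by-element compare of (index, value) tuples, mimicking
    // the extra dispatch and casting a generic Tuple comparator pays.
    @SuppressWarnings("unchecked")
    static int compareTuple(List<Object> a, List<Object> b) {
        for (int i = 0; i < a.size(); i++) {
            int c = ((Comparable<Object>) a.get(i)).compareTo(b.get(i));
            if (c != 0) return c;
        }
        return 0;
    }

    static long timeMillis(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        final int N = 1_000_000;
        Integer x = 42, y = 43;
        List<Object> tx = Arrays.asList((Object) 0, (Object) x);
        List<Object> ty = Arrays.asList((Object) 0, (Object) y);
        long direct = timeMillis(() -> { for (int i = 0; i < N; i++) compareDirect(x, y); });
        long tuple  = timeMillis(() -> { for (int i = 0; i < N; i++) compareTuple(tx, ty); });
        System.out.println("direct=" + direct + "ms tuple=" + tuple + "ms");
    }
}
```

Note that absolute numbers from a loop like this are sensitive to JIT warmup and dead-code elimination; the original figures should be read as relative, not absolute, costs.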
The numbers surely look depressing. I was wondering if it's a good idea
to do the (n-1) bag optimization at all. Just adding two inputs to the
cogroup would make us send tuples as keys, incurring a nearly 30 to 40
times slowdown just for comparing. Since we are sorting, we will do
n log n comparisons, thus incurring a 150 to 200 times reduction in
speed. Joins being pretty commonly used, I feel we should avoid this
optimization.
Thanks,
--Shravan