Hi Alex,
Thank you again.
Yes, I'm also leaning toward your first suggestion, but that only helps me
by reducing the problem from floating-point keys to integer keys, and I
don't know how to use Terasort for integer keys either!
I've tried using the generic TotalOrderPartitioner instead of the one
nested in the Terasort class, but I got a lot of errors [0]. I also tried
modifying TeraInputFormat and TeraOutputFormat (and all their nested
classes) and kept getting errors.
It's also not clear to me what I have to change to make your second
solution work. Moreover, I was unable to find a generic MR example in my
Hadoop 0.20.2 version.
I'd prefer the first solution, so could you please give me some tips on
how to use Terasort with integer keys?
P.S.: I used a trick with fixed-length char keys and the program worked
for that kind of workload [1]; a sketch of the formatting follows the
example below. I think using integer keys instead of this trick would be
faster.
[0] java.io.IOException: wrong key class:
org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text
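If I read this error right, the JobConf still declares Text as a key class
somewhere while the code emits DoubleWritable. I assume the relevant
settings on 0.20.2 look something like this (the job class name is just a
placeholder), though I haven't verified that this alone fixes it:

    JobConf conf = new JobConf(MySortJob.class);     // placeholder job class
    conf.setMapOutputKeyClass(DoubleWritable.class); // keys the mapper emits
    conf.setMapOutputValueClass(Text.class);         // the payload values
    conf.setOutputKeyClass(DoubleWritable.class);    // final output key class
    conf.setOutputValueClass(Text.class);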
[1] it worked for this:
0000123.45 payload1
0005120.55 payload2
0000003.77 payload3
...
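For completeness, the formatting trick is roughly this (a minimal sketch,
assuming non-negative values with exactly two decimal places, since
negatives would break the lexicographic order of Text keys):

    // Format a non-negative double as a fixed-width, zero-padded string so
    // that Text's byte-by-byte comparison matches numeric order.
    public static String toFixedWidthKey(double value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative keys break the order");
        }
        return String.format("%010.2f", value); // 123.45 -> "0000123.45"
    }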
Best,
Teodor
On 08/02/2010 07:41 PM, Alex Kozlov wrote:
Hi Teodor,
I see the problem now: there is no simple binary comparator for
DoubleWritable <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
So you can do one of two things:
1. Convert your doubles to ints (or bytes): if the precision is always 2
decimal places, represent each number as 100 x the double (e.g., 123.45
becomes 12345). The problem is then reduced to sorting integers; see the
first sketch after this list.
2. Use DoubleWritable as the key and the payload as the value. You can use
the generic TotalOrderPartitioner <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
which does not use tries. You can also just use a generic MR job with
DoubleWritable keys: MR will sort the keys for you with an identity mapper
and an identity reducer.
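A minimal sketch of option 1 on the old 0.20 mapred API (the class name,
the parsing, and the assumption of one "key payload" pair per line of text
input are illustrative, not from Terasort):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Scale each double key by 100 and emit it as a LongWritable, whose
    // registered raw comparator already sorts numerically (negatives too).
    public class ScaleKeyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<LongWritable, Text> out, Reporter reporter)
                throws IOException {
            String[] parts = line.toString().split("\\s+", 2); // "123.45 payload1"
            long scaled = Math.round(Double.parseDouble(parts[0]) * 100); // -> 12345
            out.collect(new LongWritable(scaled), new Text(parts[1]));
        }
    }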
Option 2 is slightly less efficient since the code will need to call
Double.longBitsToDouble each time, but I don't see an easy way to avoid this
with the IEEE 754 encoding.
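That said, if the precision is not fixed, there is a lossless variant of
option 1 (a standard bit trick, sketched here; not something shipped with
Hadoop 0.20.2): remap the IEEE 754 bits to a long whose signed order
matches the double order, and carry that long in a LongWritable:

    // Encode a double as a long that sorts (as a signed long) in the same
    // order as the original double. Decode by reversing the transform and
    // calling Double.longBitsToDouble -- once at output time, not per compare.
    public static long toSortableLong(double d) {
        long bits = Double.doubleToLongBits(d);
        // Non-negative doubles already order correctly as signed longs;
        // negative doubles need their 63 non-sign bits inverted.
        return (bits >= 0) ? bits : (bits ^ 0x7fffffffffffffffL);
    }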
Alex K
On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <[email protected]> wrote:
Hi Alex,
Thank you for your quick reply, and sorry for not being clearer.
The job I want to run is simple: sort data that has numbers [doubles] as
keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
it for my particular job?
Do I need to change Terasort?
[0] example of workload:
123.45 payload1
-34.56 payload2
752.10 payload3
10.25 payload4
....
Does this make sense now?
Regards,
Teodor
On 08/02/2010 12:14 AM, Alex Kozlov wrote:
Hi Teodor,
I am not clear on what you mean by 'real numbers'. Terasort does work on
bytes (a 10-byte key and a 90-byte payload). The actual 'meaning' of the
bytes does not matter, as Hadoop uses binary comparators on the raw values.
Total order partitioning should also work with any WritableComparable key
(if it doesn't, it's a bug); a rough sketch of such a job follows.
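A sketch of a total-order sort with DoubleWritable keys on the old 0.20
mapred API. The input is assumed to be a SequenceFile of
<DoubleWritable, Text> pairs; the class name, paths, reducer count, and
sampler parameters are all illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.*;

    public class DoubleSort {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(DoubleSort.class);
            conf.setJobName("double-total-order-sort");
            conf.setInputFormat(SequenceFileInputFormat.class);
            conf.setOutputFormat(SequenceFileOutputFormat.class);
            conf.setOutputKeyClass(DoubleWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setMapperClass(IdentityMapper.class);   // MR sorts keys itself
            conf.setReducerClass(IdentityReducer.class);
            conf.setNumReduceTasks(4);                   // set before sampling
            conf.setPartitionerClass(TotalOrderPartitioner.class);
            FileInputFormat.setInputPaths(conf, new Path("/input/doubles"));
            FileOutputFormat.setOutputPath(conf, new Path("/output/sorted"));

            // Sample the input to choose partition boundaries, and write them
            // where TotalOrderPartitioner will look for them.
            Path partitionFile = new Path("/tmp/_partitions");
            TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
            InputSampler.writePartitionFile(conf,
                new InputSampler.RandomSampler<DoubleWritable, Text>(0.1, 10000));
            JobClient.runJob(conf);
        }
    }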
My guess is that your problem is converting the char trie to a
WritableComparable. Can you provide more background? Are the strings of
fixed length?
Alex K
On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <[email protected]> wrote:
Hi all,
I am using Hadoop 0.20.2 and I want to sort a huge amount of data. I've
read about Terasort [from the examples], but it uses 10-byte char keys.
Changing the keys from char to integer wasn't a good solution, as Terasort
builds a trie for creating total-order partitions. I got stuck when I
tried to change the char trie to one suitable for numeric keys.
Then I gave Sort [also from the examples] a try, and it did work for
integer keys, but without total order partitioning. At the end of the day,
the final result cannot be produced just by concatenating all the
reducers' outputs: each reducer sorts only a subset of the data, and no
merging occurs between two reducers.
Can anyone please advise me on what to use, and how, in order to sort a
huge amount of real numbers?
Looking forward to your replies.
Thank you.
Best,
Teodor