Hi Teodor,

I see the problem now: there is no simple binary comparator for DoubleWritable<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>. So you can do one of two things:
1. Convert your doubles to ints (or longs). Say, if the precision is always 2 decimal places, represent each number as 100 x the double: the problem is then reduced to sorting integers.

2. Use DoubleWritable as the key and the payload as the value. You can use the generic TotalOrderPartitioner<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>, which does not use tries. You can also just run a generic MR job with DoubleWritable keys: MR will sort the keys for you with an identity mapper and an identity reducer.

Option 2 is slightly less efficient, since the code will need to call Double.longBitsToDouble each time, but I don't see an easy way to avoid this with the IEEE 754 encoding.

Alex K

On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <[email protected]> wrote:

> Hi Alex,
>
> Thank you for your quick reply, and sorry for not being so clear.
> The job I want to do is simply to sort data that has numbers [doubles] as
> keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
> this for my particular job? Do I need to change Terasort?
>
> [0] example of workload:
> 123.45 payload1
> -34.56 payload2
> 752.10 payload3
> 10.25 payload4
> ....
>
> Does this make sense now?
>
> Regards,
> Teodor
>
>
> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>
>> Hi Teodor,
>>
>> I am not clear on what you call 'real numbers'. Terasort does work on
>> bytes (a 10-byte key and a 90-byte payload). The actual 'meaning' of the
>> bytes really does not matter, as Hadoop uses binary comparators on the
>> raw values.
>>
>> Total order partitioning should also work with any WritableComparable
>> key (if it doesn't, it's a bug).
>>
>> My guess is that your problem is converting the char trie to a
>> WritableComparable one. Can you provide more background? Are the strings
>> of fixed length?
>>
>> Alex K
>>
>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am using Hadoop 0.20.2 and I want to sort a huge amount of data. I've
>>> read about Terasort [from the examples], but it uses 10-byte char keys.
>>> Changing the keys from char to integer wasn't a good solution, as
>>> Terasort builds a trie for creating the total order partitions. I got
>>> stuck when I tried to change the char trie into one suitable for
>>> numeric keys.
>>>
>>> Then I gave Sort [also from the examples] a try, and it did work for
>>> integer keys, but without total order partitioning. At the end of the
>>> day, the final result cannot be created just by putting together all
>>> the reducers' outputs: each reducer sorts only a subset of the data,
>>> and no merging occurs between two reducers.
>>>
>>> Can anyone please advise me on what to use, and how, in order to sort a
>>> huge amount of real numbers?
>>> Looking forward to your replies.
>>>
>>> Thank you.
>>> Best,
>>> Teodor
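Alex's option 1 can be sketched as follows. This is not code from the thread: the class and method names, and the assumption of exactly 2 decimal places of precision, are illustrative. The point is that scaling each double by 100 and rounding yields a long whose natural integer ordering matches the numeric ordering of the original values, so the key can be emitted as a LongWritable and sorted with the stock binary comparator.

```java
import java.util.Arrays;

// Sketch of option 1 (illustrative names): with a fixed precision of
// 2 decimal places, 100 * value round-trips losslessly through a long,
// and the longs sort in the same order as the doubles.
public class ScaledKey {
    static final int SCALE = 100;  // 2 decimal places

    // Encode a double as a long key (what the mapper would emit,
    // e.g. wrapped in a LongWritable).
    static long encode(double value) {
        return Math.round(value * SCALE);
    }

    // Decode the key back to the original double on the reduce side.
    static double decode(long key) {
        return (double) key / SCALE;
    }

    public static void main(String[] args) {
        // The sample workload from Teodor's mail.
        double[] input = {123.45, -34.56, 752.10, 10.25};
        long[] keys = new long[input.length];
        for (int i = 0; i < input.length; i++) {
            keys[i] = encode(input[i]);
        }
        Arrays.sort(keys);  // plain integer sort stands in for the shuffle sort
        for (long k : keys) {
            System.out.println(decode(k));  // ascending numeric order
        }
    }
}
```

The same idea works with IntWritable if the scaled values fit in 32 bits; the precision assumption is the whole trick, so it only applies when the input really is fixed-point.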
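As background on why there is no simple binary comparator for DoubleWritable: negative doubles have the IEEE 754 sign bit set, and their bit patterns grow as the values decrease, so comparing raw bits orders negatives incorrectly. The sketch below (not something the thread proposes; Hadoop 0.20's comparator instead deserializes with Double.longBitsToDouble, as Alex notes) shows the well-known sign-flip transform that produces longs whose signed ordering matches the numeric ordering of the doubles.

```java
import java.util.Arrays;

// Sketch: making IEEE 754 doubles comparable as plain longs.
// For non-negative doubles the raw bits already sort correctly as signed
// longs; for negative doubles, flipping the low 63 bits (keeping the sign
// bit set) reverses their ordering so it matches numeric order.
public class SortableDoubleBits {

    static long sortableBits(double d) {
        long bits = Double.doubleToLongBits(d);
        return bits >= 0 ? bits : bits ^ 0x7fffffffffffffffL;
    }

    public static void main(String[] args) {
        double[] values = {123.45, -34.56, 752.10, 10.25, -1000.0};
        long[] keys = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = sortableBits(values[i]);
        }
        Arrays.sort(keys);
        // The transform is an involution, so applying it again recovers
        // the original bit patterns; print the doubles in ascending order.
        for (long k : keys) {
            long bits = k >= 0 ? k : k ^ 0x7fffffffffffffffL;
            System.out.println(Double.longBitsToDouble(bits));
        }
    }
}
```

A custom WritableComparable that serializes these transformed longs would be byte-comparable, avoiding the per-comparison deserialization that makes option 2 slightly less efficient.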
