Hi Teodor, Certainly org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.io.Text are different classes. For the approach (1) I suggested, you need just to construct byte[10] array from an integer and create a new Text(byte[]) and write it together with the value to a sequence file.
Since TeraSort was specifically created for just benchmarking purposes, I think it might make sense for you to start with the approach (2). Just create a SequenceFile<DoubleWritable,Text> file with your <key,value> data and do a simple MR job with an identity mapper and identity reducer. I can send you an example of a MR code, but there are plenty out there<http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>. One of them is TeraSort.java:run() itself, but you may want to use the new mapreduce API<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html>. Once you are comfortable with the MR framework, you can optimize it further. Another good source of information is Tom White's 'Hadoop: The Definitive Guide', particularly on the TotalOrderPartitioner. Let me know if you have any further questions. Alex K On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas <[email protected]>wrote: > Hi Alex, > > Thank you again. > Yes, I'm also thinking of your first suggestion. But that would help me > only for 'reducing' the problem from floating points to integers. But I also > do not know how to use Terasort for integer keys ! > > I've tried to use the generic TotalOrderPartitioner instead of the one > nested in Terasort class, but I received a lot of errors [0]. I had tried to > modify the TeraInputFormat, TeraOutputFormat (and all nested classes) and > I've continued getting errors. > > Now, it's not clear for me what do I have to change in order to make your > second solution working. Moreover, I was unable to find a generic MR on my > hadoop 0.20.2 version. > I'd prefer the first solution, so can you please give me some tips for how > to use Terasort for integers ? > > p.s.: I've made a trick using fixed-length char keys and the program worked > for this kind of workload [1]. I think using integer keys instead of this > trick would be faster. > > [0] java.io.IOException: wrong key class: > org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text > > [1] it worked for this: > 0000123.45 payload1 > 0005120.55 payload2 > 0000003.77 payload3 > ... > > Best, > Teodor > > > On 08/02/2010 07:41 PM, Alex Kozlov wrote: > >> Hi Teodor, >> >> I see the problem now: There is no simple binary comparator for >> DoubleWritable< >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html >> >. >> >> So you can do 2 things: >> >> 1. Convert your doubles to ints (or bytes), say if the precision is always >> 2 >> decimal points, represent the number as 100 x double: The problem is >> reduced to sorting integers then. >> >> 2. Use DoubleWritable as the key and payload as value. You can use >> generic >> TotalOrderPartitioner< >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html >> >which >> >> does not use tries. You also can just use a generic MR with >> DoubleWritable keys: MR will sort the key for you with identity mapper and >> identity reducer. >> >> Option 2 is slightly less efficient since the code will need to call >> Double.longBitsToDouble each time, but I don't see an easy way to avoid >> this >> with the IEEE 754 encoding. >> >> Alex K >> >> On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas<[email protected] >> >wrote: >> >> >> >>> Hi Alex, >>> >>> Thank you for your quick reply and sorry for not being so clear. >>> The job I want to do is simple to sort data having numbers [doubles] as >>> keys [0]. I noticed that Terasort is using 10b char key. How can I use >>> this >>> for my particular job ? >>> Do I need to change the Terasort ? >>> >>> [0] example of workload: >>> 123.45 payload1 >>> -34.56 payload2 >>> 752.10 payload3 >>> 10.25 payload4 >>> .... >>> >>> Does this make sense now ? >>> >>> Regards, >>> Teodor >>> >>> >>> On 08/02/2010 12:14 AM, Alex Kozlov wrote: >>> >>> >>> >>>> Hi Teodor, >>>> >>>> I am not clear what you call 'real numbers'. Terasort does work on >>>> bytes >>>> (10 bytes key and 90 bytes payload). The actual 'meaning' of the bytes >>>> really does not matter as Hadoop uses binary comparators on the raw >>>> value. >>>> >>>> Total order partitioning should also work with any WritableComparable >>>> key >>>> (if it doesn't, it's a bug). >>>> >>>> My guess your problem is converting a char trie to WritableComparable. >>>> Can >>>> you provide more background? Are the strings of fixed length? >>>> >>>> Alex K >>>> >>>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas<[email protected] >>>> >>>> >>>>> wrote: >>>>> >>>>> >>>> >>>> >>>> >>>> >>>>> Hi all, >>>>> >>>>> >>>>> I am using hadoop 0.20.2 and I want to use sort huge amount of data. >>>>> I've >>>>> read about Terasort [from examples], but now it's using 10bytes char >>>>> keys. >>>>> Changing keys from char to integer wasn't a good solution as Terasort >>>>> builds a trie for creating total order partitions. I got stuck when I >>>>> tried >>>>> to change the char trie to a one suitable for number keys. >>>>> >>>>> Then, I've given a try to Sort [also from examples] and it did work for >>>>> integer keys, but without a total order partitioning. In the end of the >>>>> day, >>>>> the final result can not be created only by putting together all >>>>> reducers' >>>>> outputs. Each reducer sorts only a subset of data and no merging is >>>>> occured >>>>> between two reducers. >>>>> >>>>> Please can anyone advise me what and how to use in order to sort huge >>>>> amount of real numbers ? >>>>> Looking forward for your replies. >>>>> >>>>> >>>>> Thank you. >>>>> Best, >>>>> Teodor >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >
