On Mon, Aug 2, 2010 at 3:41 PM, Teodor Macicas <[email protected]>wrote:
> Hi Alex, > > Why are you suggesting using SequenceFiles ? That implies changing the > TeraInputFormat class, right ? > > Because text input file will not work for arbitrary bytes that can contain new line bytes for example. Yes, the old TeraInputFormat will not work. > Your second approach is similar with Sort example from hadoop. The > disadvantage of using it is that I don't have a total order partitioning and > thus more operations are neccessary for creating the final result. > > There is a generic total order partitioner: I provided the links. See the HTDG book as well. > Regards, > Teodor > > > On 08/03/2010 12:21 AM, Alex Kozlov wrote: > >> Hi Teodor, >> >> Certainly org.apache.hadoop.io.DoubleWritable and >> org.apache.hadoop.io.Text >> are different classes. For the approach (1) I suggested, you need just to >> construct byte[10] array from an integer and create a new Text(byte[]) and >> write it together with the value to a sequence file. >> >> Since TeraSort was specifically created for just benchmarking purposes, I >> think it might make sense for you to start with the approach (2). Just >> create a SequenceFile<DoubleWritable,Text> file with your<key,value> >> data >> and do a simple MR job with an identity mapper and identity reducer. I >> can >> send you an example of a MR code, but there are plenty out >> there<http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>. >> >> One of them is TeraSort.java:run() itself, but you may want to use the new >> mapreduce API< >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html >> >. >> >> Once you are comfortable with the MR framework, you can optimize it >> further. >> >> Another good source of information is Tom White's 'Hadoop: The Definitive >> Guide', particularly on the TotalOrderPartitioner. >> >> Let me know if you have any further questions. >> >> Alex K >> >> On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas<[email protected] >> >wrote: >> >> >> >>> Hi Alex, >>> >>> Thank you again. >>> Yes, I'm also thinking of your first suggestion. But that would help me >>> only for 'reducing' the problem from floating points to integers. But I >>> also >>> do not know how to use Terasort for integer keys ! >>> >>> I've tried to use the generic TotalOrderPartitioner instead of the one >>> nested in Terasort class, but I received a lot of errors [0]. I had tried >>> to >>> modify the TeraInputFormat, TeraOutputFormat (and all nested classes) and >>> I've continued getting errors. >>> >>> Now, it's not clear for me what do I have to change in order to make your >>> second solution working. Moreover, I was unable to find a generic MR on >>> my >>> hadoop 0.20.2 version. >>> I'd prefer the first solution, so can you please give me some tips for >>> how >>> to use Terasort for integers ? >>> >>> p.s.: I've made a trick using fixed-length char keys and the program >>> worked >>> for this kind of workload [1]. I think using integer keys instead of this >>> trick would be faster. >>> >>> [0] java.io.IOException: wrong key class: >>> org.apache.hadoop.io.DoubleWritable is not class >>> org.apache.hadoop.io.Text >>> >>> [1] it worked for this: >>> 0000123.45 payload1 >>> 0005120.55 payload2 >>> 0000003.77 payload3 >>> ... >>> >>> Best, >>> Teodor >>> >>> >>> On 08/02/2010 07:41 PM, Alex Kozlov wrote: >>> >>> >>> >>>> Hi Teodor, >>>> >>>> I see the problem now: There is no simple binary comparator for >>>> DoubleWritable< >>>> >>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html >>>> >>>> >>>>> . >>>>> >>>>> >>>> So you can do 2 things: >>>> >>>> 1. Convert your doubles to ints (or bytes), say if the precision is >>>> always >>>> 2 >>>> decimal points, represent the number as 100 x double: The problem is >>>> reduced to sorting integers then. >>>> >>>> 2. Use DoubleWritable as the key and payload as value. You can use >>>> generic >>>> TotalOrderPartitioner< >>>> >>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html >>>> >>>> >>>>> which >>>>> >>>>> >>>> does not use tries. You also can just use a generic MR with >>>> DoubleWritable keys: MR will sort the key for you with identity mapper >>>> and >>>> identity reducer. >>>> >>>> Option 2 is slightly less efficient since the code will need to call >>>> Double.longBitsToDouble each time, but I don't see an easy way to avoid >>>> this >>>> with the IEEE 754 encoding. >>>> >>>> Alex K >>>> >>>> On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas<[email protected] >>>> >>>> >>>>> wrote: >>>>> >>>>> >>>> >>>> >>>> >>>> >>>>> Hi Alex, >>>>> >>>>> Thank you for your quick reply and sorry for not being so clear. >>>>> The job I want to do is simple to sort data having numbers [doubles] as >>>>> keys [0]. I noticed that Terasort is using 10b char key. How can I use >>>>> this >>>>> for my particular job ? >>>>> Do I need to change the Terasort ? >>>>> >>>>> [0] example of workload: >>>>> 123.45 payload1 >>>>> -34.56 payload2 >>>>> 752.10 payload3 >>>>> 10.25 payload4 >>>>> .... >>>>> >>>>> Does this make sense now ? >>>>> >>>>> Regards, >>>>> Teodor >>>>> >>>>> >>>>> On 08/02/2010 12:14 AM, Alex Kozlov wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> Hi Teodor, >>>>>> >>>>>> I am not clear what you call 'real numbers'. Terasort does work on >>>>>> bytes >>>>>> (10 bytes key and 90 bytes payload). The actual 'meaning' of the >>>>>> bytes >>>>>> really does not matter as Hadoop uses binary comparators on the raw >>>>>> value. >>>>>> >>>>>> Total order partitioning should also work with any WritableComparable >>>>>> key >>>>>> (if it doesn't, it's a bug). >>>>>> >>>>>> My guess your problem is converting a char trie to WritableComparable. >>>>>> Can >>>>>> you provide more background? Are the strings of fixed length? >>>>>> >>>>>> Alex K >>>>>> >>>>>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas<[email protected] >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> >>>>>>> I am using hadoop 0.20.2 and I want to use sort huge amount of data. >>>>>>> I've >>>>>>> read about Terasort [from examples], but now it's using 10bytes char >>>>>>> keys. >>>>>>> Changing keys from char to integer wasn't a good solution as Terasort >>>>>>> builds a trie for creating total order partitions. I got stuck when I >>>>>>> tried >>>>>>> to change the char trie to a one suitable for number keys. >>>>>>> >>>>>>> Then, I've given a try to Sort [also from examples] and it did work >>>>>>> for >>>>>>> integer keys, but without a total order partitioning. In the end of >>>>>>> the >>>>>>> day, >>>>>>> the final result can not be created only by putting together all >>>>>>> reducers' >>>>>>> outputs. Each reducer sorts only a subset of data and no merging is >>>>>>> occured >>>>>>> between two reducers. >>>>>>> >>>>>>> Please can anyone advise me what and how to use in order to sort huge >>>>>>> amount of real numbers ? >>>>>>> Looking forward for your replies. >>>>>>> >>>>>>> >>>>>>> Thank you. >>>>>>> Best, >>>>>>> Teodor >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >
