Re: [HADOOP] Terasort for numbers

Alex Kozlov Mon, 02 Aug 2010 15:22:11 -0700

Hi Teodor,

Certainly, org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.io.Text
are different classes, which is what the error is telling you.  For approach
(1), you just need to construct a byte[10] array from the integer, create a
new Text(byte[]), and write it together with the value to a sequence file.
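
A minimal sketch of approach (1), assuming two decimal places of precision
and using only the standard library (the class and method names are
hypothetical, not part of Hadoop or TeraSort).  It scales the double to a
long, flips the sign bit so that negative keys sort first, and writes the
result big-endian into a 10-byte array.  The resulting byte[] is what you
would pass to new Text(byte[]):

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: turns a two-decimal double into a 10-byte sortable key. */
class FixedPointKey {
    static final int KEY_LENGTH = 10; // same width as the TeraSort key

    /** Scale by 100 and round. Assumes |value| * 100 fits in a long. */
    static long toFixedPoint(double value) {
        return Math.round(value * 100.0);
    }

    /** Big-endian bytes with the sign bit flipped, so unsigned byte order matches numeric order. */
    static byte[] encode(long fixed) {
        ByteBuffer buf = ByteBuffer.allocate(KEY_LENGTH); // big-endian by default
        buf.putShort((short) 0);             // two constant bytes pad the 8-byte long to 10
        buf.putLong(fixed ^ Long.MIN_VALUE); // maps signed order onto unsigned order
        return buf.array();
    }

    /** Unsigned lexicographic comparison, which is how a raw byte comparator orders keys. */
    static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

With this encoding, the keys for -34.56, 10.25 and 123.45 compare in numeric
order even though only raw bytes are compared.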


Since TeraSort was created specifically for benchmarking, I think it might
make sense for you to start with approach (2).  Just create a
SequenceFile<DoubleWritable,Text> file with your <key,value> data and run a
simple MR job with an identity mapper and an identity reducer.  I can send
you an example of MR code, but there are plenty out there
<http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>.
One of them is TeraSort.java:run() itself, but you may want to use the new
mapreduce API
<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html>.
Once you are comfortable with the MR framework, you can optimize it further.

Another good source of information is Tom White's 'Hadoop: The Definitive
Guide', particularly on the TotalOrderPartitioner.
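
To illustrate the idea behind total order partitioning (only a sketch with
hypothetical names, not Hadoop's TotalOrderPartitioner code): each key is
assigned to a partition by a binary search over sorted split points, so
every key sent to reducer i is smaller than every key sent to reducer i+1,
and the sorted reducer outputs can simply be concatenated.  The split points
are assumed to be distinct and would normally come from sampling the input:

```java
import java.util.Arrays;

/** Hypothetical stand-in for a total-order partitioner over double keys. */
class RangePartitionSketch {
    /**
     * Returns the partition index for a key, given sorted, distinct split points.
     * With n split points there are n + 1 partitions; a key equal to a split
     * point goes to the partition on its right.
     */
    static int partitionFor(double key, double[] sortedSplits) {
        int pos = Arrays.binarySearch(sortedSplits, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

For example, with split points {0.0, 100.0} there are three partitions:
negative keys go to partition 0, keys in [0, 100) to partition 1, and the
rest to partition 2.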

Let me know if you have any further questions.

Alex K

On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas <[email protected]> wrote:

> Hi Alex,
>
> Thank you again.
> Yes, I'm also considering your first suggestion. But that would only help
> me reduce the problem from floating-point numbers to integers, and I also
> do not know how to use Terasort for integer keys!
>
> I've tried to use the generic TotalOrderPartitioner instead of the one
> nested in the Terasort class, but I received a lot of errors [0]. I also
> tried to modify TeraInputFormat and TeraOutputFormat (and all nested
> classes), and I kept getting errors.
>
> Now, it's not clear to me what I have to change to make your second
> solution work. Moreover, I was unable to find a generic MR job in my
> hadoop 0.20.2 version.
> I'd prefer the first solution, so can you please give me some tips on how
> to use Terasort for integers?
>
> p.s.: I used a trick with fixed-length char keys, and the program worked
> for this kind of workload [1]. I think using integer keys instead of this
> trick would be faster.
>
> [0] java.io.IOException: wrong key class:
> org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text
>
> [1] it worked for this:
> 0000123.45 payload1
> 0005120.55 payload2
> 0000003.77 payload3
> ...
>
> Best,
> Teodor
>
>
> On 08/02/2010 07:41 PM, Alex Kozlov wrote:
>
>> Hi Teodor,
>>
>> I see the problem now: there is no simple binary comparator for
>> DoubleWritable
>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
>>
>> So you can do 2 things:
>>
>> 1. Convert your doubles to ints (or bytes).  Say, if the precision is
>> always 2 decimal places, represent the number as 100 x double.  The
>> problem is then reduced to sorting integers.
>>
>> 2. Use DoubleWritable as the key and the payload as the value.  You can
>> use the generic TotalOrderPartitioner
>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
>> which does not use tries.  You can also just use a generic MR job with
>> DoubleWritable keys: MR will sort the keys for you with an identity
>> mapper and an identity reducer.
>>
>> Option 2 is slightly less efficient since the code will need to call
>> Double.longBitsToDouble each time, but I don't see an easy way to avoid
>> this with the IEEE 754 encoding.
>>
>> Alex K
>>
>> On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <[email protected]> wrote:
>>
>>
>>
>>> Hi Alex,
>>>
>>> Thank you for your quick reply and sorry for not being so clear.
>>> The job I want to do is simply to sort data having numbers [doubles] as
>>> keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
>>> this for my particular job?
>>> Do I need to change Terasort?
>>>
>>> [0] example of workload:
>>> 123.45    payload1
>>> -34.56     payload2
>>> 752.10    payload3
>>> 10.25      payload4
>>> ....
>>>
>>> Does this make sense now ?
>>>
>>> Regards,
>>> Teodor
>>>
>>>
>>> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>>>
>>>
>>>
>>>> Hi Teodor,
>>>>
>>>> I am not clear on what you call 'real numbers'.  Terasort works on bytes
>>>> (10-byte key and 90-byte payload).  The actual 'meaning' of the bytes
>>>> really does not matter, as Hadoop uses binary comparators on the raw
>>>> value.
>>>>
>>>> Total order partitioning should also work with any WritableComparable
>>>> key (if it doesn't, it's a bug).
>>>>
>>>> My guess is that your problem is converting a char trie to
>>>> WritableComparable.  Can you provide more background?  Are the strings
>>>> of fixed length?
>>>>
>>>> Alex K
>>>>
>>>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>> I am using hadoop 0.20.2 and I want to sort a huge amount of data.
>>>>> I've read about Terasort [from examples], but it uses 10-byte char
>>>>> keys. Changing the keys from char to integer wasn't a good solution, as
>>>>> Terasort builds a trie for creating total order partitions. I got stuck
>>>>> when I tried to change the char trie to one suitable for number keys.
>>>>>
>>>>> Then I gave Sort [also from examples] a try, and it did work for
>>>>> integer keys, but without total order partitioning. In the end, the
>>>>> final result cannot be created just by putting together all the
>>>>> reducers' outputs. Each reducer sorts only a subset of the data, and no
>>>>> merging occurs between reducers.
>>>>>
>>>>> Can anyone please advise me on what to use, and how, in order to sort a
>>>>> huge amount of real numbers?
>>>>> Looking forward to your replies.
>>>>>
>>>>>
>>>>> Thank you.
>>>>> Best,
>>>>> Teodor
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
