Hi Teodor,

I see the problem now: there is no simple binary comparator for DoubleWritable<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>. So you can do one of two things:
1. Convert your doubles to ints (or longs). Say, if the precision is always 2 decimal places, represent each number as 100 x the double: the problem is then reduced to sorting integers.

2. Use DoubleWritable as the key and the payload as the value. You can use the generic TotalOrderPartitioner<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>, which does not use tries. You can also just run a generic MR job with DoubleWritable keys: MR will sort the keys for you with an identity mapper and an identity reducer.

Option 2 is slightly less efficient, since the code will need to call Double.longBitsToDouble each time, but I don't see an easy way to avoid this with the IEEE 754 encoding.

Alex K

On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <[email protected]> wrote:

> Hi Alex,
>
> Thank you for your quick reply, and sorry for not being so clear.
> The job I want to do is simply to sort data that has numbers [doubles] as
> keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
> this for my particular job? Do I need to change Terasort?
>
> [0] example of workload:
> 123.45 payload1
> -34.56 payload2
> 752.10 payload3
> 10.25 payload4
> ....
>
> Does this make sense now?
>
> Regards,
> Teodor
>
>
> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>
>> Hi Teodor,
>>
>> I am not clear on what you call 'real numbers'. Terasort does work on
>> bytes (a 10-byte key and a 90-byte payload). The actual 'meaning' of the
>> bytes really does not matter, as Hadoop uses binary comparators on the
>> raw values.
>>
>> Total order partitioning should also work with any WritableComparable
>> key (if it doesn't, it's a bug).
>>
>> My guess is that your problem is converting the char trie to a
>> WritableComparable one. Can you provide more background? Are the strings
>> of fixed length?
>>
>> Alex K
>>
>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am using Hadoop 0.20.2 and I want to sort a huge amount of data. I've
>>> read about Terasort [from the examples], but it uses 10-byte char keys.
>>> Changing the keys from char to integer wasn't a good solution, as
>>> Terasort builds a trie for creating the total order partitions. I got
>>> stuck when I tried to change the char trie into one suitable for
>>> numeric keys.
>>>
>>> Then I gave Sort [also from the examples] a try, and it did work for
>>> integer keys, but without total order partitioning. At the end of the
>>> day, the final result cannot be created just by putting together all
>>> the reducers' outputs: each reducer sorts only a subset of the data,
>>> and no merging occurs between two reducers.
>>>
>>> Can anyone please advise me on what to use, and how, in order to sort a
>>> huge amount of real numbers?
>>> Looking forward to your replies.
>>>
>>> Thank you.
>>> Best,
>>> Teodor
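Alex's option 1 can be sketched as follows. This is not code from the thread: the class and method names, and the assumption of exactly 2 decimal places of precision, are illustrative. The point is that scaling each double by 100 and rounding yields a long whose natural integer ordering matches the numeric ordering of the original values, so the key can be emitted as a LongWritable and sorted with the stock binary comparator.

```java
import java.util.Arrays;

// Sketch of option 1 (illustrative names): with a fixed precision of
// 2 decimal places, 100 * value round-trips losslessly through a long,
// and the longs sort in the same order as the doubles.
public class ScaledKey {
    static final int SCALE = 100;  // 2 decimal places

    // Encode a double as a long key (what the mapper would emit,
    // e.g. wrapped in a LongWritable).
    static long encode(double value) {
        return Math.round(value * SCALE);
    }

    // Decode the key back to the original double on the reduce side.
    static double decode(long key) {
        return (double) key / SCALE;
    }

    public static void main(String[] args) {
        // The sample workload from Teodor's mail.
        double[] input = {123.45, -34.56, 752.10, 10.25};
        long[] keys = new long[input.length];
        for (int i = 0; i < input.length; i++) {
            keys[i] = encode(input[i]);
        }
        Arrays.sort(keys);  // plain integer sort stands in for the shuffle sort
        for (long k : keys) {
            System.out.println(decode(k));  // ascending numeric order
        }
    }
}
```

The same idea works with IntWritable if the scaled values fit in 32 bits; the precision assumption is the whole trick, so it only applies when the input really is fixed-point.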
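As background on why there is no simple binary comparator for DoubleWritable: negative doubles have the IEEE 754 sign bit set, and their bit patterns grow as the values decrease, so comparing raw bits orders negatives incorrectly. The sketch below (not something the thread proposes; Hadoop 0.20's comparator instead deserializes with Double.longBitsToDouble, as Alex notes) shows the well-known sign-flip transform that produces longs whose signed ordering matches the numeric ordering of the doubles.

```java
import java.util.Arrays;

// Sketch: making IEEE 754 doubles comparable as plain longs.
// For non-negative doubles the raw bits already sort correctly as signed
// longs; for negative doubles, flipping the low 63 bits (keeping the sign
// bit set) reverses their ordering so it matches numeric order.
public class SortableDoubleBits {

    static long sortableBits(double d) {
        long bits = Double.doubleToLongBits(d);
        return bits >= 0 ? bits : bits ^ 0x7fffffffffffffffL;
    }

    public static void main(String[] args) {
        double[] values = {123.45, -34.56, 752.10, 10.25, -1000.0};
        long[] keys = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = sortableBits(values[i]);
        }
        Arrays.sort(keys);
        // The transform is an involution, so applying it again recovers
        // the original bit patterns; print the doubles in ascending order.
        for (long k : keys) {
            long bits = k >= 0 ? k : k ^ 0x7fffffffffffffffL;
            System.out.println(Double.longBitsToDouble(bits));
        }
    }
}
```

A custom WritableComparable that serializes these transformed longs would be byte-comparable, avoiding the per-comparison deserialization that makes option 2 slightly less efficient.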
