Re: [HADOOP] Terasort for numbers

Alex Kozlov Mon, 02 Aug 2010 15:51:55 -0700

On Mon, Aug 2, 2010 at 3:41 PM, Teodor Macicas <[email protected]> wrote:


> Hi Alex,
>
> Why are you suggesting using SequenceFiles ? That implies changing the
> TeraInputFormat class, right ?
>
>
Because a text input file will not work for arbitrary bytes, which can
contain newline bytes, for example.  Yes, the old TeraInputFormat will not work.
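
To illustrate the point, here is a minimal sketch of a binary key (my own
illustration, not TeraSort code; the class and method names are
hypothetical).  It assumes two-decimal precision, scales the double to a
long, and packs it into a 10-byte key whose unsigned byte-wise order matches
the numeric order.  Some keys end up containing the byte 0x0A, which a
line-based text input format would treat as a record separator:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SortableKey {
    // Scale a double with two decimal places to a long (assumes the
    // precision really is fixed at two decimals).
    static long toScaled(double d) {
        return Math.round(d * 100.0);
    }

    // 10-byte key: 2 zero pad bytes + 8 big-endian bytes with the sign bit
    // flipped, so that negative values sort before positive ones when the
    // bytes are compared as unsigned values.
    static byte[] encode(long v) {
        ByteBuffer buf = ByteBuffer.allocate(10);
        buf.putShort((short) 0);
        buf.putLong(v ^ Long.MIN_VALUE);
        return buf.array();
    }

    // Inverse of encode().
    static long decode(byte[] key) {
        return ByteBuffer.wrap(key).getLong(2) ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, which is what a raw byte
    // comparator does.
    static int compareBytes(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // True if the key contains a newline byte (0x0A).
    static boolean containsNewline(byte[] key) {
        for (byte b : key) {
            if (b == '\n') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[] keys = {123.45, -34.56, 752.10, 10.25, 0.10};
        byte[][] encoded = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) {
            encoded[i] = encode(toScaled(keys[i]));
        }
        Arrays.sort(encoded, SortableKey::compareBytes);
        for (byte[] key : encoded) {
            System.out.println(decode(key) / 100.0
                    + " containsNewline=" + containsNewline(key));
        }
    }
}
```

The key for 0.10 scales to 10, so its last byte is 0x0A; that is exactly the
kind of record a newline-delimited file would split in two.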


> Your second approach is similar to the Sort example from hadoop. The
> disadvantage of using it is that I don't have a total order partitioning and
> thus more operations are necessary for creating the final result.
>
>
There is a generic total order partitioner: I provided the links.  See the
HTDG book as well.
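
The idea behind total order partitioning can be sketched in a few lines.
This is only my illustration of the principle, not the Hadoop class: the
RangePartitioner name, the long keys, and the way the split points are picked
from a sample are all assumptions.  Split points are taken from a sorted
sample, and each key is routed by binary search, so every key in partition i
is smaller than every key in partition i+1:

```java
import java.util.Arrays;

public class RangePartitioner {
    // Sorted split points; there are numPartitions - 1 of them.
    private final long[] splitPoints;

    // Pick evenly spaced split points from a sample of the keys.
    // Assumes the sample is large enough to yield distinct split points.
    RangePartitioner(long[] sample, int numPartitions) {
        long[] sorted = sample.clone();
        Arrays.sort(sorted);
        splitPoints = new long[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splitPoints[i - 1] = sorted[i * sorted.length / numPartitions];
        }
    }

    // Keys below the first split point go to partition 0; a key equal to
    // split point i goes to partition i + 1.
    int getPartition(long key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        long[] sample = {12345, -3456, 75210, 1025, 377, 512055};
        RangePartitioner p = new RangePartitioner(sample, 3);
        for (long key : sample) {
            System.out.println(key + " -> partition " + p.getPartition(key));
        }
    }
}
```

With this routing, concatenating the reducer outputs in partition order gives
a globally sorted result, with no extra merge step.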


> Regards,
> Teodor
>
>
> On 08/03/2010 12:21 AM, Alex Kozlov wrote:
>
>> Hi Teodor,
>>
>> Certainly org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.io.Text
>> are different classes.  For the approach (1) I suggested, you just need to
>> construct a byte[10] array from an integer, create a new Text(byte[]) from
>> it, and write it together with the value to a sequence file.
>>
>> Since TeraSort was created specifically for benchmarking purposes, I
>> think it might make sense for you to start with the approach (2).  Just
>> create a SequenceFile<DoubleWritable,Text> file with your <key,value> data
>> and do a simple MR job with an identity mapper and identity reducer.  I can
>> send you an example of MR code, but there are plenty out there
>> <http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>.
>>
>> One of them is TeraSort.java:run() itself, but you may want to use the new
>> mapreduce API
>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html>.
>>
>> Once you are comfortable with the MR framework, you can optimize it
>> further.
>>
>> Another good source of information is Tom White's 'Hadoop: The Definitive
>> Guide', particularly on the TotalOrderPartitioner.
>>
>> Let me know if you have any further questions.
>>
>> Alex K
>>
>> On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas <[email protected]> wrote:
>>
>>> Hi Alex,
>>>
>>> Thank you again.
>>> Yes, I'm also thinking of your first suggestion. But that would only help
>>> me 'reduce' the problem from floating points to integers, and I also do
>>> not know how to use Terasort for integer keys!
>>>
>>> I've tried to use the generic TotalOrderPartitioner instead of the one
>>> nested in the Terasort class, but I received a lot of errors [0]. I tried
>>> to modify TeraInputFormat and TeraOutputFormat (and all nested classes),
>>> but I kept getting errors.
>>>
>>> Now, it's not clear to me what I have to change in order to make your
>>> second solution work. Moreover, I was unable to find a generic MR in my
>>> hadoop 0.20.2 version.
>>> I'd prefer the first solution, so can you please give me some tips on how
>>> to use Terasort for integers?
>>>
>>> p.s.: I've used a trick with fixed-length char keys, and the program worked
>>> for this kind of workload [1]. I think using integer keys instead of this
>>> trick would be faster.
>>>
>>> [0] java.io.IOException: wrong key class:
>>> org.apache.hadoop.io.DoubleWritable is not class
>>> org.apache.hadoop.io.Text
>>>
>>> [1] it worked for this:
>>> 0000123.45 payload1
>>> 0005120.55 payload2
>>> 0000003.77 payload3
>>> ...
>>>
>>> Best,
>>> Teodor
>>>
>>>
>>> On 08/02/2010 07:41 PM, Alex Kozlov wrote:
>>>
>>>
>>>
>>>> Hi Teodor,
>>>>
>>>> I see the problem now:  There is no simple binary comparator for
>>>> DoubleWritable
>>>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
>>>>
>>>> So you can do 2 things:
>>>>
>>>> 1. Convert your doubles to ints (or bytes): say, if the precision is always
>>>> 2 decimal places, represent the number as 100 x double.  The problem is
>>>> then reduced to sorting integers.
>>>>
>>>> 2. Use DoubleWritable as the key and payload as value.  You can use the
>>>> generic TotalOrderPartitioner
>>>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
>>>> which does not use tries.  You can also just use a generic MR with
>>>> DoubleWritable keys: MR will sort the keys for you with an identity mapper
>>>> and identity reducer.
>>>>
>>>> Option 2 is slightly less efficient since the code will need to call
>>>> Double.longBitsToDouble each time, but I don't see an easy way to avoid
>>>> this with the IEEE 754 encoding.
>>>>
>>>> Alex K
>>>>
>>>> On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <[email protected]> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> Thank you for your quick reply and sorry for not being so clear.
>>>>> The job I want to do is simply to sort data having numbers [doubles] as
>>>>> keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
>>>>> this for my particular job?
>>>>> Do I need to change Terasort?
>>>>>
>>>>> [0] example of workload:
>>>>> 123.45    payload1
>>>>> -34.56     payload2
>>>>> 752.10    payload3
>>>>> 10.25      payload4
>>>>> ....
>>>>>
>>>>> Does this make sense now ?
>>>>>
>>>>> Regards,
>>>>> Teodor
>>>>>
>>>>>
>>>>> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi Teodor,
>>>>>>
>>>>>> I am not clear what you call 'real numbers'.  Terasort does work on bytes
>>>>>> (10-byte key and 90-byte payload).  The actual 'meaning' of the bytes
>>>>>> really does not matter, as Hadoop uses binary comparators on the raw
>>>>>> value.
>>>>>>
>>>>>> Total order partitioning should also work with any WritableComparable key
>>>>>> (if it doesn't, it's a bug).
>>>>>>
>>>>>> My guess is that your problem is converting a char trie to
>>>>>> WritableComparable.  Can you provide more background?  Are the strings of
>>>>>> fixed length?
>>>>>>
>>>>>> Alex K
>>>>>>
>>>>>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> I am using hadoop 0.20.2 and I want to sort a huge amount of data.
>>>>>>> I've read about Terasort [from examples], but it uses 10-byte char
>>>>>>> keys. Changing the keys from char to integer wasn't a good solution, as
>>>>>>> Terasort builds a trie for creating total order partitions. I got stuck
>>>>>>> when I tried to change the char trie to one suitable for number keys.
>>>>>>>
>>>>>>> Then I gave Sort [also from examples] a try, and it did work for
>>>>>>> integer keys, but without total order partitioning. At the end of the
>>>>>>> day, the final result cannot be created just by putting together all
>>>>>>> the reducers' outputs. Each reducer sorts only a subset of the data, and
>>>>>>> no merging occurs between two reducers.
>>>>>>>
>>>>>>> Can anyone please advise me on what to use, and how, in order to sort a
>>>>>>> huge amount of real numbers?
>>>>>>> Looking forward to your replies.
>>>>>>>
>>>>>>>
>>>>>>> Thank you.
>>>>>>> Best,
>>>>>>> Teodor
