Hi Alex,

Why are you suggesting using SequenceFiles? That implies changing the TeraInputFormat class, right?

Your second approach is similar to the Sort example from Hadoop. The disadvantage of using it is that I don't get total order partitioning, and thus more operations are necessary to create the final result.

Regards,
Teodor

On 08/03/2010 12:21 AM, Alex Kozlov wrote:
Hi Teodor,

Certainly org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.io.Text
are different classes.  For approach (1), which I suggested, you just need
to construct a byte[10] array from the integer, create a new Text(byte[])
from it, and write it together with the value to a sequence file.
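
For what it's worth, here is a rough sketch of that write step, assuming the doubles have already been scaled to non-negative integers; the class name, output path, and 100x scaling are my assumptions, not anything from TeraSort:

// Sketch only: pack a scaled, non-negative integer into a fixed-width
// 10-byte Text key (big-endian, zero-padded) and append it to a sequence
// file together with its payload.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteScaledKeys {

  // Big-endian packing keeps the raw unsigned-byte order identical to the
  // numeric order (for non-negative values; negatives would need a bias).
  static byte[] toFixedWidthKey(long scaled) {
    byte[] key = new byte[10];
    for (int i = 9; i >= 2; i--) {
      key[i] = (byte) (scaled & 0xFF);
      scaled >>>= 8;
    }
    return key; // key[0] and key[1] stay zero as padding
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("scaled-input/part-00000"); // hypothetical path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      long scaled = Math.round(123.45 * 100); // 2-decimal precision -> 12345
      writer.append(new Text(toFixedWidthKey(scaled)), new Text("payload1"));
    } finally {
      writer.close();
    }
  }
}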

Since TeraSort was created specifically for benchmarking purposes, I
think it might make sense for you to start with approach (2).  Just
create a SequenceFile<DoubleWritable,Text> file with your <key,value> data
and do a simple MR job with an identity mapper and identity reducer.  I can
send you an example of MR code, but there are plenty out
there <http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>.
One of them is TeraSort.java:run() itself, but you may want to use the new
mapreduce API <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html>.
Once you are comfortable with the MR framework, you can optimize it further.
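
If it helps, a minimal sketch of such an identity job with the new API might look like this; the class name, paths, and the single-reducer setting are my assumptions (with more than one reducer and the default partitioner, output is only sorted per reducer):

// Identity map/reduce over SequenceFile<DoubleWritable,Text> input.
// The base Mapper and Reducer classes pass records through unchanged,
// so the framework's shuffle/sort does all the work.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class DoubleSort {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "double-sort");
    job.setJarByClass(DoubleSort.class);

    job.setMapperClass(Mapper.class);       // identity
    job.setReducerClass(Reducer.class);     // identity
    job.setNumReduceTasks(1);               // one reducer => globally sorted output

    job.setOutputKeyClass(DoubleWritable.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path("input"));      // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("output"));   // hypothetical
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}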

Another good source of information is Tom White's 'Hadoop: The Definitive
Guide', particularly on the TotalOrderPartitioner.
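
In case it helps with the partitioner errors you hit, here is a rough sketch of how the old-API (org.apache.hadoop.mapred.lib) TotalOrderPartitioner and InputSampler could be wired together in 0.20.2; the class name, paths, reducer count, and sampling parameters are all assumptions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalOrderDoubleSort {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TotalOrderDoubleSort.class);
    conf.setJobName("total-order-double-sort");

    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(DoubleWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(4);                                 // assumed

    FileInputFormat.setInputPaths(conf, new Path("input"));    // hypothetical
    FileOutputFormat.setOutputPath(conf, new Path("output"));  // hypothetical

    // Sample the input and write the cut points the partitioner will read.
    // The partition file must be set before writePartitionFile is called.
    TotalOrderPartitioner.setPartitionFile(conf, new Path("input/_partitions"));
    InputSampler.Sampler<DoubleWritable, Text> sampler =
        new InputSampler.RandomSampler<DoubleWritable, Text>(0.01, 1000, 10);
    InputSampler.writePartitionFile(conf, sampler);
    conf.setPartitionerClass(TotalOrderPartitioner.class);

    JobClient.runJob(conf);
  }
}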

Let me know if you have any further questions.

Alex K

On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas <[email protected]> wrote:

Hi Alex,

Thank you again.
Yes, I'm also thinking of your first suggestion, but that would only help
me 'reduce' the problem from floating-point numbers to integers. I still do
not know how to use TeraSort for integer keys!

I've tried to use the generic TotalOrderPartitioner instead of the one
nested in the TeraSort class, but I received a lot of errors [0]. I also
tried to modify TeraInputFormat and TeraOutputFormat (and all nested
classes) and kept getting errors.

Now, it's not clear to me what I have to change to make your second
solution work. Moreover, I was unable to find a generic MR example in my
Hadoop 0.20.2 version.
I'd prefer the first solution, so can you please give me some tips on how
to use TeraSort for integers?

P.S.: I made a trick using fixed-length char keys, and the program worked
for that kind of workload [1]. I think using integer keys instead of this
trick would be faster.

[0] java.io.IOException: wrong key class:
org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text

[1] it worked for this:
0000123.45 payload1
0005120.55 payload2
0000003.77 payload3
...
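
For illustration, the fixed-width keys above could be produced roughly like this (the width, precision, and class name are guesses from the sample data; negative values would need different handling):

import java.util.Locale;

public class FixedWidthKey {
  // Zero-pad to 10 characters with 2 decimal places, e.g. 123.45 -> "0000123.45".
  // Note: this simple padding only orders non-negative values correctly.
  static String toKey(double value) {
    return String.format(Locale.US, "%010.2f", value);
  }

  public static void main(String[] args) {
    System.out.println(toKey(123.45));   // 0000123.45
    System.out.println(toKey(5120.55));  // 0005120.55
    System.out.println(toKey(3.77));     // 0000003.77
  }
}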

Best,
Teodor


On 08/02/2010 07:41 PM, Alex Kozlov wrote:

Hi Teodor,

I see the problem now: there is no simple binary comparator for
DoubleWritable <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
So you can do one of two things:

1. Convert your doubles to ints (or bytes): say, if the precision is always
2 decimal places, represent the number as 100 x the double. The problem is
then reduced to sorting integers (a sketch follows below).

2. Use DoubleWritable as the key and the payload as the value. You can use
the generic TotalOrderPartitioner <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
which does not use tries. You can also just use a generic MR job with
DoubleWritable keys: MR will sort the keys for you with an identity mapper
and an identity reducer.

Option 2 is slightly less efficient since the code will need to call
Double.longBitsToDouble each time, but I don't see an easy way to avoid
this with the IEEE 754 encoding.
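
For illustration, option 1 might look roughly like this (the 2-decimal precision, the 100x scale factor, and the class name are assumptions):

import org.apache.hadoop.io.LongWritable;

public class ScaledKey {
  // Scale a fixed-precision double into a long so that ordinary integer
  // comparison (and LongWritable's comparator) reproduces the numeric order.
  static LongWritable toScaledKey(double value) {
    return new LongWritable(Math.round(value * 100));
  }

  public static void main(String[] args) {
    System.out.println(toScaledKey(123.45));  // 12345
    System.out.println(toScaledKey(-34.56));  // -3456
  }
}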

Alex K

On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <[email protected]> wrote:


Hi Alex,

Thank you for your quick reply, and sorry for not being clear.
The job I want to do is simple: sort data that has numbers [doubles] as
keys [0]. I noticed that TeraSort uses a 10-byte char key. How can I use
this for my particular job?
Do I need to change TeraSort?

[0] example of workload:
123.45    payload1
-34.56     payload2
752.10    payload3
10.25      payload4
....

Does this make sense now?

Regards,
Teodor


On 08/02/2010 12:14 AM, Alex Kozlov wrote:



Hi Teodor,

I am not clear on what you call 'real numbers'.  TeraSort does work on
bytes (a 10-byte key and a 90-byte payload).  The actual 'meaning' of the
bytes really does not matter, as Hadoop uses binary comparators on the raw
values.

Total order partitioning should also work with any WritableComparable key
(if it doesn't, it's a bug).

My guess is that your problem is converting the char trie to a
WritableComparable.  Can you provide more background?  Are the strings of
fixed length?

Alex K

On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <[email protected]> wrote:

Hi all,


I am using Hadoop 0.20.2 and I want to sort a huge amount of data. I've
read about TeraSort [from the examples], but it currently uses 10-byte char
keys. Changing the keys from char to integer wasn't a good solution, as
TeraSort builds a trie for creating the total order partitions. I got stuck
when I tried to change the char trie to one suitable for numeric keys.

Then I gave Sort [also from the examples] a try, and it did work for
integer keys, but without total order partitioning. At the end of the day,
the final result cannot be created just by concatenating all the reducers'
outputs: each reducer sorts only a subset of the data, and no merging
occurs between two reducers.

Can anyone please advise me on what to use, and how, in order to sort a
huge amount of real numbers?
Looking forward to your replies.


Thank you.
Best,
Teodor