Hi Ondrej,

On 30.03.2012 14:30, Ondřej Klimpera wrote:
And one more question: is it even possible to add a MapFile (as it
consists of an index file and a data file) to the Distributed Cache?
Thanks

Should be no problem; they are just two files.
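
For example, something along these lines should do it (an untested
sketch; the path is just a placeholder):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheMapFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A MapFile is a directory holding a "data" and an "index"
        // file, so add both parts to the cache
        // ("/user/ondrej/mymapfile" is only an example path):
        DistributedCache.addCacheFile(
                new URI("/user/ondrej/mymapfile/data"), conf);
        DistributedCache.addCacheFile(
                new URI("/user/ondrej/mymapfile/index"), conf);
        // ... then configure and submit the job with this conf.
    }
}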

On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:
Hello,

I'm not sure what you mean by using the MapReduce setup() method.

"If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job."

Can you please explain a little bit more?


Check the javadocs [1]: setup() is called once per task, so you can read the file from HDFS there or perform other initialization.

[1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html

Reading 20 MB into RAM should not be a problem, and it is preferable if you need to make many requests against that data. It really depends on your use case, so think carefully or just go ahead and test it.
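
Something like this is what I mean (untested sketch; the path, the
key/value types, and the tab-separated line format are just
assumptions for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // setup() runs once per task, so the file is read only once.
        // "/user/ondrej/lookup.txt" is a placeholder path.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader in = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/user/ondrej/lookup.txt"))));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Lookups now hit the in-memory map instead of HDFS.
        String hit = lookup.get(value.toString());
        if (hit != null) {
            context.write(value, new Text(hit));
        }
    }
}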


Thanks


On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:
Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:
Hello,

I have a MapFile as the product of a MapReduce job, and what I need to
do is:

1. If the MapReduce job produced multiple splits as output, merge them
into a single file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed Cache file for another MapReduce job.

I'm wondering whether it is even possible to merge MapFiles, given
their nature, and use them as Distributed Cache files.

A MapFile is actually two files [1]: a SequenceFile (with sorted keys)
and a small index for that file. The MapFile reader does a form of
binary search over the index to find your key and then performs a
seek() to the right byte offset in the data file.

What I'm trying to achieve is repeated, fast lookups in this file
during another MapReduce job.
If my idea is completely wrong, can you give me any tip on how to do
it?

The file is expected to be about 20 MB in size.
I'm using Hadoop 0.20.203.

If the file is that small, you could load it all into memory to avoid
network IO. Do that in the setup() method of the MapReduce job.

The distributed cache also uses HDFS [2], so I don't think it will
give you any extra benefit.

Thanks for your reply :)

Ondrej Klimpera

[1]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html

[2]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html





--
Ioan Eugen Stan
http://ieugen.blogspot.com
