Hi Ondrej,

On 02.04.2012 13:00, Ondřej Klimpera wrote:
Ok, thanks.

I missed the setup() method because I am using an older version of Hadoop, so I
suppose that the configure() method does the same in Hadoop 0.20.203.

Aha, if it's possible, try upgrading. I don't know how good the support is for versions older than the Hadoop 0.20 branch.

Now I'm able to load a map file inside the configure() method into a
MapFile.Reader instance held in a private class variable, and all works fine.
I'm just wondering if the MapFile is replicated on HDFS and the data is read
locally, or if reading from this file will increase network
traffic because its data is fetched from another node in the
Hadoop cluster.


You could use a method-local variable instead of a private class field if you load the file there. If the MapFile is written to HDFS then yes, it is replicated, and you can configure the replication factor at file creation (and maybe later). If you use DistributedCache, the tasks do not read the files from HDFS directly; the files are copied to the mapred.local.dir [1] folder on every node that runs a task. The size of that folder is configurable, so it's possible that the data will still be available there for the next MR job, but don't rely on this.

Please read the docs, I may get things wrong. RTFM will save your life ;).

[1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
[2] https://forums.aws.amazon.com/message.jspa?messageID=152538

Hopefully the last question to bother you with is whether reading files from
DistributedCache (a normal text file) is limited to a particular job.
Before running a job I add a file to DistCache. When getting the file in the
Reducer implementation, can it access DistCache files from other jobs?
In other words, what will this command list:

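//Reducer impl.
public void configure(JobConf job) {
    try {
        // getCacheFiles() declares IOException, which configure() cannot throw
        URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}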

Will the distCacheFileUris variable contain only the URIs for this job, or
those for any job running on the Hadoop cluster?

Hope it's understandable.
Thanks.


It's
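
As far as I understand it, the list is per job: DistributedCache.addCacheFile() records the URI in the configuration of the job you pass in, and getCacheFiles(job) only reads that same configuration back, so it should not see files added by other jobs. The sketch below is a toy model of that behaviour in plain Java, not Hadoop's own code. The class CacheScopeSketch, its two helper methods and the example URIs are made up for illustration, and the property name "mapred.cache.files" is what I recall 0.20.x using, so please check it against your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Toy model only: each Properties object stands in for one job's JobConf.
public class CacheScopeSketch {
    // Assumed name of the property that holds the cache file list in 0.20.x.
    static final String CACHE_FILES_KEY = "mapred.cache.files";

    // Hypothetical stand-in for addCacheFile(): appends the URI to a
    // comma-separated list stored in this job's configuration only.
    static void addCacheFile(String uri, Properties jobConf) {
        String existing = jobConf.getProperty(CACHE_FILES_KEY);
        jobConf.setProperty(CACHE_FILES_KEY,
                existing == null ? uri : existing + "," + uri);
    }

    // Hypothetical stand-in for getCacheFiles(): reads the list back from
    // the configuration it is given and from nowhere else.
    static List<String> getCacheFiles(Properties jobConf) {
        String value = jobConf.getProperty(CACHE_FILES_KEY);
        if (value == null) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(Arrays.asList(value.split(",")));
    }

    public static void main(String[] args) {
        Properties jobA = new Properties();
        Properties jobB = new Properties();
        addCacheFile("hdfs:///cache/a.txt", jobA);
        addCacheFile("hdfs:///cache/b.txt", jobB);

        // Each job lists only the file that was added to its own configuration.
        System.out.println("job A: " + getCacheFiles(jobA));
        System.out.println("job B: " + getCacheFiles(jobB));
    }
}
```

The localized copies may still sit in the same local directory on a node, but what a task gets back from the call is driven by its own job configuration.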

--
Ioan Eugen Stan
http://ieugen.blogspot.com
