Hi Ondrej,
On 02.04.2012 13:00, Ondřej Klimpera wrote:
Ok, thanks.
I missed the setup() method because I'm using an older version of Hadoop,
so I suppose the configure() method does the same thing in Hadoop 0.20.203.
Aha, if possible, try upgrading. I don't know how good support is for
versions older than the Hadoop 0.20 branch.
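For what it's worth, in the old (mapred) API configure(JobConf) is indeed
the per-task init hook, called once before any records are processed. A
minimal sketch (the class name is made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  @Override
  public void configure(JobConf job) {
    // one-time per-task initialization (open readers, read job parameters, ...)
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // per-record work goes here
  }
}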
Now I'm able to load a map file inside the configure() method into a
MapFile.Reader instance held as a private class variable, and all works
fine. I'm just wondering whether the MapFile is replicated on HDFS and its
data is read locally, or whether reading from this file will increase
network traffic by fetching its data from another node in the Hadoop
cluster.
You could use a method-local variable instead of a private field if you load
the file. If the MapFile is written to HDFS then yes, it is replicated, and
you can configure the replication factor at file creation (and possibly
later). If you use DistributedCache then the files are not written to
HDFS, but to the mapred.local.dir [1] folder on every node.
The folder size is configurable, so it's possible that the data will still
be available there for the next MR job, but don't rely on this.
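To make the two options concrete, here is a rough sketch; the paths, class
name and replication factor are made up, so check them against your setup:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheSetup.class);
    FileSystem fs = FileSystem.get(conf);

    // Option 1: keep the MapFile on HDFS and raise its replication factor.
    // A MapFile is a directory holding a "data" and an "index" file, so the
    // replication is set on both; tasks still read blocks over the network
    // unless a replica happens to sit on the same node.
    fs.setReplication(new Path("/data/lookup.map/data"), (short) 10);
    fs.setReplication(new Path("/data/lookup.map/index"), (short) 10);

    // Option 2: ship a plain file via DistributedCache; it is copied once
    // to mapred.local.dir on every task node and read from local disk.
    DistributedCache.addCacheFile(new URI("/data/lookup.txt"), conf);
  }
}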
Please read the docs, I may be getting things wrong. RTFM will save your life ;).
[1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
[2] https://forums.aws.amazon.com/message.jspa?messageID=152538
Hopefully the last question to bother you with: is reading files from the
DistributedCache (a normal text file) limited to a particular job?
Before running a job I add a file to the DistributedCache. When getting the
file in the Reducer implementation, can it access DistributedCache files
from other jobs? In other words, what will this code list:
// Reducer impl.
public void configure(JobConf job) {
    try {
        URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
    } catch (IOException e) {
        throw new RuntimeException(e); // getCacheFiles() throws IOException
    }
}
Will the distCacheFileUris variable contain only the URIs for this job, or
those for any job running on the Hadoop cluster?
Hope it's understandable.
Thanks.
It's per-job as far as I know: DistributedCache.getCacheFiles() just reads
the URIs stored in that job's configuration (mapred.cache.files), so you
will only see the files added for the job whose JobConf you pass in.
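A quick sketch of the round trip (the file name is made up):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheScope {
  public static void main(String[] args) throws IOException {
    // Driver side, before submitting job A:
    JobConf conf = new JobConf(CacheScope.class);
    DistributedCache.addCacheFile(URI.create("/user/ondrej/lookup.txt"), conf);

    // Task side (e.g. in configure()), given job A's JobConf:
    URI[] uris = DistributedCache.getCacheFiles(conf);        // only job A's URIs
    Path[] local = DistributedCache.getLocalCacheFiles(conf); // localized copies to open

    // Note: getLocalCacheFiles() only returns paths once the framework has
    // localized the files on the task node, i.e. inside a running task.
  }
}

Since the URIs live in the job's own configuration, a reducer never sees
cache entries registered by other jobs, even if their local copies still
happen to sit in mapred.local.dir.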
--
Ioan Eugen Stan
http://ieugen.blogspot.com