Hey Doug,
I'm finally picking this up again, and I believe I have everything configured as you suggested - I'm running dfs at the indexing datacenter, along with a job tracker/namenode/task trackers. I'm running a job tracker/namenode/task trackers at the crawling datacenter, and it's all pointed to use the dfs at the indexing datacenter.
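
For reference, the crawl-side overrides boil down to these two properties (hostnames and ports below are just placeholders, and in my install they live in hadoop-site.xml - adjust for wherever your overrides go):

  <property>
    <name>fs.default.name</name>
    <value>index-namenode:9000</value>    <!-- namenode at the indexing datacenter -->
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>crawl-jobtracker:9001</value>  <!-- job tracker at the crawling datacenter -->
  </property>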

If I have mapreduce set to run locally, everything seems to be working fine. But if I set mapred.job.tracker to the job tracker that's running in the crawl datacenter, I get this when trying to run a fetch:

060430 155743 Waiting to find target node: crawl01/XX.XX.XX.XX:50010

For some reason, even though I'm specifying fs.default.name as my other datacenter, this server is also trying to make itself a dfs node. I was experimenting with some of the shell scripts, and it seems that in any configuration where this server is a task tracker, it also tries to be a dfs node. Am I missing something, or is there no way to make this server only a task tracker and not also a dfs node?
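
(Concretely, what I'm after is being able to bring up just the task tracker on these boxes - something along the lines of

  bin/hadoop-daemon.sh start tasktracker   # or the equivalent nutch-daemon.sh in older trees

without a datanode starting alongside it. I may just be misreading the daemon scripts, so correct me if that's already possible.)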

Let me know if this makes sense; it's even a little confusing for me, and I set it up :)

Thanks!

Jason


Doug Cutting wrote:

Jason Camp wrote:

Unfortunately in our scenario, bandwidth is cheap at our fetching datacenter,
but adding additional disk capacity is expensive - so we are fetching
the data and sending it back to another cluster (by exporting segments
from ndfs, copying, and importing).


But to perform the copies, you're using a lot of bandwidth to your "indexing" datacenter, no? Copying segments probably takes almost as much bandwidth as fetching them...

I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster, and define
which servers I want to perform which tasks, so for instance we could
use one group of servers to fetch the data, but the other group of
servers to store the data and perform the indexing/etc. If there's a
better way to do something like this than what we're doing, or if you
think we're just insane for doing it this way, please let me know :) Thanks!


You can use different sets of machines for dfs and MapReduce by starting them in differently configured installations. So you could run dfs only in your "indexing" datacenter, and MapReduce in both datacenters configured to talk to the same dfs, at the "indexing" datacenter. Then your fetch tasks at the "fetch" datacenter would write their output to the "indexing" datacenter's dfs. And parse/updatedb/generate/index/etc. could all run at the other datacenter. Does that make sense?

Doug
