Hey Doug,
I'm finally picking this up again, and I believe I have everything configured as you suggested - I'm running dfs at the indexing datacenter, along with a job tracker/namenode/task trackers. I'm running a job tracker/namenode/task trackers at the crawling datacenter, and it's all pointed to use the dfs at the indexing datacenter.
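
For reference, the crawl-side overrides boil down to these two properties (hostnames and ports below are just placeholders, and in my install they live in hadoop-site.xml - adjust for wherever your overrides go):

  <property>
    <name>fs.default.name</name>
    <value>index-namenode:9000</value>    <!-- namenode at the indexing datacenter -->
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>crawl-jobtracker:9001</value>  <!-- job tracker at the crawling datacenter -->
  </property>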

If I have mapreduce set to run locally, everything seems to be working fine. But if I set mapred.job.tracker to the job tracker that's running in the crawl datacenter, I get this when trying to run a fetch:

060430 155743 Waiting to find target node: crawl01/XX.XX.XX.XX:50010

For some reason, even though I'm specifying fs.default.name as my other datacenter, this server is also trying to make itself a dfs node. I was experimenting with some of the shell scripts, and it seems that in any configuration where this server is a task tracker, it also tries to be a dfs node. Am I missing something, or is there no way to make this server only a task tracker and not also a dfs node?
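
(Concretely, what I'm after is being able to bring up just the task tracker on these boxes - something along the lines of

  bin/hadoop-daemon.sh start tasktracker   # or the equivalent nutch-daemon.sh in older trees

without a datanode starting alongside it. I may just be misreading the daemon scripts, so correct me if that's already possible.)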

Let me know if this makes sense; it's even a little confusing for me, and I set it up :)

Thanks!

Jason


Doug Cutting wrote:

Jason Camp wrote:

Unfortunately in our scenario, bandwidth is cheap at our fetching datacenter,
but adding additional disk capacity is expensive - so we are fetching
the data and sending it back to another cluster (by exporting segments
from ndfs, copying, and importing).


But to perform the copies, you're using a lot of bandwidth to your "indexing" datacenter, no? Copying segments probably takes almost as much bandwidth as fetching them...

I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster, and define
which servers I want to perform which tasks, so for instance we could
use one group of servers to fetch the data, but the other group of
servers to store the data and perform the indexing/etc. If there's a
better way to do something like this than what we're doing, or if you
think we're just insane for doing it this way, please let me know :) Thanks!


You can use different sets of machines for dfs and MapReduce by starting them in differently configured installations. So you could run dfs only in your "indexing" datacenter, and MapReduce in both datacenters configured to talk to the same dfs, at the "indexing" datacenter. Then your fetch tasks at the "fetch" datacenter would write their output to the "indexing" datacenter's dfs. And parse/updatedb/generate/index/etc. could all run at the other datacenter. Does that make sense?

Doug
