There's one problem with this in my environment: I'm basically
running two datacenters with different uses. Hopefully this won't sound
too confusing :)  We're using one datacenter, with a cluster of Hadoop
servers, to do the fetching, and another datacenter, with a second
Hadoop cluster, to store the content and make it available for
searching.  My understanding is that when you make servers into slaves
for a cluster, there's no way to define specifically what they do -
i.e., these servers just do fetching, these servers just do indexing.
Is that correct?  Unfortunately in our scenario, bandwidth is cheap at
the fetching datacenter but adding disk capacity there is expensive, so
we fetch the data and send it back to the other cluster (by exporting
the segments from NDFS, copying them over, and importing them).
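
    (Concretely, that export/copy/import step looks roughly like the
following - the segment name, host, and paths are just placeholders,
not our real ones:)

    # on the fetch cluster's master: pull a finished segment out of DFS
    # to local disk (segment name is a made-up example)
    bin/hadoop dfs -get segments/20060201123456 /tmp/segments/20060201123456

    # ship it across to the other datacenter (host and path are made up)
    scp -r /tmp/segments/20060201123456 index-master:/tmp/segments/

    # on the indexing cluster's master: push it into that cluster's DFS
    bin/hadoop dfs -put /tmp/segments/20060201123456 segments/20060201123456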
    I know this sounds a bit messy, but it was the only way we could
come up with to get the benefits of both datacenters. Ideally, I'd love
to have all of the servers in one cluster and define which servers
perform which tasks - so, for instance, one group of servers would
fetch the data while the other group stored the data and did the
indexing/etc. (Our current two-cluster workaround is sketched below.)
If there's a better way to do something like this than what we're
doing, or if you think we're just insane for doing it this way, please
let me know :) Thanks!
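
    (For reference, the way we keep the two groups separate today is
just two completely independent clusters, each with its own master and
its own conf/slaves file - roughly like this, with made-up hostnames:)

    # conf/slaves on the fetch master (fetching datacenter)
    fetch-node1
    fetch-node2
    fetch-node3

    # conf/slaves on the index/search master (storage datacenter)
    index-node1
    index-node2
    index-node3

    Each cluster's conf/hadoop-site.xml then points fs.default.name and
the jobtracker setting at its own master, so jobs submitted on one side
never touch the other side's nodes.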

Jason


Doug Cutting wrote:

> Jason Camp wrote:
>
>> I'd like to generate multiple segments in a row, and send them off to
>> another server, is this possible using the local file system?
>
>
> The Hadoop-based Nutch now automates multiple, parallel fetches for
> you.  So there is less need to manually generate multiple segments. 
> Try configuring your servers as slaves (by adding them to conf/slaves)
> and configuring a master (by setting fs.default.name and
> mapred.jobtracker in conf/hadoop-site.xml) then using bin/start-all.sh
> to start daemons. Then copy your root url directory to dfs with
> something like 'bin/hadoop dfs -put roots roots'.  Then you can run a
> multi-machine crawl with 'bin/nutch crawl'.  Or if you need
> finer-grained control, you can still step through the inject,
> generate, fetch, updatedb, generate, fetch, ... cycle, except now each
> step runs across all slave nodes.
>
> This is outlined in the Hadoop javadoc:
>
> http://lucene.apache.org/hadoop/docs/api/overview-summary.html#overview_description
>
>
> Doug
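
(For anyone else following along, the sequence Doug describes works out
to roughly the following, run on the master node - I've left the
command options as placeholders rather than guess at them:)

    # after listing the slave hosts in conf/slaves and pointing
    # fs.default.name / the jobtracker setting at the master in
    # conf/hadoop-site.xml:
    bin/start-all.sh                 # start the DFS and MapReduce daemons on all nodes

    bin/hadoop dfs -put roots roots  # copy the local root url directory into DFS

    bin/nutch crawl roots            # whole-cluster crawl (add your usual depth/dir options)

    # or, for finer-grained control, step through the cycle by hand -
    # each step now runs across all slave nodes:
    bin/nutch inject ...
    bin/nutch generate ...
    bin/nutch fetch ...
    bin/nutch updatedb ...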

