I'm having a fairly serious problem here and I'd be very appreciative if
anyone could toss me some ideas.
 
I am responsible for two separate and independent websites, and recently
the decision was made that they both needed search facilities. One
wrinkle in the plan, however, was that I only had one physical server to
implement both search engines on.  I decided to go with Nutch 0.7.2, and
have two web interfaces configured in Tomcat. There are two IP addresses
configured in Apache, routing to two separate WAR file installations,
which permits two independent interfaces. The indexer is run twice,
generating two independent search databases for the parent sites, and
each of the search interfaces draws its results from the appropriate
database.
 
My problem is the crawl-urlfilter.txt file at
/nutch/conf/crawl-urlfilter.txt. Because the single crawler is crawling
both websites, the software requires that I put the masks for both URLs
into the file. This wouldn't be a problem, except that SiteB has one
single link to SiteA. Nutch is following that link, and because SiteA's
mask is in the crawl-urlfilter.txt file, the SiteB index ends up with a
complete index of BOTH websites.
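
To make the symptom concrete, here is a rough Python sketch of how I
understand the filter to behave: rules are tried in order and the first
one that matches decides whether a URL is kept. This is only a simplified
model, not Nutch's actual code, and the hostnames sitea.example and
siteb.example are placeholders for the real sites.

```python
import re

# Hypothetical masks standing in for the real crawl-urlfilter.txt entries.
# "+" means accept, "-" means reject.
COMBINED_RULES = [
    ("+", r"^http://([a-z0-9]*\.)*sitea\.example/"),
    ("+", r"^http://([a-z0-9]*\.)*siteb\.example/"),
    ("-", r"."),  # reject everything else
]

# What I would like to use when indexing SiteB only.
SITEB_ONLY_RULES = [
    ("+", r"^http://([a-z0-9]*\.)*siteb\.example/"),
    ("-", r"."),
]


def accepts(rules, url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL matching no rule is rejected


# The single link from SiteB into SiteA, met during the SiteB crawl.
link_into_site_a = "http://www.sitea.example/index.html"

print(accepts(COMBINED_RULES, link_into_site_a))
print(accepts(SITEB_ONLY_RULES, link_into_site_a))
```

With the combined rules the SiteA link is accepted during the SiteB
crawl, which matches what I am seeing. With a SiteB-only rule set it
would be rejected, which is the behaviour I am after.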
 
Is there any way to specify a different crawl-urlfilter.txt file for
each crawl? When I index SiteA, I have a handful of URL masks that I
want to have available to it. When I index SiteB, I have a different set
of URL masks that I want available there. Am I going to need two
completely separate Nutch installations?
 
Brian Hill
 
