I'm new to Nutch, but I couldn't find this in the archives or docs and it has me stumped.
I have two websites that I need to index in Nutch. I am presently running two separate crawls to index these sites, but a single link is screwing up my search results.

I have two flat files in my Nutch directory, "Domain1" and "Domain2". Each of these files contains the starting URL for one of the two sites, and the two crawls generate completely separate database folders, which are in turn called by two independent Nutch frontend installations in Tomcat.

My problem is with the crawl-urlfilter.txt file. Because this is a local search, I need to limit the domains, and the file contains these lines:

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*domain1.edu/
    +^http://([a-z0-9]*\.)*domain2.edu/

This would work perfectly EXCEPT that there is a single link on the domain1.edu site to the homepage of the domain2.edu site. Nutch is following this link, and as a result the domain1 search results are bringing up the full domain1.edu AND domain2.edu sites.

What's the best way to deal with this problem? When I run the Domain1 Nutch search, I need the results to be limited to the domain1.edu, subdomain1.domain1.edu, and subdomain2.domain1.edu websites. Likewise, if I add a reciprocal link to domain2.edu, I need users of THAT search interface to receive results only relevant to that domain.

PLEASE don't tell me I need two independent Nutch installations!

Your help is appreciated.

Brian Hill

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
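To make the behaviour described above easier to reproduce, here is a minimal Python sketch of how a filter file like this one is evaluated. It assumes the documented rule semantics (the first matching pattern decides, and a URL that matches no pattern is ignored) and that Python's `re.search` is close enough to the Java regex matching for these simple patterns. The helper names `load_rules` and `accepts`, the test URLs, and the `DOMAIN1_ONLY` rule list are hypothetical illustrations, not part of Nutch.

```python
import re

def load_rules(lines):
    """Parse filter lines into (accept, compiled_pattern) pairs.

    Blank lines and '#' comments are skipped; '+' means accept, '-' means reject.
    """
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL that matches nothing is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# The shared filter from the post: both domains are accepted by every crawl.
SHARED = load_rules([
    "# accept hosts in MY.DOMAIN.NAME",
    r"+^http://([a-z0-9]*\.)*domain1.edu/",
    r"+^http://([a-z0-9]*\.)*domain2.edu/",
])

# Hypothetical filter that lists only domain1.edu, shown for comparison.
DOMAIN1_ONLY = load_rules([
    r"+^http://([a-z0-9]*\.)*domain1.edu/",
])

if __name__ == "__main__":
    for url in ("http://subdomain1.domain1.edu/page.html",
                "http://www.domain2.edu/"):
        print(url, accepts(SHARED, url), accepts(DOMAIN1_ONLY, url))
```

With the shared rules, the link from domain1.edu to the domain2.edu homepage passes the filter, which matches what the Domain1 crawl is doing. A rule list that names only domain1.edu rejects it, but that only shows the filter logic; it does not say how to make a single Nutch installation use a different filter per crawl.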
