Thanks. I put the following in nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

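My reading of that description, as a rough Python sketch: an outlink is kept only if its host matches the host of the page it was found on. This only illustrates the behaviour the description states; it is not Nutch's actual code, and the function names host_of and keep_internal_outlinks are made up for the example.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host part of a URL ('' if there is none)."""
    return (urlparse(url).hostname or "").lower()


def keep_internal_outlinks(page_url, outlinks):
    """Drop outlinks whose host differs from the host of the page they are on."""
    page_host = host_of(page_url)
    return [link for link in outlinks if host_of(link) == page_host]


# Hypothetical example: a page on bcc.com linking to itself and to ccc.com.
outlinks = [
    "http://www.bcc.com/about.html",
    "http://www.ccc.com/hello.html",
]
kept = keep_internal_outlinks("http://www.bcc.com/index.html", outlinks)
print(kept)  # ['http://www.bcc.com/about.html']
```

Note that this sketch compares hosts exactly, so www.abc.com and abc.com count as different hosts here.
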
But somehow, when I do a search in Nutch, it still returns results from other sites. Why?


Vishal Shah-3 wrote:
>
> Hi Victor,
>
> In this case, the link analysis will be done only on the link graph
> between the URLs belonging to the hosts in your seed lists that you
> fetch. As you said, this might not give you a true idea of the link
> popularities of your URLs. On the other hand, if you set
> db.ignore.external.links to false, you will be crawling URLs outside
> your seed hosts, and it would be difficult for you to control the crawl.
> Since you are only interested in your seed list hosts, I would still
> recommend setting db.ignore.external.links to true to limit your crawl
> to your target hosts and, if required, change the scoring algorithm.
>
> How big is your set of seed lists? Do you know how well these hosts
> link to each other? If the seed hosts belong to the same domain, for
> example, there might be high interlinking between URLs from these hosts,
> and you should see decent results with the default ranker.
>
> -vishal.
>
> -----Original Message-----
> From: victor_emailbox [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 01, 2006 12:51 PM
> To: [email protected]
> Subject: RE: How to Make Nutch Return Search Results Belonged to the Crawl URL Li
>
>
> Thanks. But if I set db.ignore.external.links to true, will it affect
> the quality of the search results? I read about Nutch, and it seems that
> it does something similar to PageRank, like Google. If so, it will affect
> the quality of the search if it doesn't analyze the external links.
>
>
> Vishal Shah-3 wrote:
>>
>> Hello Victor,
>>
>> If I understand correctly, you want to use a seed list that contains
>> some sites, and then do an internal search only on pages belonging to
>> these sites. In this case, it's best not to crawl pages from other
>> sites. This can be done by setting db.ignore.external.links to true in
>> your nutch-site.xml. This will ensure that your crawl is limited to
>> pages from initially injected hosts.
>>
>> Regards,
>>
>> -vishal.
>>
>> -----Original Message-----
>> From: victor_emailbox [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, August 31, 2006 10:51 AM
>> To: [email protected]
>> Subject: Re: How to Make Nutch Return Search Results Belonged to the Crawl URL Li
>>
>>
>> No, I meant if the crawling url lists have http://www.abc.com and
>> http://www.bcc.com, and both urls contain the term "hello". bcc.com
>> also has a link that references ccc.com, which also contains the term
>> "hello" but is not part of the crawling url lists.
>>
>> So when I do a search on "hello", will Nutch return abc.com, bcc.com and
>> ccc.com by default? If so, how do I force Nutch to return only abc.com
>> and bcc.com, without ccc.com?
>>
>> Thanks.
>>
>>
>> Zaheed Haque wrote:
>>>
>>> Hi
>>>
>>> You mean show results from a site http://abc.com only. If so you need
>>> to turn on your index-more and query-more plugins in nutch-site.xml
>>> then you need to use query like site:http://abc.com +query term or
>>> url: .. I think it's site, not sure.
>>>
>>> Cheers
>>>
>>> On 8/31/06, victor_emailbox <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hi,
>>>> I enter 10 urls in the url crawling list. Nutch does its thing to
>>>> fetch and index them. How do I force Nutch to return search results
>>>> that belong to the url list? e.g. if the url crawling list has only
>>>> http://www.abc.com and http://www.bcc.com, then all search results
>>>> should be under either abc.com or bcc.com, not ccc.com, even if
>>>> bcc.com contains links referring to ccc.com.
>>>>
>>>> Many thanks.
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6072986
>>>> Sent from the Nutch - User forum at Nabble.com.

--
View this message in context: http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6212829
Sent from the Nutch - User forum at Nabble.com.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
