Re: [Nutch-general] How to Make Nutch Return Search Results Belonged to the Crawl URL Li

victor_emailbox Fri, 08 Sep 2006 09:53:51 -0700

Thanks.
I put the following in nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>


But somehow, when I do a search in Nutch, it still returns results from
other sites.  Why?
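
One possible explanation, stated as an assumption rather than a confirmed
diagnosis: going by the description above, the property only drops outlinks
that point to a different host than the page they were found on, so it
affects which new URLs get added from then on. Pages from other sites that
were fetched and indexed before the setting was changed would presumably
stay in the index until the crawl and index are rebuilt. The Python sketch
below only models that filtering rule; the function names are made up for
illustration and are not Nutch code.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def keep_internal_outlinks(page_url: str, outlinks: list[str]) -> list[str]:
    """Keep only outlinks whose host equals the host of the linking page.

    Illustrative model of "ignore external links"; not Nutch's own code.
    """
    from_host = host_of(page_url)
    return [
        link for link in outlinks
        if from_host and host_of(link) == from_host
    ]


page = "http://www.bcc.com/index.html"
links = ["http://www.bcc.com/about.html", "http://www.ccc.com/hello.html"]
print(keep_internal_outlinks(page, links))
# prints ['http://www.bcc.com/about.html']
```

Note that the rule only decides which links are followed. It says nothing
about documents that are already in the index.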


Vishal Shah-3 wrote:
> 
> Hi Victor,
> 
>   In this case, the link analysis will be done only on the link graph
> between the URLs belonging to the hosts in your seed lists that you
> fetch. As you said, this might not give you a true idea of the link
> popularity of your URLs. On the other hand, if you set
> db.ignore.external.links to false, you will be crawling URLs outside
> your seed hosts, and it would be difficult for you to control the crawl.
> Since you are only interested in your seed list hosts, I would still
> recommend setting db.ignore.external.links to true to limit your crawl
> to your target hosts and if required change the scoring algorithm. 
> 
>   How big is your set of seed lists? Do you know how well these hosts
> link to each other? If, for example, the seed hosts belong to the same
> domain, there might be high interlinking between URLs from these hosts,
> and you should see decent results with the default ranker.
> 
> -vishal.
> 
> -----Original Message-----
> From: victor_emailbox [mailto:[EMAIL PROTECTED] 
> Sent: Friday, September 01, 2006 12:51 PM
> To: [email protected]
> Subject: RE: How to Make Nutch Return Search Results Belonged to the
> Crawl URL Li
> 
> 
> Thanks.  But if I set db.ignore.external.links to true, will it affect
> the quality of the search results?  I read about Nutch, and it seems
> that it does something similar to Google's PageRank.  If so, it will
> affect the quality of the search if it doesn't analyze the external
> links.
> 
> 
> Vishal Shah-3 wrote:
>> 
>> Hello Victor,
>> 
>>   If I understand correctly, you want to use a seed list that contains
>> some sites, and then do an internal search only on pages belonging to
>> these sites. In this case, it's best not to crawl pages from other
>> sites. This can be done by setting db.ignore.external.links to true in
>> your nutch-site.xml. This will ensure that your crawl is limited to
>> pages from the initially injected hosts.
>> 
>> Regards,
>> 
>> -vishal.
>> 
>> -----Original Message-----
>> From: victor_emailbox [mailto:[EMAIL PROTECTED] 
>> Sent: Thursday, August 31, 2006 10:51 AM
>> To: [email protected]
>> Subject: Re: How to Make Nutch Return Search Results Belonged to the
>> Crawl URL Li
>> 
>> 
>> No, I meant: suppose the crawl URL list has http://www.abc.com and
>> http://www.bcc.com, and both sites contain the term "hello".  bcc.com
>> also has a link to ccc.com, which also contains the term "hello" but
>> is not part of the crawl URL list.
>> 
>> So when I search for "hello", will Nutch return abc.com, bcc.com and
>> ccc.com by default?  If so, how do I force Nutch to return only abc.com
>> and bcc.com, without ccc.com?
>> 
>> Thanks.
>> 
>> 
>> Zaheed Haque wrote:
>>> 
>>> Hi
>>> 
>>> You mean show results from the site http://abc.com only? If so, you need
>>> to turn on the index-more and query-more plugins in nutch-site.xml,
>>> then use a query like  site:http://abc.com +query term  or
>>> url: ... I think it's site, but I'm not sure.
>>> 
>>> Cheers
>>> 
>>> On 8/31/06, victor_emailbox <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hi,
>>>>   I entered 10 URLs in the crawl URL list.  Nutch does its thing to
>>>> fetch and index them.  How do I force Nutch to return only search
>>>> results that belong to the URL list?  E.g. if the crawl URL list has
>>>> only http://www.abc.com and http://www.bcc.com, then all search
>>>> results should be under either abc.com or bcc.com, not ccc.com, even
>>>> if bcc.com contains links referring to ccc.com.
>>>>
>>>> Many thanks.
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6072986
>>>> Sent from the Nutch - User forum at Nabble.com.
>>>>
>>>>
>>> 
>>> 
>> 
>> -- 
>> View this message in context:
>> http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6073242
>> Sent from the Nutch - User forum at Nabble.com.
>> 
>> 
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6093923
> Sent from the Nutch - User forum at Nabble.com.
> 
> 
> 
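
As a stopgap on the search side, results could also be filtered against the
hosts of the seed list after the fact. The sketch below is plain Python over
a hypothetical list of hit URLs, not a Nutch API; SEED_URLS, seed_hosts and
filter_hits are illustrative names. It matches hosts exactly, so www.abc.com
and abc.com count as different hosts.

```python
from urllib.parse import urlsplit

# Seed list from the example in this thread.
SEED_URLS = ["http://www.abc.com", "http://www.bcc.com"]


def seed_hosts(seed_urls):
    """Collect the set of hosts named in the seed URL list."""
    hosts = set()
    for url in seed_urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def filter_hits(hit_urls, allowed_hosts):
    """Keep only hits whose host is one of the allowed seed hosts."""
    return [u for u in hit_urls if urlsplit(u).hostname in allowed_hosts]


# Hypothetical search hits for the term "hello".
hits = [
    "http://www.abc.com/a.html",
    "http://www.ccc.com/hello.html",
    "http://www.bcc.com/b.html",
]
print(filter_hits(hits, seed_hosts(SEED_URLS)))
# prints ['http://www.abc.com/a.html', 'http://www.bcc.com/b.html']
```

Filtering after the search hides unwanted hits but still leaves them in the
index, so limiting the crawl itself remains the cleaner fix.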

-- 
View this message in context: 
http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6212829
Sent from the Nutch - User forum at Nabble.com.


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
