Thanks, Rob. Your solution works when a JSP page returns HTML
content, but it doesn't work when a servlet returns a PDF file.

-----Original Message-----
From: Rob Pettengill [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 12, 2005 4:27 AM
To: [email protected]
Subject: Re: it seems that nutch ignores url which has query string

It's probably worth reading through all the files in the conf
directory to get an idea of what the default settings are and what
adjustments can be made there.

URLs with "?" in them indicate that the pages are generated in  
response to the passed parameters.  In many cases these active pages  
are not search friendly.  The link may have side effects (e.g.,  
placing an order) that you don't want search to trigger or it may  
lead to a search "black hole" that generates an infinite number of  
links (e.g., a tomorrow link in a web calendar).

That is why "?" is included in one of the default exclusion rules in  
the conf/regex-urlfilter.txt file:

   # skip URLs containing certain characters as probable queries, etc.
   [EMAIL PROTECTED]
Guan Yu,

If you take this rule out entirely you will probably be sorry.  A
better approach might be to precede this line with exceptions that
you are sure will cause no problems.  For example, I know one site
that adds a gratuitous "?" to the end of every asp URL (I guess they
are trying to hide from potential customers who use search engines
:-).  I can tell Nutch that it is OK to index "?" URLs from this site
by adding the following line in front of the pattern that skips "?"
URLs:

   #exceptions to skip rule
   +search.unfriendly.site.com/.*\.asp\?$
   # skip URLs containing certain characters as probable queries, etc.
   [EMAIL PROTECTED]

The same technique can also be used to make exceptions to the other  
rules, for example to index .pdf files only from sites in a certain  
domain.
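
To see why the order of the lines matters, here is a rough sketch in
Python of first-match-wins filtering. It is not Nutch's actual code.
It assumes that each rule is a "+" or "-" followed by a regex, that a
rule applies when the regex is found anywhere in the URL, and that a
URL matching no rule is ignored. The list archive has masked the skip
pattern above, so the character class "-[?*!@=]" used here is an
assumption, as are the names parse_rules, accepts and RULES and the
final "+." catch-all.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        # True means "include", False means "skip".
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in url."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # assumption: a URL that matches no rule is ignored


# The skip pattern is an assumed stand-in for the masked line above.
RULES = parse_rules(r"""
# exceptions to skip rule
+search.unfriendly.site.com/.*\.asp\?$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else (assumed catch-all)
+.
""")

if __name__ == "__main__":
    for url in (
        "http://search.unfriendly.site.com/products/list.asp?",
        "http://www.citycab.com.sg:8003/wsf/news.jsp?id=82",
        "http://www.citycab.com.sg:8003/wsf/index.html",
    ):
        print(url, "->", "index" if accepts(RULES, url) else "skip")
```

Under these assumptions the exception only helps because it comes
before the skip rule. If the two lines were swapped, the "?" in the
asp URL would hit the skip rule first and the exception would never be
consulted.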
--
Robert C. Pettengill, Ph.D.
    [EMAIL PROTECTED]

Questions about petroleum?
     Goto:   http://AskAboutOil.com/

On 2005, Jul 10, at 9:38 PM, Guan Yu wrote:

> Hi,
>
> I'm using intranet search. There are links in my web pages like the
> following:  <a
> href="http://www.citycab.com.sg:8003/wsf/news.jsp?id=82">News</a>. It
> seems that the above link can't be found by nutch. How can I solve this
> problem?
>
> Thanks,
> Guan Yu
>
>
>
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
