Thanks, Rob. Your solution works when a JSP page returns HTML content, but it doesn't work when a servlet returns a PDF file.

-----Original Message-----
From: Rob Pettengill [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 12, 2005 4:27 AM
To: [email protected]
Subject: Re: it seems that nutch ignores url which has query string

It's probably worth reading through all the files in the conf directory to get an idea of what the default settings are and what adjustments can be made there.

URLs with "?" in them indicate that the files are generated in response to the passed parameters. In many cases these active pages are not search friendly. The link may have side effects (e.g., placing an order) that you don't want search to trigger, or it may lead to a search "black hole" that generates an infinite number of links (e.g., a tomorrow link in a web calendar). That is why "?" is included in one of the default exclusion rules in the conf/regex-urlfilter.txt file:

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

Guan Yu,

If you totally take this out you will probably be sorry. A better approach might be to precede this line with exceptions that you are sure will cause no problems. For example, I know one site that adds a gratuitous "?" to the end of every asp url (I guess they are trying to hide from potential customers who use search engines :-). I can tell nutch that it is ok to index "?" files from this site by adding the following line in front of the pattern that skips "?" URLs:

#exceptions to skip rule
+search.unfriendly.site.com/.*\.asp\?$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

The same technique can also be used to make exceptions to the other rules, for example to index .pdf files only from sites in a certain domain.

--
Robert C. Pettengill, Ph.D.
[EMAIL PROTECTED]
Questions about petroleum? Goto: http://AskAboutOil.com/

On 2005, Jul 10, at 9:38 PM, Guan Yu wrote:

> Hi,
>
> I'm using intranet search. There are links in my web pages like the
> following: <a
> href="http://www.citycab.com.sg:8003/wsf/news.jsp?id=82">News</a>. It
> seems that the above link can't be found by nutch. How to solve this
> problem?
>
> Thanks,
> Guan Yu

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
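
For anyone testing rule order before a crawl, here is a rough Python sketch of the exception technique Rob describes. It is not Nutch code. It assumes the filter applies the first rule whose pattern is found anywhere in the URL and drops URLs that match no rule. The accepts helper is hypothetical, and the "-[?]" line is only a simplified stand-in for the default skip rule, which is not reproduced in this thread.

```python
import re

# Hypothetical rule list in regex-urlfilter.txt style:
# a leading "+" includes a URL, a leading "-" excludes it.
RULES = [
    r"+search.unfriendly.site.com/.*\.asp\?$",  # exception quoted in the message
    r"-[?]",  # assumed stand-in for the default "skip probable queries" rule
    r"+.",    # assumed catch-all: accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


if __name__ == "__main__":
    for candidate in (
        "http://search.unfriendly.site.com/catalog/item.asp?",
        "http://www.citycab.com.sg:8003/wsf/news.jsp?id=82",
    ):
        print(candidate, accepts(candidate))
```

Because the exception sits in front of the skip rule, the trailing-"?" asp URL is accepted while the news.jsp?id=82 link is still rejected. Swapping the first two rules would reject both, which is why the exception has to come first.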
