[Nutch-general] Re: it seems that nutch ignores url which has query string

Rob Pettengill Mon, 11 Jul 2005 21:27:19 -0700

Guan Yu,

Sorry, I can't give you a solution, not knowing what your requirements are; these are just responses to your questions.

The decision to follow a link or not can only be based on the syntax of the link. This can be misleading or uninformative as to the content type. The Regex and Prefix URL filters make the decision whether to follow a URL. How any content that comes back is handled is the responsibility of other parts of the system.

The content that comes back when a link is followed will be tagged with a mime content type per the HTTP spec (and even this is sometimes wrong with mis-configured servers). How the content gets handled depends on the content handler plugins that you have loaded. Look at the plugin.includes property. If you want to parse and index the pdf file that is returned, then you need to load a pdf plugin with that property. You will also have to consider the http.content.limit property if you want to index other content types because they are often much larger than html files. I'm less familiar with this part of the system and don't know how robust it is for mislabeled content, but hopefully you don't have to worry about that.
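
To make the split between "fetch" and "handle" concrete, here is a toy Python model of what happens after a page comes back. It is only a sketch and not nutch code: PARSERS, handle, and the 65536-byte limit are made-up stand-ins for the loaded parser plugins and the http.content.limit property.

```python
# Hypothetical stand-ins for loaded parser plugins, keyed by MIME type.
PARSERS = {
    "text/html": lambda data: "html:%d bytes" % len(data),
}

CONTENT_LIMIT = 65536  # assumed value standing in for http.content.limit


def handle(content_type, body, parsers=PARSERS, limit=CONTENT_LIMIT):
    """Truncate body to limit, then parse it if a parser is loaded for its type."""
    # Drop parameters such as "; charset=utf-8" from the Content-Type header.
    mime = content_type.split(";")[0].strip().lower()
    data = body[:limit]
    truncated = len(body) > limit
    parser = parsers.get(mime)
    if parser is None:
        # The page was fetched, but nothing is loaded that can parse it.
        return (mime, None, truncated)
    return (mime, parser(data), truncated)


if __name__ == "__main__":
    pdf_body = b"%PDF-" + b"0" * 100000
    print(handle("text/html; charset=utf-8", b"<html></html>"))
    print(handle("application/pdf", pdf_body))
    # Same fetch once a (hypothetical) pdf parser has been added.
    with_pdf = dict(PARSERS)
    with_pdf["application/pdf"] = lambda data: "pdf:%d bytes" % len(data)
    print(handle("application/pdf", pdf_body, parsers=with_pdf))
```

In this model the pdf is fetched either way; it only gets parsed once a parser for its type is present, and a large file is cut off at the limit unless the limit is raised.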


;rob


On 2005, Jul 11, at 8:24 PM, Guan Yu wrote:

Thanks Rob. Your solution works if I have a jsp page returning html
content. But it doesn't work if I have a servlet returning pdf file.

-----Original Message-----
From: Rob Pettengill [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 12, 2005 4:27 AM
To: [email protected]
Subject: Re: it seems that nutch ignores url which has query string

It's probably worth reading through all the files in the conf
directory to get an idea of what the default settings are and what
adjustments can be made there.

URLs with "?" in them indicate that the files are generated in
response to the passed parameters.  In many cases these active pages
are not search friendly.  The link may have side effects (e.g.,
placing an order) that you don't want search to trigger or it may
lead to a search "black hole" that generates an infinite number of
links (e.g., a tomorrow link in a web calendar).
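
A small Python sketch of the calendar case, to show why such a link never runs out. The host name calendar.example.com and the "day" parameter are invented for the illustration.

```python
from datetime import date, timedelta
from itertools import islice


def tomorrow_links(start):
    """Yield the endless chain of URLs a crawler would find by following 'tomorrow'."""
    day = start
    while True:
        day += timedelta(days=1)
        # Every page is generated on request and links to yet another new URL.
        yield "http://calendar.example.com/view?day=%s" % day.isoformat()


# Only the first three are taken here; the generator itself never ends.
print(list(islice(tomorrow_links(date(2005, 7, 11)), 3)))
```

Each URL is distinct, so a crawler that follows them has no natural stopping point.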

That is why "?" is included in one of the default exclusion rules in
the conf/regex-urlfilter.txt file:

   # skip URLs containing certain characters as probable queries, etc.
   -[?*!@=]

If you totally take this out you will probably be sorry.  A better
approach might be to precede this line with exceptions that you are
sure will cause no problems.
For example I know one site that adds a gratuitous "?" to the end of
every asp url (I guess they are trying to hide from potential
customers who use search engines :-).  I can tell nutch that it is ok
to index "?" files from this site by adding the following line in
front of the pattern that skips "?" URLs:

   #exceptions to skip rule
   +search.unfriendly.site.com/.*\.asp\?$
   # skip URLs containing certain characters as probable queries, etc.
   -[?*!@=]

The same technique can also be used to make exceptions to the other
rules, for example to index .pdf files only from sites in a certain
domain.
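
As a rough illustration of why the exception has to come first, here is a minimal Python sketch of first-match-wins filtering. It assumes the rules are tried top to bottom, that a pattern matches if it is found anywhere in the URL, and that a URL matching no rule is rejected. The accepts helper, the catch-all "+." rule, and the sample URLs are made up for the example, and the skip pattern is written here as [?*!@=] on the assumption that this is what the default rule looks like.

```python
import re

# Rules in file order: (sign, pattern). "+" accepts, "-" rejects.
RULES = [
    ("+", r"search.unfriendly.site.com/.*\.asp\?$"),  # exception from the message
    ("-", r"[?*!@=]"),                                # assumed default skip rule
    ("+", r"."),                                      # assumed catch-all accept
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    for url in [
        "http://search.unfriendly.site.com/catalog/list.asp?",
        "http://www.citycab.com.sg:8003/wsf/news.jsp?id=82",
        "http://www.citycab.com.sg:8003/wsf/index.html",
    ]:
        print(accepts(url), url)
```

With the exception listed after the skip rule instead, the trailing "?" would hit the skip rule first and the .asp URL would be dropped.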
--
Robert C. Pettengill, Ph.D.
    [EMAIL PROTECTED]

Questions about petroleum?
     Goto:   http://AskAboutOil.com/

On 2005, Jul 10, at 9:38 PM, Guan Yu wrote:

Hi,

I'm using intranet search. There are links in my web pages like the
following:  <a
href="http://www.citycab.com.sg:8003/wsf/news.jsp?id=82">News</a>. It
seems that the above link can't be found by nutch. How can I solve this
problem?

Thanks,
Guan Yu





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
