Guan Yu,
Sorry, I can't give you a solution without knowing what your
requirements are. These are just responses to your questions.
The decision to follow a link or not can only be based on the syntax
of the link. This can be misleading or uninformative as to the
content type. The Regex and Prefix URL filters make the decision
whether to follow a URL. How any content that comes back is handled
is the responsibility of other parts of the system.
The content that comes back when a link is followed will be tagged
with a MIME content type per the HTTP spec (and even this is
sometimes wrong with misconfigured servers). How the content gets
handled depends on the content handler plugins that you have loaded.
Look at the plugin.includes property. If you want to parse and index
the PDF file that is returned, then you need to load a PDF plugin
with that property. You will also have to consider the
http.content.limit property if you want to index other content types,
because they are often much larger than HTML files. I'm less
familiar with this part of the system and don't know how robust it is
for mislabeled content, but hopefully you don't have to worry about
that.
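To make the split between "follow the link" and "handle the content" concrete, here is a minimal sketch in Python. It is not Nutch code. The PARSERS table, the handle function and the 65536-byte limit are all hypothetical stand-ins: the table plays the role of the parser plugins you have loaded, and CONTENT_LIMIT plays the role of http.content.limit, assuming that content beyond the limit is simply cut off.

```python
# Hypothetical parser registry keyed by MIME type. Loading a PDF parser
# plugin would correspond to having the "application/pdf" entry here.
PARSERS = {
    "text/html": lambda data: "html:%d bytes" % len(data),
    "application/pdf": lambda data: "pdf:%d bytes" % len(data),
}

# Assumed byte limit, standing in for http.content.limit.
CONTENT_LIMIT = 65536


def handle(url, content_type, body, parsers=PARSERS, limit=CONTENT_LIMIT):
    """Pick a parser from the server's Content-Type header, never from the URL."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";")[0].strip().lower()
    parser = parsers.get(mime)
    if parser is None:
        return None  # no parser loaded for this type, so nothing is indexed
    return parser(body[:limit])  # anything past the limit is discarded


# The URL says nothing about the type; the Content-Type header decides.
print(handle("http://example.com/servlet/report?id=7",
             "application/pdf", b"x" * 100000))
```

Note that the servlet URL in the example carries no hint that a PDF comes back, and that the body is cut to the limit before it reaches the parser, which is why a small limit can be a problem for large non-HTML files.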
;rob
On 2005, Jul 11, at 8:24 PM, Guan Yu wrote:
Thanks Rob. Your solution works if I have a JSP page returning HTML
content. But it doesn't work if I have a servlet returning a PDF file.
-----Original Message-----
From: Rob Pettengill [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 12, 2005 4:27 AM
To: [email protected]
Subject: Re: it seems that nutch ignores url which has query string
It's probably worth reading through all the files in the conf
directory to get an idea of what the default settings are and what
adjustments can be made there.
URLs with "?" in them indicate that the files are generated in
response to the passed parameters. In many cases these active pages
are not search friendly. The link may have side effects (e.g.,
placing an order) that you don't want search to trigger or it may
lead to a search "black hole" that generates an infinite number of
links (e.g., a tomorrow link in a web calendar).
That is why "?" is included in one of the default exclusion rules in
the conf/regex-urlfilter.txt file:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
Guan Yu, if you take this rule out entirely you will probably be sorry.
A better approach might be to precede this line with exceptions that
you are sure will cause no problems. For example, I know one site that
adds a gratuitous "?" to the end of every ASP URL (I guess they are
trying to hide from potential customers who use search engines :-). I
can tell Nutch that it is OK to index "?" URLs from this site by adding
the following line in front of the pattern that skips "?" URLs:
#exceptions to skip rule
+search.unfriendly.site.com/.*\.asp\?$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
The same technique can also be used to make exceptions to the other
rules, for example to index .pdf files only from sites in a certain
domain.
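The reason the exception has to come first can be sketched in a few lines of Python. This is not Nutch code, only a hypothetical accepts helper that mimics the behaviour under these assumptions: rules are tried from top to bottom, the first pattern found anywhere in the URL decides ("+" accepts, "-" rejects), and a URL that matches no rule is skipped. The skip pattern is written as [?*!@=], which is my reading of the default rule, and the final "+." catch-all is only there for illustration.

```python
import re

# Rules in file order as (sign, pattern). "+" accepts, "-" rejects.
RULES = [
    ("+", r"search.unfriendly.site.com/.*\.asp\?$"),  # exception from the message
    ("-", r"[?*!@=]"),                                # assumed default skip rule
    ("+", r"."),                                      # illustrative catch-all
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL matching no rule is skipped


# The exception is reached before the skip rule, so this URL is accepted.
print(accepts("http://search.unfriendly.site.com/catalog/list.asp?"))  # True

# No exception covers this URL, so the "?" and "=" trigger the skip rule.
print(accepts("http://www.citycab.com.sg:8003/wsf/news.jsp?id=82"))    # False

# With the exception placed after the skip rule it would never be reached.
print(accepts("http://search.unfriendly.site.com/catalog/list.asp?",
              rules=[RULES[1], RULES[0], RULES[2]]))                   # False
```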
--
Robert C. Pettengill, Ph.D.
[EMAIL PROTECTED]
Questions about petroleum?
Goto: http://AskAboutOil.com/
On 2005, Jul 10, at 9:38 PM, Guan Yu wrote:
Hi,
I'm using intranet search. There are links in my web pages like the
following: <a
href="http://www.citycab.com.sg:8003/wsf/news.jsp?id=82">News</a>. It
seems that the above link can't be found by nutch. How can I solve this
problem?
Thanks,
Guan Yu
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general