Lars Aronsson wrote:

Andrzej Bialecki wrote:

Reading the searchenginewatch forum the other day, I came to the conclusion
that currently Nutch is rather careless about bandwidth


To be really economic with bandwidth, the search engine should only
fetch enough information to present as search hits.  Instead of just
registering if the page has changed (and how often), it could also
register how often the page has been shown in a query hit list.  If

The algorithm you describe limits updates to the information retrieved by the most frequent historical queries. If new users then look for "entomology", the hits could be disappointing, because they would reflect the subjective interests of other people rather than objective results. My point is that the set of possible queries on a public engine is open-ended, and for intranet search engines there are other, safer methods to do it.


So, it should probably also use a threshold for the maximum value of the fetch interval, so that even if there have been no queries on a topic so far, you can still present something which is not embarrassingly old... :-)
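
To make the idea concrete, here is a rough sketch of a demand-weighted fetch interval with a ceiling. It is only an illustration under assumptions: the function name, the `target_hits` knob and the 1-day / 90-day bounds are all made up for this example and are not part of Nutch.

```python
import math

DAY = 24 * 60 * 60  # seconds


def fetch_interval(change_interval, hits_shown, target_hits=10,
                   min_interval=1 * DAY, max_interval=90 * DAY):
    """Return seconds to wait before refetching a page.

    change_interval: estimated seconds between content changes.
    hits_shown: how often the page appeared in a result list recently.
    All parameter names and defaults are hypothetical.
    """
    # Demand is 1.0 once the page is shown at least target_hits times,
    # and shrinks towards 0 for pages that rarely appear in results.
    demand = min(1.0, hits_shown / target_hits)
    # Low demand stretches the interval; zero demand would mean "never".
    raw = change_interval / demand if demand > 0 else math.inf
    # The ceiling guarantees that never-queried pages are still refreshed.
    return max(min_interval, min(raw, max_interval))
```

With these assumed defaults, a page that changes weekly and is shown often is refetched weekly, while a page nobody has asked for yet is still refetched every 90 days instead of never.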

all users only query for topics in metallurgy, it is quite useless to
fetch new versions of a page on entomology (assuming that the page
will stay on topic).  Especially with a do-it-yourself search engine
like Nutch, I would guess there are many applications that target
small user communities with a narrow focus.  However, updating the
database for every search query might be more expensive than fetching
a few more pages.  It depends on how many you have of each kind.

I'm not sure I agree with this part. I think this function would be better handled by URL filters, or by a custom parser plugin that prunes outlinks based on some external criteria.
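
As a rough sketch of what I mean by filtering, assuming an ordered list of include/exclude regex rules where the first match wins. The rule list, `accept` and `prune_outlinks` are hypothetical names for illustration, not the actual Nutch plugin API.

```python
import re

# Ordered rules: "+" accepts, "-" rejects; the first matching rule decides.
# The patterns are made-up examples for a metallurgy-focused crawl.
RULES = [
    ("-", re.compile(r"entomology", re.IGNORECASE)),
    ("+", re.compile(r"^https?://")),
]


def accept(url, rules=RULES):
    """Return True if the first rule matching the URL is an include rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # URLs that match no rule are rejected.
    return False


def prune_outlinks(outlinks, rules=RULES):
    """Keep only the outlinks that pass the filter, preserving order."""
    return [url for url in outlinks if accept(url, rules)]
```

This keeps the topical restriction in static configuration, so nothing has to be written to the database at query time.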


Thanks for the comments!

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
