Michael Cafarella wrote:
It's really helpful to get feedback like this; I had done some
tests of Nutch results quality a long time ago, but this is the first external one that I know of.
If anyone from OSU is listening, thanks for your help!
--Mike
PS - Does anyone know (Doug?) whether we are crawling the entire OSU site? Does Google have a coverage advantage?
A comment about the crawling: I'm using Nutch in a setup that crawls individual websites, several levels deep. I noticed that for roughly 30% of the sites our users are interested in, the crawl fails to produce enough results. After a short investigation, it seems that most of these websites use JavaScript heavily. It seems a JS link extractor would be very helpful...
With the pre-plugin version I had a solution for this, which used HttpUnit. HttpUnit mimics the browser, which means it retrieves several resources at the same time and builds a DOM of the complete page (with frames and scripts, and the JavaScript object model of a browser). This solution worked exceptionally well: I was able to crawl exhaustively 95+% of the websites from the above selection.
However, with the current plugin structure it is difficult to use this method, because the Fetcher passes the content piecewise (page by page) to the content extractor, and HttpUnit starts working only when several resources are loaded... so, I'm back to square one.
To summarize: I'm not surprised by the results of this study. The Nutch crawler should be more resilient when retrieving links.
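As a rough illustration of the JS link extractor idea mentioned above, here is a minimal sketch that scans script text for quoted string literals that look like URLs. This is not Nutch code and not the HttpUnit approach: the class name JsLinkSketch, the method extract, and the URL heuristic are all hypothetical, and a regex pass like this cannot see links that a script assembles at runtime.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Nutch).
public class JsLinkSketch {
    // A quoted string literal with no whitespace or quotes inside:
    // group 1 is the quote character, group 2 is the literal's contents.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("(['\"])([^'\"\\s]+)\\1");

    // Assumed heuristic for "looks like a link": an absolute http(s) URL,
    // a site-relative path, or a name ending in a common page extension.
    private static final Pattern URL_LIKE = Pattern.compile(
        "^(https?://\\S+|/\\S*|\\S+\\.(html?|php|jsp|asp)(\\?\\S*)?)$",
        Pattern.CASE_INSENSITIVE);

    // Returns the URL-like string literals found in the script text,
    // in order of first appearance and without duplicates.
    public static List<String> extract(String script) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = STRING_LITERAL.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            if (URL_LIKE.matcher(candidate).matches()) {
                found.add(candidate);
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        String script = "window.location='/a/b.html'; "
            + "var x = \"http://example.com/p?q=1\"; var n = 'hello';";
        for (String link : extract(script)) {
            System.out.println(link);
        }
    }
}
```

Something like this could run on inline scripts and event-handler attributes page by page, so it would fit the piecewise way the Fetcher hands over content, at the cost of missing whatever a real JavaScript engine would discover.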
--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
