On Thu, Jun 24, 2004 at 08:03:34PM +0200, Andrzej Bialecki wrote:
>
> A comment about the crawling: I'm using Nutch in a setup that crawls
> individual websites, several levels in depth. I noticed that for roughly
> 30% of the sites our users are interested in, the crawling fails to produce
> enough results. After a short investigation it seems that most of these
> websites use JavaScript heavily. It seems a JS link extractor would be
> very helpful...
>
> With the pre-plugin version I had a solution for this, which used
> HttpUnit. HttpUnit mimics the browser, which means it retrieves several
> resources at the same time and builds a DOM model of the complete page
> (with frames and scripts, and the JavaScript object model of a browser).
> This solution worked exceptionally well - I was able to crawl
> exhaustively 95+ % of the websites from the above selection.
>
> However, with the current plugin structure it is difficult to use this
> method, because the Fetcher passes the content piecewise (page by page)
> to the content extractor, and HttpUnit starts working only when several
> resources are loaded... so, I'm back to square one.
>
Yeah, the plugin structure does have certain limitations. Assuming you do not want a dramatic change, would the following be a viable approach for you? That is: (1) at the fetch step, pack whatever HttpUnit sees into a compressed archive (say, tar) and save it as the content with a tag in metaData; (2) at the parse step, use (of course, write) a customized parser to unpack it and collect outlinks plus other things.

John
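To make the two steps concrete, here is a minimal, self-contained sketch of the idea. It is not Nutch plugin code - a real implementation would live behind Nutch's Protocol and Parser plugin interfaces - and it uses a zip archive (via `java.util.zip`, which is in the JDK) in place of tar purely to keep the example dependency-free. The class name, the naive `href` scan, and the sample pages are all illustrative assumptions, not anything from Nutch or HttpUnit:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

// Hypothetical sketch of the suggested approach: (1) at fetch time, pack
// every resource the browser simulation retrieved into one archive that is
// stored as the page content; (2) at parse time, unpack the archive and
// extract outlinks from each entry. Zip stands in for tar here.
public class ArchiveContentSketch {

    // Fetch step: bundle all fetched resources (name -> bytes) into one archive.
    static byte[] pack(Map<String, byte[]> resources) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (Map.Entry<String, byte[]> e : resources.entrySet()) {
                zos.putNextEntry(new ZipEntry(e.getKey()));
                zos.write(e.getValue());
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    // Parse step: unpack the archive and collect outlinks from every entry.
    static List<String> collectOutlinks(byte[] archive) throws IOException {
        List<String> outlinks = new ArrayList<>();
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                String html = new String(zis.readAllBytes(), StandardCharsets.UTF_8);
                // Naive href="" scan; a real parser would use an HTML parser.
                int i = 0;
                while ((i = html.indexOf("href=\"", i)) >= 0) {
                    int end = html.indexOf('"', i + 6);
                    if (end < 0) break;
                    outlinks.add(html.substring(i + 6, end));
                    i = end;
                }
            }
        }
        return outlinks;
    }

    public static void main(String[] args) throws IOException {
        // Two resources a browser simulation might fetch for one "page".
        Map<String, byte[]> fetched = new LinkedHashMap<>();
        fetched.put("index.html",
            "<a href=\"http://example.com/a\">a</a>".getBytes(StandardCharsets.UTF_8));
        fetched.put("frame.html",
            "<a href=\"http://example.com/b\">b</a>".getBytes(StandardCharsets.UTF_8));

        byte[] archive = pack(fetched);                 // (1) fetch step
        List<String> links = collectOutlinks(archive);  // (2) parse step
        System.out.println(links);
        // prints [http://example.com/a, http://example.com/b]
    }
}
```

The point of the design is that the Fetcher still hands the parse step a single content blob, so the plugin contract is unchanged; the archive plus a metaData tag is what tells the customized parser to unpack rather than parse directly.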
