On Thu, Jun 24, 2004 at 08:03:34PM +0200, Andrzej Bialecki wrote:
> 
> A comment about the crawling: I'm using Nutch in a setup that crawls 
> individual websites, several levels in depth. I noticed that for roughly 
> 30% of sites our users are interested in, the crawling fails to produce 
> enough results. After short investigation it seems that most of these 
> websites use javascript heavily. It seems a JS link extractor would be 
> very helpful...
> 
> With the pre-plugin version I had a solution for this, which used 
> HttpUnit. HttpUnit mimics the browser, which means it retrieves several 
> resources at the same time, and builds a DOM model of the complete page 
> (with frames and scripts, and the JavaScript object model of a browser). 
> This solution worked exceptionally well - I was able to crawl 
> exhaustively 95+ % of the websites from the above selection.
> 
> However, with the current plugin structure it is difficult to use this 
> method, because the Fetcher passes the content piecewise (page by page) 
> to the content extractor, and HttpUnit starts working only when several 
> resources are loaded... so, I'm back to square one.
> 

Yes, the plugin structure does have certain limitations.

Assuming you do not want a dramatic change, would the following
approach be viable for you?

(1) At the fetch step, pack whatever HttpUnit sees into a compressed
archive (say, tar) and save it as the content, with a tag in metaData.

(2) At the parse step, use a customized parser (which you would have
to write, of course) to unpack the archive and collect outlinks plus
other things.
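
To make the two steps concrete, here is a rough sketch in plain Java.
It is only an illustration under several assumptions: it uses zip from
java.util.zip instead of tar (the JDK has no tar support), it touches
neither the Nutch plugin API nor HttpUnit, and every name in it
(BundleSketch, BUNDLE_TAG_KEY, pack, unpack, collectOutlinks) is made
up for the example. The regex link scan is a naive stand-in for
whatever the customized parser would really do with the unpacked
resources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class BundleSketch {
    // Hypothetical metadata tag telling the parse step that the content is a bundle.
    static final String BUNDLE_TAG_KEY = "X-Bundle-Format";
    static final String BUNDLE_TAG_VALUE = "zip";

    // Naive href matcher; a real parser would work on the DOM instead.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Fetch step: pack all retrieved resources (URL -> bytes) into one archive. */
    static byte[] pack(Map<String, byte[]> resources) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(buf)) {
                for (Map.Entry<String, byte[]> e : resources.entrySet()) {
                    // URL-encode the URL so it is a safe entry name (no '/').
                    zip.putNextEntry(new ZipEntry(URLEncoder.encode(e.getKey(), "UTF-8")));
                    zip.write(e.getValue());
                    zip.closeEntry();
                }
            }
            return buf.toByteArray();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Parse step, part 1: unpack the archive back into URL -> bytes. */
    static Map<String, byte[]> unpack(byte[] archive) {
        Map<String, byte[]> out = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            byte[] chunk = new byte[4096];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream data = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    data.write(chunk, 0, n);
                }
                out.put(URLDecoder.decode(entry.getName(), "UTF-8"), data.toByteArray());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out;
    }

    /** Parse step, part 2: collect distinct href targets from every packed resource. */
    static List<String> collectOutlinks(byte[] archive) {
        List<String> links = new ArrayList<>();
        for (byte[] body : unpack(archive).values()) {
            Matcher m = HREF.matcher(new String(body, StandardCharsets.UTF_8));
            while (m.find()) {
                if (!links.contains(m.group(1))) {
                    links.add(m.group(1));
                }
            }
        }
        return links;
    }
}
```

The fetch side would store the result of pack() as the page content and
set the tag in metaData; the parser would check for the tag and, if it
is present, call collectOutlinks() (or something smarter) on the
content.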

John


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers