[EMAIL PROTECTED] wrote:

On Thu, Jun 24, 2004 at 08:03:34PM +0200, Andrzej Bialecki wrote:

A comment about the crawling: I'm using Nutch in a setup that crawls individual websites, several levels deep. I noticed that for roughly 30% of the sites our users are interested in, the crawling fails to produce enough results. After a short investigation it seems that most of these websites use JavaScript heavily. It seems a JS link extractor would be very helpful...

With the pre-plugin version I had a solution for this, which used HttpUnit. HttpUnit mimics the browser, which means it retrieves several resources at the same time, and builds a DOM model of the complete page (with frames and scripts, and the JavaScript object model of a browser). This solution worked exceptionally well - I was able to exhaustively crawl 95+% of the websites from the above selection.

However, with the current plugin structure it is difficult to use this method, because the Fetcher passes the content piecewise (page by page) to the content extractor, and HttpUnit starts working only when several resources are loaded... so, I'm back to square one.



Yeah, the plugin structure does have certain limitations.

Assuming you do not want a dramatic change, is the following a viable approach for you? That is:

(1) at the fetch step, pack whatever HttpUnit sees into a compressed archive
(say, tar), and save it as the content, with a tag in the metaData;

(2) at the parse step, use (that is, write) a customized parser to
unpack it and collect outlinks plus other things.
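The two steps above could be sketched roughly as follows. This is purely illustrative: the class and method names are hypothetical, and it uses zip rather than tar only because java.util.zip ships with the JDK, so no extra library is needed. The real implementation would need to hook into the Fetcher's Content object and the parser plugin interface.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

public class ArchiveContent {
    // Step (1): pack the resources HttpUnit fetched (keyed by URL)
    // into a single archive blob, to be stored as the page content.
    static byte[] pack(Map<String, byte[]> resources) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(buf);
        for (Map.Entry<String, byte[]> e : resources.entrySet()) {
            zip.putNextEntry(new ZipEntry(e.getKey()));
            zip.write(e.getValue());
            zip.closeEntry();
        }
        zip.close();
        return buf.toByteArray();
    }

    // Step (2): the customized parser unpacks the archive, recovering
    // each resource so outlinks etc. can be collected from every part.
    static Map<String, byte[]> unpack(byte[] archive) throws IOException {
        Map<String, byte[]> out = new LinkedHashMap<String, byte[]>();
        ZipInputStream zip =
            new ZipInputStream(new ByteArrayInputStream(archive));
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            ByteArrayOutputStream data = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = zip.read(chunk)) > 0) data.write(chunk, 0, n);
            out.put(entry.getName(), data.toByteArray());
        }
        return out;
    }
}
```

A metaData tag on the content would then tell Nutch to route the blob to this parser instead of the normal HTML parser.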

Interesting... Perhaps; I'm not sure. HttpUnit likes to keep some internal state around when going from one page to another - after all, the idea is to mimic a browser, and to follow the links by actually executing the DOM events, either built-in or defined through JavaScript. It's a pretty smart solution, and one that gives excellent results (if you want to test it, let me know).


However, this means that a lot of internal state needs to be kept around. E.g., in the case of a website with frames, quite often some navigation JavaScript is in one frame, and stays around while you visit other pages (frames, really). This means that there are some JS variables and functions defined, or simply DOM parts defined in the unchanged frames, which I would have to persist until the next fetch round, and restore the state of HttpUnit just before the next fetch. I'm pretty sure that's not possible without some heavy mods to HttpUnit...

What it boils down to is that with the current strong split between Fetcher and Parser it's more difficult to apply the same approach... because HttpUnit played the multiple roles of FetchList tool, Fetcher, and Parser, while keeping the state around. I'll have to think more about whether this is still possible in the new structure...

Or perhaps I should attack the problem from a completely different angle - treat HttpUnit as a replacement for the three tools, and have it use the plugins the same way those tools do...

I also had a look at Heritrix (archive.org's crawler), and its JS Extractor. They took a different route - they don't try to resolve the links by tracking the DOM model and executing events, as HttpUnit does. They simply apply some smart regexps and heuristics to find a "good enough" subset of probable links. I'm pretty sure they end up with a lot of noise, but _some_ of the links can still be captured this way... I'll have to try it on my test sites, to see how well it works.
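To make the regexp-and-heuristics idea concrete, here is a minimal sketch in the same spirit (the class name and patterns are my own, not Heritrix's actual ExtractorJS code, whose patterns are considerably more involved): scan the JavaScript for quoted string literals, and keep the ones that look like URLs or relative paths. As the discussion above predicts, this misses computed links and produces false positives - it's a "good enough" filter, not a resolver.

```java
import java.util.*;
import java.util.regex.*;

public class JsLinkGuesser {
    // Any quoted string literal without whitespace, at least 2 chars long.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("[\"']([^\"'\\s]{2,})[\"']");

    // Does the literal look like a URL or a path to a crawlable resource?
    private static final Pattern LOOKS_LIKE_LINK =
        Pattern.compile(
            "^(?:https?://|/|\\./)?[\\w./-]+\\.(?:html?|php|jsp|asp|js|css)(?:[?#].*)?$",
            Pattern.CASE_INSENSITIVE);

    static List<String> guessLinks(String js) {
        List<String> links = new ArrayList<String>();
        Matcher m = STRING_LITERAL.matcher(js);
        while (m.find()) {
            String candidate = m.group(1);
            if (LOOKS_LIKE_LINK.matcher(candidate).matches()) {
                links.add(candidate);
            }
        }
        return links;
    }
}
```

Feeding it a snippet like `window.location = '/products/list.html'` would recover the path, while plain prose strings are dropped - which is roughly the trade-off described above: some real links captured, some noise inevitable.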

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
