That way, URLs that are not linked will be discovered as well.
This also works for all content types, such as PDF, plain text, or Word documents.
My intention was to generalize the problem and use pattern matching instead of trusting a human to get it technically correct.
If we use one of the performance-tuned regular expression packages (Apache) rather than the regular expressions built into Java 1.4.x, the speed is fair enough.
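
A minimal sketch of this pattern-matching idea, for illustration only. It uses the built-in java.util.regex package to stay self-contained, not one of the tuned packages mentioned above. The class name UrlPatternExtractor, the method extractUrls, and the pattern itself are assumptions made for this example and are not Nutch code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds absolute URLs anywhere in a piece of text,
// whether or not they appear inside a link element.
class UrlPatternExtractor {

    // Deliberately simple, assumed pattern: "http://" or "https://" followed
    // by characters that are not whitespace, quotes, brackets or parentheses.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>()]+");

    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // The first URL sits inside a script string, the second in plain text.
        String text = "var target = 'http://www.example.com/a.php?id=1';\n"
                + "See also http://www.example.com/doc.pdf for details.";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A pattern this simple will also swallow trailing punctuation such as a full stop directly after a URL, and it only finds URLs that appear in full as literal text, so a real pattern would need more care.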
Stefan
On 27.05.2004 at 22:27, Andrzej Bialecki wrote:
Hi,
Quite often crawlers are not able to collect all links from a page because the links are constructed by JavaScript. In some extreme cases crawlers get only the root page and nothing else. As part of my work I had to address this problem.
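
To illustrate the problem, here is a minimal sketch of an extractor that only reads href attributes. The page content, the class name HrefOnlyExtractor, and the method extractHrefs are hypothetical and chosen for this example. The second link gets its real target from a script at click time, so that target never shows up in the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical static extractor: it looks only at double-quoted href
// attributes of anchor tags and never executes any script.
class HrefOnlyExtractor {

    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up page: one static link, one link whose target is
        // assembled by a script function when the user clicks.
        String html = "<a href=\"/static.html\">Static</a>\n"
                + "<a href=\"#\" onclick=\"go('nyheter', 7000)\">News</a>\n"
                + "<script>function go(s, id) {\n"
                + "  window.location = '/services.php?service=' + s + '&linkid=' + id;\n"
                + "}</script>";
        System.out.println(extractHrefs(html));
    }
}
```

The extractor returns the static link and the placeholder "#", but not the /services.php address, because that address exists only after the script has run.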
I'm now testing a prototype of a JavaScript-enabled (Rhino) web client, based on HttpUnit, which I'm using to collect links for the Fetcher. It is working quite well. For example, for the page http://www.ad.se I'm getting the following direct outlinks:
http://www.ad.se
http://www.ad.se/nyad/top.php
http://www.ad.se/nyad/index.php
http://www.ad.se/nyad/services.php?service=nyheter&linkid=7000
http://www.ad.se/nyad/services.php?service=arkiv&linkid=7003
http://www.ad.se/nyad/services.php?service=bors&linkid=7001
http://www.ad.se/nyad/services.php?service=foretag&linkid=7002
http://www.ad.se/nyad/services.php?service=bevakning&linkid=7004
http://www.ad.se/screen.html
http://www.ad.se/nyad/services.php?service=&nblink=
Do you have any examples of such difficult pages you would like me to test?
Most probably I would be able to contribute the code back to Nutch; this is still under discussion.
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
---------------------------------------------------------------
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org
