Stefan Groschupf wrote:

In the content extractor plugin I wrote, my general solution for links that cannot be extracted is to run regular expressions over the content.
That way, URLs that are not marked up as links are discovered as well.

Sure, but that only works for the subset of JavaScript/link mixes where the URL is still recognizable. Unfortunately, quite a number of sites and pages do something like this (taken from a real page):


function OAS_NORMAL(pos) {
document.write('<A HREF="' + OAS_url + 'click_nx.ads/' + OAS_sitepage + '/1' + OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + '" TARGET=_top>');
}
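
To make the limitation concrete, here is a minimal sketch of regex-based URL discovery using java.util.regex. The class name UrlScan, the method extract, and the pattern itself are illustrative assumptions, not code from the plugin. It finds a URL written out in plain text, but finds nothing in a script that assembles the URL from variables at run time, because no complete URL ever appears in the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlScan {
    // Hypothetical pattern (an assumption for illustration): absolute
    // http/https/ftp URLs, ending at whitespace, a quote, or an angle bracket.
    private static final Pattern URL =
        Pattern.compile("\\b(?:https?|ftp)://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Returns every substring of content that matches the URL pattern.
    public static List<String> extract(String content) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(content);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // A URL that appears literally in the text is recognizable.
        String plain = "See http://www.freebsd.org for details.";
        // A shortened form of the script above: the URL is concatenated from
        // variables, so the source never contains a complete URL.
        String script = "document.write('<A HREF=\"' + OAS_url + 'click_nx.ads/' "
            + "+ OAS_sitepage + '\" TARGET=_top>');";

        System.out.println(extract(plain));
        System.out.println(extract(script));
    }
}
```

Recovering the link in the second case would require evaluating the script, or at least knowing the values of variables such as OAS_url, which pattern matching alone cannot provide.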



This works for all content types, such as PDF, plain text, or Word documents, as well.
My intention was to generalize the problem and use pattern matching instead of trusting a human to do it technically correctly.

My priority was HTML content. However, the method you describe would be quite useful as well.


If we use one of the performance-tuned regex packages (Apache) rather than the regular expressions built into Java 1.4.x, the speed is fair enough.


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
