Doug Cutting wrote:

I just applied this patch.  Thanks!

Doug

I caught two more useful cases - <frame> and <iframe>. I will refactor this a bit to avoid spaghetti conditionals, and submit another patch.
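To illustrate what "avoiding spaghetti conditionals" could look like, here is a minimal table-driven sketch. It is written in Python for brevity and is not the actual Nutch patch; the names LINK_ATTRS, OutlinkCollector and extract_outlinks are made up for this example. The idea is simply that each tag maps to the attribute that carries its link, so supporting another tag means adding one table entry rather than another branch.

```python
from html.parser import HTMLParser

# Hypothetical lookup table: tag name -> attribute holding the outlink.
# Only the tags mentioned in this thread are listed.
LINK_ATTRS = {
    "a": "href",
    "frame": "src",
    "iframe": "src",
}


class OutlinkCollector(HTMLParser):
    """Collects link targets from the tags listed in LINK_ATTRS."""

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attr_name = LINK_ATTRS.get(tag)
        if attr_name is None:
            return
        for name, value in attrs:
            if name == attr_name and value:
                self.outlinks.append(value)


def extract_outlinks(html):
    """Return the link targets found in html, in document order."""
    collector = OutlinkCollector()
    collector.feed(html)
    collector.close()
    return collector.outlinks
```

For example, extract_outlinks('<frame src="menu.html">') returns a one-element list containing "menu.html", while tags that are not in the table are ignored.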


Are you aware of any sensible way of dealing with URLs generated in JavaScript? There are two main cases to cover:

* links hidden within event-handling logic, like scripts attached to onClick. I doubt you can do anything with them.

* links in document elements that are themselves generated by JavaScript. E.g.:

<script language=javascript>
document.write('<a href="/site/' +
        lang + '/index.html">' +
        lang + '</a>');
</script>

The above generates a new document element. If you parse the page into a DOM tree, and then use something like Rhino to execute the script, then perhaps... but it seems awfully complicated.
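Short of executing the script, one cheaper heuristic would be to scan the script text for string literals and pull out anything that looks like an href or src value. The sketch below is exactly that, a heuristic and not a JavaScript interpreter: it is a different technique from the DOM-plus-Rhino approach, it is written in Python for brevity, and the names JS_STRING, ATTR and find_link_candidates are made up for this example. Each candidate is paired with a flag that says whether the attribute value was closed inside the same literal; a value that is cut off by string concatenation is reported as incomplete.

```python
import re

# Single- or double-quoted JavaScript string literals, allowing backslash escapes.
JS_STRING = re.compile(r"'((?:[^'\\]|\\.)*)'" + "|" + r'"((?:[^"\\]|\\.)*)"')

# An href/src attribute inside a literal. Group 1 is the value; group 2 is the
# character that ends it, or empty if the literal stops before the value does.
ATTR = re.compile(r"""(?:href|src)\s*=\s*["']?([^"'\s>]*)(["'\s>]?)""", re.IGNORECASE)


def find_link_candidates(script_text):
    """Return (url_fragment, is_complete) pairs found in script string literals."""
    candidates = []
    for literal_match in JS_STRING.finditer(script_text):
        # Exactly one of the two groups is set, depending on the quote style.
        literal = literal_match.group(1)
        if literal is None:
            literal = literal_match.group(2)
        for attr_match in ATTR.finditer(literal):
            url, terminator = attr_match.group(1), attr_match.group(2)
            if url:
                candidates.append((url, terminator != ""))
    return candidates
```

Applied to the document.write() call above, this finds only the prefix "/site/" and marks it incomplete, because the rest of the URL depends on the value of lang at run time. So it would help with links written out as literal strings, but not with the sites that build every URL from variables.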

The reason I'm asking is that some of the sites I'm trying to index use this fancy-schmancy way of building indexes. So, the only thing I can get from them is the top level index.html... :-(


-- Best regards, Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
