Hi Andrzej and Chris,
I suppose not only JavaScript but also CSS should be parsed, right? I have read the Heritrix source code before, so maybe we can borrow some ideas from it.
RegexpJSLinkExtractor.java http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpJSLinkExtractor.java?rev=1.2&view=markup
RegexpCSSLinkExtractor.java http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpCSSLinkExtractor.java?rev=1.2&view=log
Be extremely careful here... Heritrix is LGPL, and we don't want to end up with tainted code... IANAL, so perhaps implementing the same idea is OK, but how many ways can you implement a regexp matching the same pattern?
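For illustration only, here is a minimal sketch of the regexp approach under discussion, written from scratch and not derived from the Heritrix sources. The class name CssUrlExtractor, the method extractUrls, and the pattern itself are assumptions made up for this example. It only covers plain url(...) references in CSS; it does not handle JavaScript, escaped characters, or @import with a bare string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Heritrix.
public class CssUrlExtractor {
    // Matches url(...) with an optional matching pair of single or double
    // quotes around the target. Group 2 holds the link itself.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*(['\"]?)([^'\")\\s]+)\\1\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    // Returns every url(...) target found in the stylesheet text, in order.
    public static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "body { background: url('img/bg.png'); } @import url(base.css);";
        System.out.println(extractUrls(css));
    }
}
```

The extracted strings are still relative; a crawler would presumably resolve them against the stylesheet's own URL before queueing them.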
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
