Jack Tang wrote:
Hi Andrzej and Chris,

I suppose not only JavaScript but also CSS should be parsed for links, right?
I have read the Heritrix source code before, so maybe we can borrow some ideas from it.

RegexpJSLinkExtractor.java
http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpJSLinkExtractor.java?rev=1.2&view=markup

RegexpCSSLinkExtractor.java
http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpCSSLinkExtractor.java?rev=1.2&view=log

We need to be extremely careful here. Heritrix is LGPL, and we don't want to end up with tainted code. IANAL, so perhaps implementing the same idea is OK, but how many ways can you implement a regexp matching the same pattern?
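
To show how small such a pattern is, here is a minimal sketch written only from the CSS `url(...)` syntax, not from the Heritrix sources. The function name `extract_css_links` and the pattern itself are assumptions for illustration; it ignores escapes, comments, and `@import "file.css"` string forms, so it is not a complete extractor.

```python
import re

# Hypothetical pattern: url( ... ) with optional matching single or double quotes.
# Group 1 captures the opening quote (possibly empty), group 2 the link target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)([^'"()\s]+)\1\s*\)""", re.IGNORECASE)


def extract_css_links(css_text):
    """Return the targets of url(...) references in order of appearance."""
    return [m.group(2) for m in CSS_URL.finditer(css_text)]
```

Calling `extract_css_links` on a stylesheet string returns the raw link targets, which would still need to be resolved against the stylesheet's base URL before fetching.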



--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com


