Hi Andrezj and Chris,

I suppose not only JavaScript but also CSS should be parsed, right? I have read the Heritrix source code before, so maybe we can borrow some ideas from it.

RegexpJSLinkExtractor.java
http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpJSLinkExtractor.java?rev=1.2&view=markup

RegexpCSSLinkExtractor.java
http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpCSSLinkExtractor.java?rev=1.2&view=log

Regards
/Jack

On 4/30/05, Chris A Mattmann <[EMAIL PROTECTED]> wrote:
> Hi Andrezj,
>
> > * JavaScript parsing: currently we ignore JavaScript completely. There
> > are many sites where a lot of links (e.g. menus) are built dynamically.
> > Currently it's impossible to harvest such links. I already made some
> > tests with a full JavaScript interpreter (using Rhino), but it's too
> > slow for massive crawling. A "good enough" solution similar to the one
> > used in other crawlers is needed, namely to use a heuristic JS parser
> > :-) (that is, try to match possible URLs within the script text -
> > somewhat similar to the plain text link extractor).
>
> If you need help with this particular part, I would love to help you. Please
> drop me an email offline with directions and suggestions if you would like
> me to help and coordinate on this.
>
> > * fetching modified content only: this is related to the interaction
> > between Fetcher and protocol plugins. A simple change in Protocol
> > interface will allow all protocol plugins to decide, based on protocol
> > headers, whether the content has been changed since the last fetching,
> > and if they need to fetch the new content. This will result in
> > tremendous bandwidth/disk/CPU savings.
>
> I could totally help out with this. Same deal, if you would like my help on
> this, please let me know.
>
> Thanks,
> Chris
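
To make the "heuristic JS parser" idea from the quoted text a bit more concrete, here is a rough sketch of what matching possible URLs within script text could look like. It is not taken from Heritrix's RegexpJSLinkExtractor or from Nutch; the class name `HeuristicJsLinkExtractor`, both patterns, and the list of page extensions are assumptions made up for illustration, and a real extractor would need a broader notion of what looks like a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a regex-based link extractor for script text.
// Neither the class name nor the patterns come from Heritrix or Nutch.
class HeuristicJsLinkExtractor {

    // A quoted string literal with no embedded quotes or whitespace.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])([^\"'\\s]+)\\1");

    // Assumed heuristic: a literal looks like a URL if it is an absolute
    // http(s) URL, or a path ending in a common page extension
    // (optionally followed by a query string or fragment).
    private static final Pattern URL_LIKE = Pattern.compile(
        "^(https?://\\S+|[\\w./-]+\\.(html?|php|jsp|asp)([?#].*)?)$",
        Pattern.CASE_INSENSITIVE);

    // Returns URL-like string literals in the order they appear in the script.
    static List<String> extract(String script) {
        List<String> links = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            if (URL_LIKE.matcher(candidate).matches()) {
                links.add(candidate);
            }
        }
        return links;
    }
}
```

Calling `HeuristicJsLinkExtractor.extract(scriptText)` on the body of a `<script>` element would return candidate links that still have to be resolved against the page's base URL. The trade-off compared with running a full interpreter such as Rhino is speed against accuracy: URLs assembled by string concatenation are missed, and some literals that are not links will be picked up.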

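
For the "fetching modified content only" item in the quoted text, a minimal sketch of how a protocol plugin could decide from protocol headers whether to download again, assuming plain HTTP and the standard `If-Modified-Since` / `304 Not Modified` mechanism. The class `ConditionalFetch` and its methods are hypothetical names for illustration; this is not the Nutch Protocol interface, and other protocols would need their own check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch of a conditional HTTP fetch; not Nutch's Protocol API.
class ConditionalFetch {

    // True when the server reports that the stored copy is still current.
    static boolean isNotModified(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED; // 304
    }

    // Returns the new content, or null if the server answered 304.
    // lastFetchMillis is the time of the previous fetch (epoch millis);
    // a value of 0 means "no previous fetch" and sends no condition.
    static byte[] fetchIfModified(URL url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setIfModifiedSince(lastFetchMillis); // sets the If-Modified-Since header
        try {
            if (isNotModified(conn.getResponseCode())) {
                return null; // unchanged: skip download, parsing and storage
            }
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes(); // requires Java 9 or later
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The savings come from the `null` branch: when the server answers 304 it sends no body, so the fetcher spends no bandwidth on the content and nothing has to be re-parsed or rewritten to disk.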