Hi Andrzej and Chris,

I suppose CSS should be parsed as well as JavaScript, right?
I have read the Heritrix source code before, so maybe we can borrow some ideas from it.

RegexpJSLinkExtractor.java
http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpJSLinkExtractor.java?rev=1.2&view=markup

RegexpCSSLinkExtractor.java
http://cvs.sourceforge.net/viewcvs.py/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/extractor/RegexpCSSLinkExtractor.java?rev=1.2&view=log
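
To make the idea concrete, here is a rough sketch of what a regex-based
heuristic extractor could look like. This is not the Heritrix code; the class
name, the patterns, and the list of file extensions are my own assumptions,
chosen only to illustrate "match possible URLs within the script text".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Heritrix.
public class HeuristicLinkExtractor {

    // Quoted string literals without whitespace: "..." or '...'.
    private static final Pattern JS_STRING =
        Pattern.compile("([\"'])([^\"'\\s]{1,2048})\\1");

    // Assumption: a literal "looks like" a URL if it is an absolute http(s)
    // URL, or a path ending in one of a few common extensions.
    private static final Pattern LIKELY_URL = Pattern.compile(
        "https?://\\S+|/?[\\w\\-./]+\\.(html?|php|jsp|asp|js|css)(\\?\\S*)?",
        Pattern.CASE_INSENSITIVE);

    // CSS url(...) references, with or without quotes.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*[\"']?([^\"')\\s]+)[\"']?\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    // Returns string literals in the script text that look like URLs,
    // in order of first appearance, without duplicates.
    public static List<String> extractFromJs(String script) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = JS_STRING.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            if (LIKELY_URL.matcher(candidate).matches()) {
                found.add(candidate);
            }
        }
        return new ArrayList<>(found);
    }

    // Returns the targets of url(...) references in a stylesheet.
    public static List<String> extractFromCss(String css) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            found.add(m.group(1));
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        String js = "var menu = ['/products/index.html', "
                  + "\"http://example.com/a?b=1\", 'hello'];";
        String css = "body { background: url(\"img/bg.png\"); } "
                   + "@import url(print.css);";
        System.out.println(extractFromJs(js));
        System.out.println(extractFromCss(css));
    }
}
```

The extracted strings would still need to be resolved against the page's base
URL and filtered like any other outlink. A heuristic like this will miss URLs
that are assembled by string concatenation and will produce some false
positives, which seems to be the accepted trade-off for speed.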


Regards
/Jack


On 4/30/05, Chris A Mattmann <[EMAIL PROTECTED]> wrote:
> Hi Andrzej,
> 
> >
> > * JavaScript parsing: currently we ignore JavaScript completely. There
> > are many sites where a lot of links (e.g. menus) are built dynamically.
> > Currently it's impossible to harvest such links. I already made some
> > tests with a full JavaScript interpreter (using Rhino), but it's too
> > slow for massive crawling. A "good enough" solution similar to the one
> > used in other crawlers is needed, namely to use a heuristic JS parser
> > :-) (that is, try to match possible URLs within the script text -
> > somewhat similar to the plain text link extractor).
> 
> If you need help with this particular part, I would love to help you. Please
> drop me an email offline with directions and suggestions if you would like
> me to help and coordinate on this.
> 
> > * fetching modified content only: this is related to the interaction
> > between Fetcher and protocol plugins. A simple change in Protocol
> > interface will allow all protocol plugins to decide, based on protocol
> > headers, whether the content has been changed since the last fetching,
> > and if they need to fetch the new content. This will result in
> > tremendous bandwidth/disk/CPU savings.
> 
> I could totally help out with this. Same deal, if you would like my help on
> this, please let me know.
> 
> Thanks,
>  Chris
> 
>
