Hi all,
I'd like to use Nutch to crawl my internal company SVN repository, but it doesn't work: the crawl always reports "0 urls". The repository's structure is similar to the http://svn.macosforge.org/repository/macports/ site.

Thanks a lot to Alex for his great help and clear explanation (all the best, man). Thanks to him I finally found the reason, as follows (in Alex's words):

"Look at the source of the http://svn.macosforge.org/repository/macports/ site. It contains SVN's internal XML markup, not HTML. When a browser downloads content from this page, it automatically applies the XSL stylesheet referenced from the XML, which produces HTML. Nutch cannot do this by default. When it downloads the content, it tries to parse it with the HTML parser, and of course it doesn't see any <a> tags, so it doesn't produce new links. I am afraid you will have to develop a special plugin that applies the XSL stylesheet, and place it before the HTML parser."

Does anybody know if there is a plugin that makes Nutch apply an XSL stylesheet? Any ideas?

Thanks.
Bryan
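
For illustration, here is a minimal sketch of the stylesheet step described above, using only the JDK's built-in javax.xml.transform API. It is not a Nutch plugin and does not use any Nutch classes. The class name SvnIndexToHtml, the sample XML (modelled on what an SVN directory listing typically looks like), and the tiny stylesheet are all assumptions made up for this example. A real plugin would have to fetch the stylesheet that the page references and hand the resulting HTML on to the HTML parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Nutch: turns an SVN-style XML listing into HTML.
public class SvnIndexToHtml {

    // Assumed sample input, shaped like an SVN XML directory index.
    static final String SAMPLE_XML =
        "<?xml version='1.0'?>"
        + "<?xml-stylesheet type='text/xsl' href='/svnindex.xsl'?>"
        + "<svn version='1.6'>"
        + "<index rev='1' path='/' base='repo'>"
        + "<dir name='trunk' href='trunk/'/>"
        + "<file name='README' href='README'/>"
        + "</index>"
        + "</svn>";

    // Assumed minimal stylesheet: emits one <a> element per dir or file entry.
    static final String SAMPLE_XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html'/>"
        + "<xsl:template match='/'>"
        + "<html><body>"
        + "<xsl:apply-templates select='svn/index/dir | svn/index/file'/>"
        + "</body></html>"
        + "</xsl:template>"
        + "<xsl:template match='dir | file'>"
        + "<a href='{@href}'><xsl:value-of select='@name'/></a>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given XSL stylesheet to the given XML and returns the result as a string.
    public static String toHtml(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The printed HTML contains <a> tags that an HTML link extractor can see.
        System.out.println(toHtml(SAMPLE_XML, SAMPLE_XSL));
    }
}
```

The point of the sketch is only that, once the stylesheet has been applied, the output contains ordinary <a href="..."> links, which is what the HTML parser needs in order to produce outlinks.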
