Hi all,

 

I'd like to use Nutch to crawl my company's internal SVN repository, but it
does not work: the crawl always reports "0 urls".

The repository structure is similar to the
http://svn.macosforge.org/repository/macports/ site.

 

Many thanks to Alex for his great help and clear explanation. All the best,
man. I finally found the reason, as follows (in Alex's words):

"Look at the source of the http://svn.macosforge.org/repository/macports/
site. It contains SVN's internal XML markup, not HTML. When a browser
downloads content from this page, it automatically applies the XSL stylesheet
referenced from the XML, which produces HTML.
Nutch cannot do this by default. When it downloads the content, it tries to
parse it with the HTML parser and of course doesn't see any <a> tags, so it
doesn't produce new links.
I am afraid you will have to develop a special plugin that applies the XSL
stylesheet, and place it before the HTML parser."

 

Does anybody know if there is a plugin that makes Nutch apply an XSL
stylesheet to XML pages before parsing them? Any ideas?
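
To make the question concrete, here is a rough sketch of the transformation
step such a plugin would need, written with the standard javax.xml.transform
API only. It is not Nutch plugin code and does not use any Nutch classes. The
sample XML, the sample stylesheet, and the names SvnIndexXslt, toHtml and
extractLinks are my own assumptions for illustration; a real SVN listing and
its stylesheet may look different.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SvnIndexXslt {
    // Assumed shape of an SVN XML directory listing (hypothetical sample).
    public static final String SAMPLE_XML =
        "<?xml version=\"1.0\"?>"
      + "<svn><index path=\"/\">"
      + "<dir name=\"trunk\" href=\"trunk/\"/>"
      + "<file name=\"README\" href=\"README\"/>"
      + "</index></svn>";

    // Assumed stylesheet that turns each <dir>/<file> entry into an <a> tag.
    public static final String SAMPLE_XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\"/>"
      + "<xsl:template match=\"/\"><html><body>"
      + "<xsl:for-each select=\"svn/index/dir | svn/index/file\">"
      + "<a href=\"{@href}\"><xsl:value-of select=\"@name\"/></a>"
      + "</xsl:for-each>"
      + "</body></html></xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the stylesheet to the XML, as a browser would, and return HTML.
    public static String toHtml(String xml, String xsl) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    // Very simple link extraction, standing in for what an HTML parser does.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("<a\\s+href=\"([^\"]*)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String html = toHtml(SAMPLE_XML, SAMPLE_XSL);
        System.out.println(extractLinks(html));
    }
}
```

If something like this ran before the HTML parser, the parser would receive
real <a> tags and could produce outlinks. How to hook it into Nutch's plugin
mechanism is exactly the part I don't know.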

 

Thanks.

 

Bryan

 
