Any idea of nutch's plugin to parse XML stylesheet? Thanks.

Windflying Thu, 13 Nov 2008 02:17:15 -0800

Hi all,


I'd like to use nutch to crawl my internal company svn repository, but it
does not work: the crawl always reports "0 urls".

The repository's structure is similar to that of the
http://svn.macosforge.org/repository/macports/ site.

 

Thanks a lot to Alex for his great help and clear explanation (all the best,
man). I finally understand the cause, which in Alex's words is as follows:

"Look at the source of the http://svn.macosforge.org/repository/macports/
site. It contains SVN's internal XML markup, not HTML. When a browser
downloads content from this page, it automatically applies the XSL stylesheet
referenced from the XML, which produces HTML.
Nutch cannot do this by default. When it downloads content, it tries to parse
it with the HTML parser, and of course doesn't see any <a> tags, so it doesn't
produce new links.
I am afraid you will have to develop a special plugin that applies the XML
stylesheet, and place it before the HTML parser."
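
To make the step Alex describes concrete, here is a minimal sketch of applying
an XSL stylesheet to an SVN-style XML listing with the standard Java
javax.xml.transform API, so that the result contains ordinary <a> tags. The
class name XslToHtml, the sample XML and the sample stylesheet are made up for
illustration (the real listing and the stylesheet it references may look
different), and the wiring into nutch as a parse plugin is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslToHtml {

    // Hypothetical sample resembling an SVN directory listing (assumption).
    static final String SAMPLE_XML =
        "<svn>"
      + "<index rev=\"1\" path=\"/\">"
      + "<dir name=\"trunk\" href=\"trunk/\"/>"
      + "<file name=\"README\" href=\"README\"/>"
      + "</index>"
      + "</svn>";

    // Hypothetical stylesheet that turns each entry into an HTML link (assumption).
    static final String SAMPLE_XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\"/>"
      + "<xsl:template match=\"/\">"
      + "<html><body>"
      + "<xsl:for-each select=\"svn/index/dir | svn/index/file\">"
      + "<a href=\"{@href}\"><xsl:value-of select=\"@name\"/></a>"
      + "</xsl:for-each>"
      + "</body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the output as a string.
    static String transform(String xml, String xsl) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The printed HTML is what an HTML parser would need to see to find links.
        System.out.println(transform(SAMPLE_XML, SAMPLE_XSL));
    }
}
```

In a real crawl the stylesheet would have to be fetched from the location named
in the XML's xml-stylesheet instruction, and the resulting HTML handed to the
HTML parser; both steps are left out of this sketch.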

 

Does anybody know if there is a plugin that makes nutch apply an XML
stylesheet before parsing? Any ideas?

 

Thanks.

 

Bryan
