Re: Any idea of nutch's plugin to parse XML stylesheet? Thanks.

jianguo cai Wed, 19 Nov 2008 07:12:34 -0800

I think you may have to implement it yourself since it is so specific.
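
As a starting point, here is a minimal sketch of the transformation step Alex describes in the quoted message: apply an XSL stylesheet to the XML so that ordinary <a> tags come out. It uses only the JDK's javax.xml.transform API. The class name XslToHtml, the sample XML, and the sample stylesheet are assumptions made up for illustration, not what the real SVN server sends, and wiring this into Nutch's plugin system is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class; the name is not part of Nutch.
public class XslToHtml {

    /** Applies the XSL stylesheet to the XML document and returns the result as a string. */
    public static String transform(String xml, String xsl) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Compile the stylesheet into a Transformer.
        Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        // Run the transformation: XML in, HTML out.
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // Assumed sample input, loosely resembling an SVN directory listing.
    static final String SAMPLE_XML =
            "<svn><index>"
            + "<dir name=\"trunk\" href=\"trunk/\"/>"
            + "<file name=\"README\" href=\"README\"/>"
            + "</index></svn>";

    // Assumed sample stylesheet that turns each entry into an <a> element.
    static final String SAMPLE_XSL =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"html\"/>"
            + "<xsl:template match=\"/\">"
            + "<html><body>"
            + "<xsl:for-each select=\"svn/index/dir | svn/index/file\">"
            + "<a href=\"{@href}\"><xsl:value-of select=\"@name\"/></a>"
            + "</xsl:for-each>"
            + "</body></html>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    public static void main(String[] args) throws TransformerException {
        // The printed HTML contains <a> tags that an HTML parser can follow.
        System.out.println(transform(SAMPLE_XML, SAMPLE_XSL));
    }
}
```

In a real plugin the stylesheet would probably have to be found from the xml-stylesheet processing instruction in the fetched page (the JDK's TransformerFactory.getAssociatedStylesheet may help there), and the resulting HTML would then be handed to the HTML parser for link extraction.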


2008/11/13 Windflying <[EMAIL PROTECTED]>

> Hi all,
>
>
>
> I'd like to use Nutch to crawl my internal company SVN repository, but it
> does not work: the crawl always reports "0 urls".
>
> The structure is similar to the
> http://svn.macosforge.org/repository/macports/ site.
>
>
>
> Thanks a lot to Alex for his great help and clear explanation (all the
> best, man). I finally found the reason, as follows (in Alex's words):
>
> "Look at the source for the http://svn.macosforge.org/repository/macports/
> site. It contains SVN internal XML markup, not HTML. When a browser
> downloads content from this page, it automatically applies the XSL
> stylesheet referenced from the XML, which produces HTML.
> Nutch cannot do this by default. When it downloads content, it tries to
> parse it with the HTML parser and of course doesn't see the <a> tag, and so
> doesn't produce new links.
> I am afraid you will have to develop a special plugin that applies the XML
> stylesheet, and place it before the HTML parser."
>
>
>
> Does anybody know if there is a plugin that lets Nutch apply an XML
> stylesheet? Any ideas?
>
>
>
> Thanks.
>
>
>
> Bryan
>
>
>
>
