But I also found a problem. Some links extracted from a page may contain internal spaces, like "http://www.domain.com/sub/dynamic .0001.html". I guess this is caused by the page's style file. The link can be extracted, but it is in fact a broken link that can't be followed any further.
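One way to deal with this is to validate extracted outlinks before following them. The helper below is only a sketch, not part of any real Nutch API: `LinkCleaner` is a hypothetical class that drops links containing internal spaces (since stripping the space is only a guess at the intended URL) and does a basic syntactic check on the rest.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkCleaner {

    // Hypothetical helper, not a Nutch API: returns a cleaned link,
    // or null if the link should be discarded.
    public static String clean(String link) {
        if (link == null) {
            return null;
        }
        // Leading/trailing whitespace is safe to strip.
        String trimmed = link.trim();
        // An internal space (e.g. ".../dynamic .0001.html") means we can't
        // know the intended URL, so treat the link as broken and drop it.
        if (trimmed.indexOf(' ') >= 0) {
            return null;
        }
        // Basic syntactic validation; rejects things like "htp://...".
        try {
            new URL(trimmed);
        } catch (MalformedURLException e) {
            return null;
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Broken link with an internal space: dropped.
        System.out.println(clean("http://www.domain.com/sub/dynamic .0001.html"));
        // Normal link: kept as-is after trimming.
        System.out.println(clean("  http://www.domain.com/sub/index.html "));
    }
}
```

In a real Nutch deployment this kind of check would more naturally live in a URL filter or normalizer plugin rather than a standalone class.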
2006/2/20, Andrzej Bialecki <[EMAIL PROTECTED]>:
>
> Elwin wrote:
> > No, I don't try to do that. I just use the default parser for the plugin.
> > It seems that it works well now. Thx.
>
> I often find TagSoup performing better than NekoHTML. In case of some
> grave HTML errors Neko tends to simply truncate the document, while
> TagSoup just "keeps on truckin'". This is especially true for pages with
> multiple <html> elements, where Neko ignores all elements but the first
> one, while TagSoup just treats any <html> elements inside a document
> like any other nested element.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
