But I also found a problem. Some links extracted from a page may contain internal spaces, like "http://www.domain.com/sub/dynamic .0001.html". I guess this is caused by the page's style file. The link can be extracted, but it is in fact a broken link that can't be followed any further.
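One way to deal with this is to validate extracted outlinks before following them. The helper below is only a sketch, not part of any real Nutch API: `LinkCleaner` is a hypothetical class that drops links containing internal spaces (since stripping the space is only a guess at the intended URL) and does a basic syntactic check on the rest.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkCleaner {

    // Hypothetical helper, not a Nutch API: returns a cleaned link,
    // or null if the link should be discarded.
    public static String clean(String link) {
        if (link == null) {
            return null;
        }
        // Leading/trailing whitespace is safe to strip.
        String trimmed = link.trim();
        // An internal space (e.g. ".../dynamic .0001.html") means we can't
        // know the intended URL, so treat the link as broken and drop it.
        if (trimmed.indexOf(' ') >= 0) {
            return null;
        }
        // Basic syntactic validation; rejects things like "htp://...".
        try {
            new URL(trimmed);
        } catch (MalformedURLException e) {
            return null;
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Broken link with an internal space: dropped.
        System.out.println(clean("http://www.domain.com/sub/dynamic .0001.html"));
        // Normal link: kept as-is after trimming.
        System.out.println(clean("  http://www.domain.com/sub/index.html "));
    }
}
```

In a real Nutch deployment this kind of check would more naturally live in a URL filter or normalizer plugin rather than a standalone class.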
2006/2/20, Andrzej Bialecki <[EMAIL PROTECTED]>:
>
> Elwin wrote:
> > No, I don't try to do that. I just use the default parser for the plugin.
> > It seems that it works well now. Thx.
>
> I often find TagSoup performing better than NekoHTML. In case of some
> grave HTML errors Neko tends to simply truncate the document, while
> TagSoup just "keeps on truckin'". This is especially true for pages with
> multiple <html> elements, where Neko ignores all elements but the first
> one, while TagSoup just treats any <html> elements inside a document
> like any other nested element.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
