Hi Howie,

Thank you for your valuable suggestion; I will consider it carefully. Since I'm going to parse non-English (actually Chinese) pages, I don't think regular expressions will be very useful to me. I've decided to integrate some simple data-mining techniques instead.
2006/2/19, Howie Wang <[EMAIL PROTECTED]>:
>
> I think doing this sort of thing works out very well for niche search
> engines. Analyzing the contents of the page takes up some time, but it's
> just milliseconds per page. If you contrast this with actually fetching a
> page that you don't want (several seconds * num pages), you can see that
> the time savings are very much in your favor.
>
> I'm not sure you'd create a URLFilter, since I don't think that gives you
> easy access to the page contents. You could do it in an HtmlParseFilter.
> Just copy the parse-html plugin and look for the bit of code where the
> Outlinks array is set. Then filter that Outlinks array as you see fit.
>
> One thing to be careful about is using regular expressions in Java to
> analyze the page contents. I've had lots of problems with hanging using
> java.util.regex. I get this with perfectly legal regexes, and only on
> certain pages. It's not as big a problem for me, since most of my regex
> work happens during the indexing phase, and it's easy to re-index. If it
> happens during the fetch, it's a bigger pain, since you have to recover
> from an aborted fetch. So you might want to do lots of small crawls
> instead of big full crawls.
>
> Howie
>
> > I think this can be done by using a plug-in like a URL filter, but it
> > seems to cause a performance problem in the crawling process. So I'd
> > like to hear your opinions. Is it possible or meaningful to crawl not
> > just by links but by contents or terms?

--
《盖世豪侠》 drew rave reviews and kept 无线 (TVB)'s ratings high, yet delighted as the station was, it still would not give him a prominent role. 周星驰 (Stephen Chow) was no mere fish in a pond: once his comic talent had shown itself, he was naturally unwilling to be left out in the cold, so he moved into film to display his flair on the big screen. 无线 had gained a prize steed only to lose it, and was left with nothing but regret.
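P.S. For the archives, here is roughly what I understand Howie to be suggesting, as a minimal sketch against the Nutch 0.7-era API. The class name TopicOutlinkFilter and the looksOnTopic() test are my own placeholders, and the exact ParseData constructor arguments differ between Nutch versions, so verify them against your checkout. The idea is to keep only the outlinks whose anchor text looks on-topic, so irrelevant links never reach the fetch list:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    // Hypothetical filter: keeps only outlinks whose anchor text looks
    // on-topic, so off-topic links are never queued for fetching.
    public class TopicOutlinkFilter implements HtmlParseFilter {

      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        Outlink[] outlinks = parse.getData().getOutlinks();
        List kept = new ArrayList();

        // Plain indexOf checks rather than java.util.regex, so nothing
        // can hang during the fetch cycle.
        for (int i = 0; i < outlinks.length; i++) {
          if (looksOnTopic(outlinks[i].getAnchor())) {
            kept.add(outlinks[i]);
          }
        }

        if (kept.size() == outlinks.length) {
          return parse; // nothing dropped; return the parse unchanged
        }

        Outlink[] filtered =
            (Outlink[]) kept.toArray(new Outlink[kept.size()]);

        // Rebuild the parse around the reduced outlink array. This
        // ParseData constructor matches Nutch 0.7; later versions take
        // more arguments, so adjust for your version.
        ParseData data = new ParseData(parse.getData().getTitle(),
                                       filtered,
                                       parse.getData().getMetadata());
        return new ParseImpl(parse.getText(), data);
      }

      // Placeholder relevance test; swap in your own scoring.
      private boolean looksOnTopic(String anchor) {
        return anchor != null && anchor.indexOf("新闻") >= 0; // "news"
      }
    }

You would also need to register the class in your plugin's plugin.xml as an extension of the HtmlParseFilter extension point; any existing plugin.xml shows the pattern.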
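P.P.S. On the java.util.regex hangs Howie mentions: that is typically catastrophic backtracking, where nested quantifiers force exponential retries on certain inputs. If you do end up needing regexes, one common workaround is to make the match interruptible and give it a deadline. A sketch, with all class and method names my own:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SafeRegex {

      // Runs pattern.find() on input in a worker thread; gives up and
      // aborts the match if it takes longer than timeoutMs.
      public static boolean find(final Pattern pattern,
                                 final CharSequence input,
                                 long timeoutMs) throws InterruptedException {
        final boolean[] found = new boolean[1];
        Thread worker = new Thread(new Runnable() {
          public void run() {
            try {
              Matcher m =
                  pattern.matcher(new InterruptibleCharSequence(input));
              found[0] = m.find();
            } catch (RuntimeException e) {
              found[0] = false; // match was aborted
            }
          }
        });
        worker.start();
        worker.join(timeoutMs);
        if (worker.isAlive()) {
          worker.interrupt(); // makes charAt() below throw on the next read
          worker.join();
        }
        return found[0];
      }

      // Matcher reads its input through the CharSequence interface, so a
      // charAt() that checks the interrupt flag lets us abort a runaway
      // match from outside.
      static class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;

        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }

        public char charAt(int index) {
          if (Thread.currentThread().isInterrupted()) {
            throw new RuntimeException("regex match interrupted");
          }
          return inner.charAt(index);
        }

        public int length() { return inner.length(); }

        public CharSequence subSequence(int start, int end) {
          return new InterruptibleCharSequence(inner.subSequence(start, end));
        }

        public String toString() { return inner.toString(); }
      }
    }

For example, SafeRegex.find(Pattern.compile(expr), pageText, 2000) gives any single page at most two seconds before the match is abandoned, so one pathological page can't stall a whole crawl.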
