Hi Howie,

  Thank you for the valuable suggestion; I will consider it carefully.
  Since I'm going to parse non-English (actually Chinese) pages, I think
regular expressions may not be very useful to me. I've decided to
integrate some simple data mining techniques to achieve this instead.


2006/2/19, Howie Wang <[EMAIL PROTECTED]>:
>
> I think doing this sort of thing works out very well for niche search
> engines. Analyzing the contents of the page takes up some time, but
> it's just milliseconds per page. If you contrast this with actually
> fetching a page that you don't want (several seconds * num pages),
> you can see that the time savings are very much in your favor.
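
To put rough, made-up numbers on that trade-off: at, say, 5 ms of
analysis per page, examining 10,000 parsed pages costs about 50 seconds
in total, while skipping just 1,000 unwanted fetches at 3 seconds each
saves around 3,000 seconds. The analysis pays for itself many times over.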
>
> I'm not sure if you'd create a URLFilter, since I don't think that
> gives you easy access to the page contents. You could do it in an
> HtmlParseFilter. Just copy the parse-html plugin, look for the bit of
> code where the Outlinks array is set, then filter that Outlinks array
> as you see fit.
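
Here is a minimal sketch of such a filter, assuming the Nutch 0.8-era
plugin API; the filter() signature, the Configurable methods, and the
ParseData constructor all vary between Nutch versions, so check them
against the parse-html source in your tree. The keep() test is a
hypothetical placeholder for whatever content analysis you need.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class OutlinkContentFilter implements HtmlParseFilter {

        private Configuration conf;

        public void setConf(Configuration conf) { this.conf = conf; }

        public Configuration getConf() { return conf; }

        public Parse filter(Content content, Parse parse,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
            ParseData data = parse.getData();
            Outlink[] outlinks = data.getOutlinks();
            List kept = new ArrayList();
            for (int i = 0; i < outlinks.length; i++) {
                if (keep(outlinks[i])) {      // your content-based test
                    kept.add(outlinks[i]);
                }
            }
            Outlink[] filtered =
                (Outlink[]) kept.toArray(new Outlink[kept.size()]);
            // Rebuild the parse with the reduced outlink array; the
            // exact ParseData constructor depends on the Nutch version.
            ParseData newData = new ParseData(data.getStatus(),
                data.getTitle(), filtered, data.getContentMeta(),
                data.getParseMeta());
            return new ParseImpl(parse.getText(), newData);
        }

        // Hypothetical criterion: keep links with non-empty anchor text.
        private boolean keep(Outlink link) {
            String anchor = link.getAnchor();
            return anchor != null && anchor.trim().length() > 0;
        }
    }

The class would also need to be registered in its plugin.xml against the
org.apache.nutch.parse.HtmlParseFilter extension point, in the same way
the existing HtmlParseFilter plugins register theirs.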
>
> One thing to be careful about is using regular expressions in Java to
> analyze the page contents. I've had lots of problems with hanging when
> using java.util.regex. I get this with perfectly legal regexes, and
> it's only on certain pages that I get problems. It's not as big a
> problem for me, since most of my regex work happens during the
> indexing phase, and it's easy to re-index. If it happens during the
> fetch, it's a bigger pain, since you have to recover from an aborted
> fetch. So you might want to do lots of small crawls instead of big
> full crawls.
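
One defensive trick for the hanging-regex problem (a sketch of a common
Java workaround, not something from this thread; the pattern and timeout
below are made up for illustration): java.util.regex pulls its input
through CharSequence.charAt(), so wrapping the page text in a
deadline-checking CharSequence turns a catastrophically backtracking
match into an exception instead of a hang.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TimedCharSequence implements CharSequence {

        private final CharSequence text;
        private final long deadline;   // absolute time, in millis

        public TimedCharSequence(CharSequence text, long timeoutMillis) {
            this.text = text;
            this.deadline = System.currentTimeMillis() + timeoutMillis;
        }

        // The regex engine calls charAt() constantly, so this check
        // fires even deep inside a backtracking run.
        public char charAt(int index) {
            if (System.currentTimeMillis() > deadline) {
                throw new RuntimeException("regex match timed out");
            }
            return text.charAt(index);
        }

        public int length() {
            return text.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimedCharSequence(text.subSequence(start, end),
                deadline - System.currentTimeMillis());
        }

        public static void main(String[] args) {
            // (a+)+b is a classic catastrophic-backtracking pattern.
            Pattern p = Pattern.compile("(a+)+b");
            String page = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
            Matcher m = p.matcher(new TimedCharSequence(page, 500));
            try {
                System.out.println(m.find());
            } catch (RuntimeException e) {
                System.out.println("gave up: " + e.getMessage());
            }
        }
    }

Catching the timeout lets you log the offending page and move on, which
matters most in the fetch path, where an aborted run is expensive to
recover from.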
>
> Howie
>
>
> >I think this can be done by using a plug-in like a URL filter, but
> >it seems to cause a performance problem for the crawling process, so
> >I'd like to hear your opinions. Is it possible or meaningful to crawl
> >not just by links but by contents or terms?
>


--
《盖世豪侠》 won rave reviews and kept TVB's ratings sky-high, yet even
so TVB still passed him over. 周星驰 (Stephen Chow) was never one to
stay a small fish in the pond: once his comic talent had shown itself,
he would not settle for being sidelined, so he moved to the film world
to show his flair on the big screen. TVB had found its prize steed and
then lost it, and by then regret came too late.
