Hi Howie,

Thank you for your valuable suggestion; I will consider it carefully. Since I'm going to parse non-English (actually Chinese) pages, I don't think regular expressions will be very useful to me. I've decided to integrate some simple data-mining techniques instead.
2006/2/19, Howie Wang <[EMAIL PROTECTED]>:
>
> I think doing this sort of thing works out very well for niche search
> engines. Analyzing the contents of the page takes up some time, but it's
> just milliseconds per page. If you contrast this with actually fetching a
> page that you don't want (several seconds * num pages), you can see that
> the time savings are very much in your favor.
>
> I'm not sure you'd create a URLFilter, since I don't think that gives you
> easy access to the page contents. You could do it in an HtmlParseFilter.
> Just copy the parse-html plugin and look for the bit of code where the
> Outlinks array is set. Then filter that Outlinks array as you see fit.
>
> One thing to be careful about is using regular expressions in Java to
> analyze the page contents. I've had lots of problems with hanging using
> java.util.regex. I get this with perfectly legal regexes, and only on
> certain pages. It's not as big a problem for me, since most of my regex
> work happens during the indexing phase, and it's easy to re-index. If it
> happens during the fetch, it's a bigger pain, since you have to recover
> from an aborted fetch. So you might want to do lots of small crawls
> instead of big full crawls.
>
> Howie
>
> > I think this can be done by using a plug-in like a URL filter, but it
> > seems to cause a performance problem in the crawling process. So I'd
> > like to hear your opinions. Is it possible or meaningful to crawl not
> > just by links but by contents or terms?

--
《盖世豪侠》 drew rave reviews and kept 无线 (TVB)'s ratings high, yet delighted as the station was, it still would not give him a prominent role. 周星驰 (Stephen Chow) was no mere fish in a pond: once his comic talent had shown itself, he was naturally unwilling to be left out in the cold, so he moved into film to display his flair on the big screen. 无线 had gained a prize steed only to lose it, and was left with nothing but regret.
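P.S. For the archives, here is roughly what I understand Howie to be suggesting, as a minimal sketch against the Nutch 0.7-era API. The class name TopicOutlinkFilter and the looksOnTopic() test are my own placeholders, and the exact ParseData constructor arguments differ between Nutch versions, so verify them against your checkout. The idea is to keep only the outlinks whose anchor text looks on-topic, so irrelevant links never reach the fetch list:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    // Hypothetical filter: keeps only outlinks whose anchor text looks
    // on-topic, so off-topic links are never queued for fetching.
    public class TopicOutlinkFilter implements HtmlParseFilter {

      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        Outlink[] outlinks = parse.getData().getOutlinks();
        List kept = new ArrayList();

        // Plain indexOf checks rather than java.util.regex, so nothing
        // can hang during the fetch cycle.
        for (int i = 0; i < outlinks.length; i++) {
          if (looksOnTopic(outlinks[i].getAnchor())) {
            kept.add(outlinks[i]);
          }
        }

        if (kept.size() == outlinks.length) {
          return parse; // nothing dropped; return the parse unchanged
        }

        Outlink[] filtered =
            (Outlink[]) kept.toArray(new Outlink[kept.size()]);

        // Rebuild the parse around the reduced outlink array. This
        // ParseData constructor matches Nutch 0.7; later versions take
        // more arguments, so adjust for your version.
        ParseData data = new ParseData(parse.getData().getTitle(),
                                       filtered,
                                       parse.getData().getMetadata());
        return new ParseImpl(parse.getText(), data);
      }

      // Placeholder relevance test; swap in your own scoring.
      private boolean looksOnTopic(String anchor) {
        return anchor != null && anchor.indexOf("新闻") >= 0; // "news"
      }
    }

You would also need to register the class in your plugin's plugin.xml as an extension of the HtmlParseFilter extension point; any existing plugin.xml shows the pattern.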
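P.P.S. On the java.util.regex hangs Howie mentions: that is typically catastrophic backtracking, where nested quantifiers force exponential retries on certain inputs. If you do end up needing regexes, one common workaround is to make the match interruptible and give it a deadline. A sketch, with all class and method names my own:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SafeRegex {

      // Runs pattern.find() on input in a worker thread; gives up and
      // aborts the match if it takes longer than timeoutMs.
      public static boolean find(final Pattern pattern,
                                 final CharSequence input,
                                 long timeoutMs) throws InterruptedException {
        final boolean[] found = new boolean[1];
        Thread worker = new Thread(new Runnable() {
          public void run() {
            try {
              Matcher m =
                  pattern.matcher(new InterruptibleCharSequence(input));
              found[0] = m.find();
            } catch (RuntimeException e) {
              found[0] = false; // match was aborted
            }
          }
        });
        worker.start();
        worker.join(timeoutMs);
        if (worker.isAlive()) {
          worker.interrupt(); // makes charAt() below throw on the next read
          worker.join();
        }
        return found[0];
      }

      // Matcher reads its input through the CharSequence interface, so a
      // charAt() that checks the interrupt flag lets us abort a runaway
      // match from outside.
      static class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;

        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }

        public char charAt(int index) {
          if (Thread.currentThread().isInterrupted()) {
            throw new RuntimeException("regex match interrupted");
          }
          return inner.charAt(index);
        }

        public int length() { return inner.length(); }

        public CharSequence subSequence(int start, int end) {
          return new InterruptibleCharSequence(inner.subSequence(start, end));
        }

        public String toString() { return inner.toString(); }
      }
    }

For example, SafeRegex.find(Pattern.compile(expr), pageText, 2000) gives any single page at most two seconds before the match is abandoned, so one pathological page can't stall a whole crawl.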
