Hi Kelvin,
Just a question out of curiosity.
As far as I know, the goal for Nutch's global crawling
is to reach 10 billion pages, based on the MapReduce
implementation.
OC, which seems to fall in the middle, is for controlled
industry-domain crawling. How many sites is it aiming
for? A couple of thousand?
I believe the key requirement for industry-domain
crawling is timely updating, so identifying the content
of fetched pages and saving post-parsing time is critical.
Thanks,
Michael Ji