Hi Lynn,

Regarding the research goal: it is really hard to explain. Basically, our goal lies in a backbone framework on our Hadoop DFS, and a web crawler was chosen as an application that can return information (URIs) to support some algorithms in the field of Map-Reduce.

Only two replicas will be dispatched, each to a different client (so a URI will be crawled by no more than two clients in a given period, and in most cases these two clients won't run at the same time). The probability of a URI being revisited within 10 days is 5% (controlled by our algorithm), so it will never happen that 3000 clients (or even two of them) crawl it at once.

I know things like this raise real worries about privacy and network traffic, but done this way it really shouldn't.

Thanks,

-Kunsheng

--- On Sat, 9/12/09, Lynn W. Taylor <[email protected]> wrote:

> From: Lynn W. Taylor <[email protected]>
> Subject: Re: [boinc_dev] Anansi is coming back
> To: "Kunsheng Chen" <[email protected]>
> Cc: "BOINC dev" <[email protected]>
> Date: Saturday, September 12, 2009, 2:35 AM
>
> Kunsheng,
>
> Please don't get me wrong, I'm trying to be helpful.
>
> I look at your site, and I don't see anything about the
> research goals.
>
> Nothing about the possible benefits of this research.
>
> From your site, it doesn't even tell me that you just want
> the URIs.
>
> Mostly, it doesn't tell me why I'd want to help.
>
> As far as damage: if 10,000 Anansi users decided to hit my
> servers all at once, how would that differ from a botnet
> attacking my server?
>
> ... and none of the information below is on your site.
>
> -- Lynn
>
> Kunsheng Chen wrote:
> > Hi everyone,
> >
> > Sorry for the confusion in project description.
> >
> > Anansi is a distributed web crawler only for research
> > purposes, with a good manner in complying with rules
> > in "robots.txt" defined by each URI host it meets.
> >
> > The goal of such a project is to explore as many as
> > URIs that could be directly reached by the public. The URIs
> > returned to the server is only used for a scheduling
> > algorithm implemented with Map-Reduce Framework, so * NO
> > harm * is doing to any public or individual.
> >
> > Information collected by Anansi is no more than "URIs"
> > itself,
> >
> > *NO page content* of will be returned;
> >
> > *NO E-mail address* will be returned;
> >
> > *No user/password* will be returned;
> >
> > One task contains one URI each time, in order to
> > reduce load of volunteers.
> >
> > The project has been running as internal one by couple
> > machines for a while and we are looking for helps from the
> > public.
> >
> > The project link, yes, it is http://canis.csc.ncsu.edu:8005/anansi
> >
> > Thanks for your participation,
> >
> > -Kunsheng
> >
> > _______________________________________________
> > boinc_dev mailing list
> > [email protected]
> > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > To unsubscribe, visit the above URL and
> > (near bottom of page) enter your email address.
