Lynn,

I just updated the project description on the web site.

The goal of the project (or, to be more concrete, of my thesis) is to evaluate 
the accuracy and performance of a distributed web crawler. BOINC was chosen as 
the platform for our project.

I know that might sound too broad, so here are the basics: we have a BOINC 
server and, as you know, clients. The server dispatches URIs to be crawled; 
each client fetches its URI, extracts the URIs found on that page, and returns 
them to the server.

The Map-Reduce program is similar in spirit to Google's PageRank algorithm.
The input is a URI together with the sub-URIs found on its page (as reported 
by clients), and the output is a file of <URI, priority> pairs. The priority 
is calculated from the URI's in-degree and out-degree, its timestamp, and the 
changes compared to previous crawling results, all based on the data stored in 
the Hadoop DFS. URIs with higher priorities are queued in the waiting list to 
be crawled. So this is only a crawling application running on the BOINC 
infrastructure; the growth of the URI set is natural, with no intention of 
attacking anyone or creating excessive network traffic.
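To make the <URI, priority> step concrete, here is a minimal sketch of such a 
Map-Reduce pass in Python. The function names and the weighting formula 
(in-degree divided by out-degree) are illustrative assumptions, not the actual 
Anansi implementation; the timestamp and change-detection terms are omitted.

```python
from collections import defaultdict

def map_phase(records):
    """records: iterable of (uri, sub_uris) pairs reported by clients.
    Emits (target_uri, source_uri) link edges."""
    for uri, sub_uris in records:
        for sub in sub_uris:
            yield sub, uri

def reduce_phase(edges, out_degree):
    """Group edges by target URI and compute a simple priority from
    in-degree and out-degree (illustrative weighting only)."""
    in_links = defaultdict(set)
    for target, source in edges:
        in_links[target].add(source)
    priorities = {}
    for uri, sources in in_links.items():
        in_deg = len(sources)
        out_deg = out_degree.get(uri, 1)
        priorities[uri] = in_deg / out_deg
    return priorities

# Example: a.com links to b.com and c.com; b.com links to c.com.
records = [("a.com", ["b.com", "c.com"]), ("b.com", ["c.com"])]
edges = list(map_phase(records))
print(reduce_phase(edges, {"c.com": 4}))
```

The real pipeline would sort the resulting pairs by priority to build the 
crawl waiting list.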

Moreover, the crawling application is well behaved: it complies with the 
robots.txt rules of every URI waiting to be crawled.
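As a sketch of what such a compliance check looks like, Python's standard 
urllib.robotparser can decide whether a given URI may be fetched; the 
"Anansi" user-agent string here is an assumption, not necessarily what the 
clients actually send.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def allowed_to_crawl(uri, user_agent="Anansi"):
    """Fetch the host's robots.txt and ask whether this URI is allowed."""
    parts = urlsplit(uri)
    robots_url = urlunsplit((parts.scheme, parts.netloc,
                             "/robots.txt", "", ""))
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses robots.txt for this host
    return rp.can_fetch(user_agent, uri)
```

A client would call allowed_to_crawl() before fetching each dispatched URI 
and skip (or report back) any URI the host disallows.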

Your concern is valuable to me; it does remind me of many security and 
privacy issues. But what we are doing is actually similar to what Google 
does, except that we keep no information that could turn this into a search 
engine, since we do not care about page content. Web pages nowadays are 
designed with SEO in mind, which means they are trying to please web spiders 
such as Google's and Baidu's. What's more, our crawler is much slower. I 
really don't see how this could harm sites that allow Google and Baidu to 
crawl thousands of times faster, yet feel threatened by Anansi.

I am only trying to make a persuasive case for such a project using BOINC. Of 
course, the "research" may sometimes look weird or unreasonable in the real 
world, but that is really what the project is doing.


Best,

-Kunsheng



--- On Sat, 9/12/09, Lynn W. Taylor <[email protected]> wrote:

> From: Lynn W. Taylor <[email protected]>
> Subject: Re: [boinc_dev] Anansi is coming back
> To: "Kunsheng Chen" <[email protected]>
> Cc: "BOINC dev" <[email protected]>
> Date: Saturday, September 12, 2009, 9:31 PM
> Kunsheng,
> 
> You want me to let you use my computers, you're going to
> use my network 
> connection (my bandwidth), and you aren't going to tell me
> what you're 
> doing because it is "hard to explain?"
> 
> ... and you think that shouldn't raise an alarm?
> 
> Okay, URIs.  Only two accesses per URI.  If a
> single site has 360 URIs, 
> that is 720 accesses.  There are 200 sites here (in
> round numbers) which 
> is something around 144,000 accesses.
> 
> I don't care if that's spread over two weeks, but I do care
> if they're 
> all in the same hour, especially if they won't in some way
> generate 
> revenue for me or for my customers.
> 
> ... and I don't want it at all if I don't know why you're
> scanning.
> 
> You're doing it in a distributed fashion, so I can't block
> your project 
> by IP -- requests will come from anywhere.
> 
> If you think I'm being mean or unreasonable, I'm
> sorry.  I'm trying to 
> give you an idea of what you're going to run into.
> 
> I'm trying to tell you you'll avoid a lot of this if you'd
> just explain 
> what you're doing in detail.
> 
> "Detail" doesn't mean "input to Map-Reduce" -- tell me
> about the output 
> from Map-Reduce.  Tell me how that is going to be
> used, and how that 
> might improve the internet, or the browsing experience, or
> something 
> useful like that.
> 
> Tell me what you hope to prove when you write papers based
> on this research.
> 
> Use your BOINC site to tell everyone.
> 
> -- Lynn
> 
> Kunsheng Chen wrote:
> > Hi Lynn,
> > 
> > Regarding the research goal... it is really hard to
> explain, basically our goal sits in a backbone framework in
> our hadoop DFS, and a web crawler is chosen as a application
> that could returned information (URIs) here to support some
> algorithms on the field of Map-Reduce.
> > 
> > Only two replications will be dispatched to different
> clients (so an URI will be crawled no one than by two
> clients at the period, at at most cases, these two clients
> won' run at the same time ). As an possibility for URI to be
> revisited within 10 days will be 5% (controlled by our
> algorithm), so it will never happen that 3000 clients crawl
> (or even two of them).
> > 
> > I know things like this really cause worries on
> privacy and network traffic, but it really shouldn't in this
> way.
> > 
> > 
> > 
> > Thanks,
> > 
> > -Kunsheng
> > 
> > 
> > --- On Sat, 9/12/09, Lynn W. Taylor <[email protected]>
> wrote:
> > 
> >> From: Lynn W. Taylor <[email protected]>
> >> Subject: Re: [boinc_dev] Anansi is coming back
> >> To: "Kunsheng Chen" <[email protected]>
> >> Cc: "BOINC dev" <[email protected]>
> >> Date: Saturday, September 12, 2009, 2:35 AM
> >> Kunsheng,
> >>
> >> Please don't get me wrong, I'm trying to be
> helpful.
> >>
> >> I look at your site, and I don't see anything
> about the
> >> research goals.
> >>
> >> Nothing about the possible benefits of this
> research.
> >>
> >> From your site, it doesn't even tell me that you
> just want
> >> the URIs.
> >>
> >> Mostly, it doesn't tell me why I'd want to help.
> >>
> >> As far as damage: if 10,000 Anansi users decided
> to hit my
> >> servers all at once, how would that differ from a
> botnet
> >> attacking my server?
> >>
> >> ... and none of the information below is on your
> site.
> >>
> >> -- Lynn
> >>
> >> Kunsheng Chen wrote:
> >>> Hi everyone,
> >>>
> >>>
> >>> Sorry for the confusion in project
> description.
> >>>
> >>>
> >>> Anansi is a distributed web crawler only for
> research
> >> purposes,  with a good manner in complying
> with rules
> >> in "robots.txt" defined by each URI host it
> meets.
> >>>
> >>> The goal of such a project is to explore as
> many as
> >> URIs that could be directly reached by the public.
> The URIs
> >> returned to the server is only used for a
> scheduling
> >> algorithm implemented with Map-Reduce Framework,
> so * NO
> >> harm * is doing to any public or individual.
> >>>
> >>>
> >>> Information collected by Anansi is no more
> than "URIs"
> >> itself,
> >>> *NO page content* of will be returned;
> >>>
> >>> *NO E-mail address* will be returned;
> >>>
> >>> *No user/password* will be returned;
> >>>
> >>>
> >>> One task contains one URI each time, in order
> to
> >> reduce load of volunteers.
> >>> The project has been running as internal one
> by couple
> >> machines for a while and we are ooking for helps
> from the
> >> public.
> >>> The project link, yes, it is http://canis.csc.ncsu.edu:8005/anansi
> >>>
> >>>
> >>>
> >>> Thanks for your participation,
> >>>
> >>> -Kunsheng
> >>>
> >>>
> >>>    
> >>   
> _______________________________________________
> >>> boinc_dev mailing list
> >>> [email protected]
> >>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> >>> To unsubscribe, visit the above URL and
> >>> (near bottom of page) enter your email
> address.
> >>>
> > 
> > 
> >       
> > 
> 


      
