P.S. Sorry to clutter the developer's list with this.  Hopefully it'll 
benefit someone else getting started.

Lynn W. Taylor wrote:
> Trust me, if Google wants to spider pages at too high a rate, they will 
> get complaints from us -- and they have ways we can tell them to slow 
> down their crawls.
> 
> Their crawlers are well-spaced in time, very server-friendly.
> 
> You're talking to someone who is very much interested in SEO.  That's 
> one of the things we sell our customers -- pages that show up in Google.
> 
> Pages that draw customers.
> 
> Even if Google was much more aggressive, we'd tolerate a lot from them 
> because they do generate traffic and revenue.
> 
> Your project is pure research.  It won't drive revenue to us, at least 
> not in the same way Google or Yahoo or Bing drives revenue.
> 
> So if you want volunteers, there should be a benefit.
> 
> I crunched for HashClash.  It seems that HashClash was very much like 
> your project, trying to help someone write their thesis.  I was 
> interested in it because I use cryptographic hashes daily, and I use 
> them in a way where clashes would be bad.
> 
> I crunch SETI because I believe.
> 
> You want people to "crunch" for you, give them something to believe in.
> 
> -- Lynn
> 
> Kunsheng Chen wrote:
>> Lynn,
>>
>>
>> I just updated the project description on the web site.
>>
>> The goal of the project, or to be more concrete my thesis topic, is the 
>> evaluation of the accuracy and performance of a distributed web crawler, 
>> and BOINC was chosen as the platform for our project.
>>
>> I know that might sound too big, but basically we have a BOINC server and, 
>> as you know, clients. The BOINC server dispatches URIs to be crawled, and 
>> each client extracts the URIs found on that page and returns them to the 
>> server.
>>
>> The Map-Reduce program is similar in spirit to Google's PageRank algorithm. 
>> The input is a URI together with the sub-URIs found on its page (from the 
>> clients), and the output of the Map-Reduce job is a file of < URI, priority > 
>> records. The priority is calculated from "in-degree" and "out-degree", the 
>> timestamp, and changes compared to the previous crawling results, all based 
>> on the data we have stored in the Hadoop DFS. URIs with higher priorities 
>> are queued in the waiting list to be crawled. So this is only a crawling 
>> application running on the BOINC infrastructure, and the growth of the URI 
>> set is natural, with no intention to attack anyone or to create network 
>> traffic.
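>>
>> (To give a concrete picture: a rough sketch in Python, in the style of a 
>> Hadoop Streaming job. The I/O format, field names and weights here are 
>> illustrative assumptions, not the actual Anansi code.)
>>
>>   # mapper.py -- input lines are "source_uri <TAB> linked_uri" from the clients
>>   import sys
>>   for line in sys.stdin:
>>       source, target = line.rstrip("\n").split("\t")
>>       print("%s\tout" % source)   # the source page gained one out-link
>>       print("%s\tin" % target)    # the target page gained one in-link
>>
>>   # reducer.py -- lines arrive grouped by URI; fold the counts into a priority
>>   import sys
>>   from itertools import groupby
>>   for uri, group in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
>>       counts = {"in": 0, "out": 0}
>>       for line in group:
>>           counts[line.strip().split("\t")[1]] += 1
>>       # illustrative weighting only; the real formula also uses timestamps
>>       # and page-change history stored in the Hadoop DFS
>>       priority = 2 * counts["in"] + counts["out"]
>>       print("%s\t%d" % (uri, priority))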
>>
>> Moreover, the crawling application is well-mannered: it complies with 
>> robots.txt for each URI waiting to be crawled.
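>>
>> (For example, a minimal check with Python 3's standard robotparser module; 
>> the "Anansi" user-agent string and the URLs below are just placeholders for 
>> illustration.)
>>
>>   from urllib import robotparser
>>
>>   rp = robotparser.RobotFileParser()
>>   rp.set_url("http://example.org/robots.txt")
>>   rp.read()                       # fetch and parse the host's robots.txt
>>   if rp.can_fetch("Anansi", "http://example.org/some/page.html"):
>>       pass   # allowed: fetch the page and extract its URIs
>>   else:
>>       pass   # disallowed: skip the URI and report nothing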
>>
>> Your concern is valuable to me; it reminds me of a lot of security and 
>> privacy problems. But what we are doing is actually similar to what Google 
>> is doing, except that we collect no information that could turn it into a 
>> search engine, since we don't care about page content. Web pages nowadays 
>> are designed with SEO in mind, which means they are trying to please web 
>> spiders such as Google and Baidu. What's more, the speed of our crawler is 
>> much slower... I really don't see how this could do harm to sites that 
>> allow Google and Baidu to crawl thousands of times faster, yet feel 
>> endangered by Anansi.
>>
>> I am simply trying to make a persuasive case for such a project using 
>> BOINC, and of course the "research" may sometimes look weird and 
>> unreasonable in the real world, but that is really what the project is 
>> doing.
>>
>>
>> Best,
>>
>> -Kunsheng
>>
>>
>>
>> --- On Sat, 9/12/09, Lynn W. Taylor <[email protected]> wrote:
>>
>>> From: Lynn W. Taylor <[email protected]>
>>> Subject: Re: [boinc_dev] Anansi is coming back
>>> To: "Kunsheng Chen" <[email protected]>
>>> Cc: "BOINC dev" <[email protected]>
>>> Date: Saturday, September 12, 2009, 9:31 PM
>>> Kunsheng,
>>>
>>> You want me to let you use my computers, you're going to
>>> use my network 
>>> connection (my bandwidth), and you aren't going to tell me
>>> what you're 
>>> doing because it is "hard to explain?"
>>>
>>> ... and you think that shouldn't raise an alarm?
>>>
>>> Okay, URIs.  Only two accesses per URI.  If a
>>> single site has 360 URIs, 
>>> that is 720 accesses.  There are 200 sites here (in
>>> round numbers) which 
>>> is something around 144,000 accesses.
>>>
>>> I don't care if that's spread over two weeks, but I do care
>>> if they're 
>>> all in the same hour, especially if they won't in some way
>>> generate 
>>> revenue for me or for my customers.
>>>
>>> ... and I don't want it at all if I don't know why you're
>>> scanning.
>>>
>>> You're doing it in a distributed fashion, so I can't block
>>> your project 
>>> by IP -- requests will come from anywhere.
>>>
>>> If you think I'm being mean or unreasonable, I'm
>>> sorry.  I'm trying to 
>>> give you an idea of what you're going to run into.
>>>
>>> I'm trying to tell you you'll avoid a lot of this if you'd
>>> just explain 
>>> what you're doing in detail.
>>>
>>> "Detail" doesn't mean "input to Map-Reduce" -- tell me
>>> about the output 
>>> from Map-Reduce.  Tell me how that is going to be
>>> used, and how that 
>>> might improve the internet, or the browsing experience, or
>>> something 
>>> useful like that.
>>>
>>> Tell me what you hope to prove when you write papers based
>>> on this research.
>>>
>>> Use your BOINC site to tell everyone.
>>>
>>> -- Lynn
>>>
>>> Kunsheng Chen wrote:
>>>> Hi Lynn,
>>>>
>>>> Regarding the research goal... it is really hard to explain. Basically, 
>>>> our goal sits in a backbone framework on our Hadoop DFS, and a web 
>>>> crawler was chosen as an application that can return information (URIs) 
>>>> to support some algorithms in the Map-Reduce field.
>>>>
>>>> Only two replications will be dispatched, to different clients (so a URI 
>>>> will be crawled by no more than two clients in a given period, and in 
>>>> most cases those two clients won't run at the same time). The probability 
>>>> of a URI being revisited within 10 days is 5% (controlled by our 
>>>> algorithm), so it will never happen that 3000 clients crawl the same URI 
>>>> (or even two of them at once).
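>>>>
>>>> (A small illustration of such a revisit policy, assuming the 5% figure is 
>>>> applied per URI per 10-day window; the function and constant names are 
>>>> made up for the example.)
>>>>
>>>>   import random
>>>>
>>>>   REVISIT_PROBABILITY = 0.05   # chance a known URI is re-queued within 10 days
>>>>
>>>>   def should_revisit(uri):
>>>>       # each already-crawled URI gets at most a 5% chance per window
>>>>       return random.random() < REVISIT_PROBABILITY
>>>>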
>>>> I know things like this really raise worries about privacy and network 
>>>> traffic, but it really shouldn't in this case.
>>>>
>>>> Thanks,
>>>>
>>>> -Kunsheng
>>>>
>>>>
>>>> --- On Sat, 9/12/09, Lynn W. Taylor <[email protected]>
>>> wrote:
>>>>> From: Lynn W. Taylor <[email protected]>
>>>>> Subject: Re: [boinc_dev] Anansi is coming back
>>>>> To: "Kunsheng Chen" <[email protected]>
>>>>> Cc: "BOINC dev" <[email protected]>
>>>>> Date: Saturday, September 12, 2009, 2:35 AM
>>>>> Kunsheng,
>>>>>
>>>>> Please don't get me wrong, I'm trying to be
>>> helpful.
>>>>> I look at your site, and I don't see anything
>>> about the
>>>>> research goals.
>>>>>
>>>>> Nothing about the possible benefits of this
>>> research.
>>>>> From your site, it doesn't even tell me that you
>>> just want
>>>>> the URIs.
>>>>>
>>>>> Mostly, it doesn't tell me why I'd want to help.
>>>>>
>>>>> As far as damage: if 10,000 Anansi users decided
>>> to hit my
>>>>> servers all at once, how would that differ from a
>>> botnet
>>>>> attacking my server?
>>>>>
>>>>> ... and none of the information below is on your
>>> site.
>>>>> -- Lynn
>>>>>
>>>>> Kunsheng Chen wrote:
>>>>>> Hi everyone,
>>>>>>
>>>>>>
>>>>>> Sorry for the confusion in the project description.
>>>>>>
>>>>>> Anansi is a distributed web crawler for research purposes only, and it 
>>>>>> is well-mannered: it complies with the rules in "robots.txt" defined by 
>>>>>> each URI host it meets.
>>>>>> The goal of the project is to explore as many URIs as can be directly 
>>>>>> reached by the public. The URIs returned to the server are only used 
>>>>>> for a scheduling algorithm implemented with the Map-Reduce framework, 
>>>>>> so *NO harm* is done to any group or individual.
>>>>>> The information collected by Anansi is no more than the URIs themselves:
>>>>>>
>>>>>> *NO page content* will be returned;
>>>>>>
>>>>>> *NO e-mail address* will be returned;
>>>>>>
>>>>>> *NO user/password* will be returned.
>>>>>>
>>>>>>
>>>>>> Each task contains one URI at a time, in order to reduce the load on 
>>>>>> volunteers.
>>>>>>
>>>>>> The project has been running internally on a couple of machines for a 
>>>>>> while, and we are now looking for help from the public.
>>>>>> The project link, yes, it is http://canis.csc.ncsu.edu:8005/anansi
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for your participation,
>>>>>>
>>>>>> -Kunsheng
>>>>>>
>>>>>>
>>
>>
> 
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
