Kunsheng,

You want me to let you use my computers, you're going to use my network 
connection (my bandwidth), and you aren't going to tell me what you're 
doing because it is "hard to explain?"

... and you think that shouldn't raise an alarm?

Okay, URIs.  Only two accesses per URI.  If a single site has 360 URIs, 
that is 720 accesses.  There are 200 sites here (in round numbers) which 
is something around 144,000 accesses.
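
For what it's worth, here's that arithmetic as a quick sketch, using only the round numbers above:

```python
# Back-of-the-envelope load estimate, from the round numbers above:
# two accesses per URI, ~360 URIs per site, ~200 sites.
accesses_per_uri = 2
uris_per_site = 360
sites = 200

per_site = accesses_per_uri * uris_per_site   # 720 accesses per site
total = per_site * sites                      # 144,000 accesses overall

# Spread over two weeks that's background noise; packed into one
# hour it's roughly 40 requests per second.
per_second_if_one_hour = total / 3600

print(per_site, total, round(per_second_if_one_hour))
```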

I don't care if that's spread over two weeks, but I do care if they all 
land in the same hour, especially when they won't generate any revenue 
for me or for my customers.

... and I don't want it at all if I don't know why you're scanning.

You're doing it in a distributed fashion, so I can't block your project 
by IP -- requests will come from anywhere.

If you think I'm being mean or unreasonable, I'm sorry.  I'm trying to 
give you an idea of what you're going to run into.

I'm trying to tell you you'll avoid a lot of this if you'd just explain 
what you're doing in detail.

"Detail" doesn't mean "input to Map-Reduce" -- tell me about the output 
from Map-Reduce.  Tell me how that is going to be used, and how that 
might improve the internet, or the browsing experience, or something 
useful like that.

Tell me what you hope to prove when you write papers based on this research.

Use your BOINC site to tell everyone.

-- Lynn

Kunsheng Chen wrote:
> Hi Lynn,
> 
> Regarding the research goal... it is really hard to explain. Basically, 
> our goal sits in a backbone framework in our Hadoop DFS, and a web 
> crawler was chosen as an application that can return information (URIs) 
> to support some algorithms in the field of Map-Reduce.
> 
> Only two replications will be dispatched, to different clients (so a URI 
> will be crawled by no more than two clients in a given period, and in 
> most cases these two clients won't run at the same time). Since the 
> probability of a URI being revisited within 10 days is only 5% 
> (controlled by our algorithm), it will never happen that 3000 clients 
> crawl the same URI (or even two of them at once).
> 
> I know things like this can raise worries about privacy and network 
> traffic, but it really shouldn't in this case.
> 
> 
> 
> Thanks,
> 
> -Kunsheng
> 
> 
> --- On Sat, 9/12/09, Lynn W. Taylor <[email protected]> wrote:
> 
>> From: Lynn W. Taylor <[email protected]>
>> Subject: Re: [boinc_dev] Anansi is coming back
>> To: "Kunsheng Chen" <[email protected]>
>> Cc: "BOINC dev" <[email protected]>
>> Date: Saturday, September 12, 2009, 2:35 AM
>> Kunsheng,
>>
>> Please don't get me wrong, I'm trying to be helpful.
>>
>> I look at your site, and I don't see anything about the
>> research goals.
>>
>> Nothing about the possible benefits of this research.
>>
>> From your site, it doesn't even tell me that you just want
>> the URIs.
>>
>> Mostly, it doesn't tell me why I'd want to help.
>>
>> As far as damage: if 10,000 Anansi users decided to hit my
>> servers all at once, how would that differ from a botnet
>> attacking my server?
>>
>> ... and none of the information below is on your site.
>>
>> -- Lynn
>>
>> Kunsheng Chen wrote:
>>> Hi everyone,
>>>
>>>
>>> Sorry for the confusion in the project description.
>>>
>>>
>>> Anansi is a distributed web crawler used only for research purposes, 
>>> and it is well behaved: it complies with the rules in "robots.txt" 
>>> defined by each URI host it meets.
>>>
>>> The goal of the project is to explore as many URIs as possible that 
>>> can be directly reached by the public. The URIs returned to the 
>>> server are used only by a scheduling algorithm implemented with the 
>>> Map-Reduce framework, so *NO harm* is done to any organization or 
>>> individual.
>>>
>>>
>>> Information collected by Anansi is no more than the URIs themselves:
>>>
>>> *NO page content* will be returned;
>>>
>>> *NO E-mail address* will be returned;
>>>
>>> *No user/password* will be returned;
>>>
>>>
>>> Each task contains one URI, in order to reduce the load on 
>>> volunteers.
>>>
>>> The project has been running internally on a couple of machines for a 
>>> while, and we are now looking for help from the public.
>>> The project link, yes, it is http://canis.csc.ncsu.edu:8005/anansi
>>>
>>>
>>>
>>> Thanks for your participation,
>>>
>>> -Kunsheng
>>>
>>>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
