P.S. Sorry to clutter the developers' list with this. Hopefully it'll benefit someone else getting started.
Lynn W. Taylor wrote:
> Trust me, if Google wants to spider pages at too high a rate, they will
> get complaints from us -- and they have ways we can tell them to slow
> down their crawls.
>
> Their crawlers are well-spaced in time, very server-friendly.
>
> You're talking to someone who is very much interested in SEO. That's
> one of the things we sell our customers -- pages that show up in Google.
>
> Pages that draw customers.
>
> Even if Google were much more aggressive, we'd tolerate a lot from them
> because they do generate traffic and revenue.
>
> Your project is pure research. It won't drive revenue to us, at least
> not in the same way Google or Yahoo or Bing drives revenue.
>
> So if you want volunteers, there should be a benefit.
>
> I crunched for HashClash. It seems that HashClash was very much like
> your project, trying to help someone write their thesis. I was
> interested in it because I use cryptographic hashes daily, and I use
> them in a way where clashes would be bad.
>
> I crunch SETI because I believe.
>
> If you want people to "crunch" for you, give them something to believe in.
>
> -- Lynn
>
> Kunsheng Chen wrote:
>> Lynn,
>>
>> I just updated the project description on the web site.
>>
>> The goal of the project -- more concretely, my thesis title -- is the
>> evaluation of the accuracy and performance of a distributed web
>> crawler, and BOINC was chosen as the platform.
>>
>> I know that might be too broad, so basically: we have a BOINC server
>> and, as you know, clients. The BOINC server dispatches URIs to be
>> crawled, and each client crawls its URI's page and returns the URIs
>> found on it to the server.
>>
>> The Map-Reduce program is similar to Google's PageRank algorithm. The
>> input is a URI together with the sub-URIs found on its page (from the
>> clients), and the output is a file of <URI, priority> pairs. The
>> priority is calculated from the in-degree and out-degree, the
>> timestamp, and the changes compared to the previous crawl, all based
>> on the results we keep in the Hadoop DFS. URIs with higher priorities
>> are queued in the waiting list to be crawled. So this is only a
>> crawling application running on the BOINC infrastructure, and the
>> growth of the URI set is natural -- there is no intention to attack
>> anyone or to create network traffic.
>>
>> Moreover, the crawler is well-mannered: it complies with robots.txt
>> for each URI waiting to be crawled.
>>
>> Your concern is valuable to me; it reminds me of a lot of security
>> and privacy problems. But what we are doing is actually similar to
>> what Google is doing, except that we keep no information that could
>> turn it into a search engine, since we don't care about page content.
>> Web pages nowadays are designed with SEO in mind, which means they
>> try to please web spiders such as Google and Baidu. And what's more,
>> our crawler is much slower... I really don't see how this could harm
>> sites that allow Google and Baidu to crawl thousands of times faster,
>> yet feel threatened by Anansi.
>>
>> I am trying nothing more than to make a persuasive case for such a
>> project using BOINC, and of course the "research" may sometimes look
>> strange or unreasonable in the real world, but that is really what
>> the project is doing.
>>
>> Best,
>>
>> -Kunsheng
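For anyone trying to picture the <URI, priority> step Kunsheng describes
above, here is a minimal Hadoop Streaming-style reducer sketch in Python.
The field layout, the weights, and the names are my own assumptions for
illustration, not the project's actual code:

    #!/usr/bin/env python
    # reducer.py -- hypothetical sketch: emit one <URI, priority> pair per URI.
    # Assumes the map phase already produced one tab-separated record per URI:
    #   uri \t in_degree \t out_degree \t last_crawl_timestamp \t changed_flag
    import sys
    import time

    def priority(in_deg, out_deg, last_ts, changed):
        """Toy scoring: favor well-linked, stale, and recently changed pages."""
        age_days = (time.time() - last_ts) / 86400.0
        score = 1.0 * in_deg - 0.2 * out_deg + 0.1 * age_days
        if changed:
            score += 5.0
        return score

    for line in sys.stdin:
        uri, in_deg, out_deg, last_ts, changed = line.rstrip("\n").split("\t")
        p = priority(int(in_deg), int(out_deg), float(last_ts), changed == "1")
        # The server would queue the highest-priority URIs for crawling.
        print("%s\t%.3f" % (uri, p))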
>>
>> --- On Sat, 9/12/09, Lynn W. Taylor <[email protected]> wrote:
>>
>>> From: Lynn W. Taylor <[email protected]>
>>> Subject: Re: [boinc_dev] Anansi is coming back
>>> To: "Kunsheng Chen" <[email protected]>
>>> Cc: "BOINC dev" <[email protected]>
>>> Date: Saturday, September 12, 2009, 9:31 PM
>>>
>>> Kunsheng,
>>>
>>> You want me to let you use my computers, you're going to use my
>>> network connection (my bandwidth), and you aren't going to tell me
>>> what you're doing because it is "hard to explain?"
>>>
>>> ... and you think that shouldn't raise an alarm?
>>>
>>> Okay, URIs. Only two accesses per URI. If a single site has 360
>>> URIs, that is 720 accesses. There are 200 sites here (in round
>>> numbers), which is something around 144,000 accesses.
>>>
>>> I don't care if that's spread over two weeks, but I do care if
>>> they're all in the same hour, especially if they won't in some way
>>> generate revenue for me or for my customers.
>>>
>>> ... and I don't want it at all if I don't know why you're scanning.
>>>
>>> You're doing it in a distributed fashion, so I can't block your
>>> project by IP -- requests will come from anywhere.
>>>
>>> If you think I'm being mean or unreasonable, I'm sorry. I'm trying
>>> to give you an idea of what you're going to run into.
>>>
>>> I'm trying to tell you you'll avoid a lot of this if you'd just
>>> explain what you're doing in detail.
>>>
>>> "Detail" doesn't mean "input to Map-Reduce" -- tell me about the
>>> output from Map-Reduce. Tell me how that is going to be used, and
>>> how that might improve the internet, or the browsing experience, or
>>> something useful like that.
>>>
>>> Tell me what you hope to prove when you write papers based on this
>>> research.
>>>
>>> Use your BOINC site to tell everyone.
>>>
>>> -- Lynn
>>>
>>> Kunsheng Chen wrote:
>>>> Hi Lynn,
>>>>
>>>> Regarding the research goal... it is really hard to explain.
>>>> Basically, our goal sits in a backbone framework in our Hadoop DFS,
>>>> and a web crawler was chosen as an application that can return
>>>> information (URIs) to support some algorithms in the Map-Reduce
>>>> field.
>>>>
>>>> Only two replicas of each task are dispatched, to different clients
>>>> (so a URI will be crawled by no more than two clients in a given
>>>> period, and in most cases those two clients won't run at the same
>>>> time). The probability of a URI being revisited within 10 days is
>>>> about 5% (controlled by our algorithm), so it will never happen
>>>> that 3,000 clients crawl the same URI at once (or even two of
>>>> them).
>>>>
>>>> I know things like this really raise worries about privacy and
>>>> network traffic, but it really shouldn't in this case.
>>>>
>>>> Thanks,
>>>>
>>>> -Kunsheng
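A rough sketch of the dispatch policy described just above (two replicas
per URI, and roughly a 5% chance of revisiting a URI within 10 days). The
function names and the way the 5% is applied are assumptions on my part,
for illustration only:

    import random
    import time

    REPLICAS_PER_URI = 2            # each URI goes to at most two clients
    REVISIT_WINDOW = 10 * 86400     # 10 days, in seconds
    REVISIT_PROB = 0.05             # chance of re-crawling within that window

    def should_dispatch(last_crawled_at, now=None):
        """Decide whether a URI goes back into the crawl queue."""
        if now is None:
            now = time.time()
        if last_crawled_at is None:
            return True                            # never crawled: dispatch
        if now - last_crawled_at < REVISIT_WINDOW:
            return random.random() < REVISIT_PROB  # rare early revisit
        return True                                # window expired: eligible

    def make_workunits(uri):
        """Create the (at most) two replica tasks for one URI."""
        return [{"uri": uri, "replica": i} for i in range(REPLICAS_PER_URI)]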
>>>>
>>>> --- On Sat, 9/12/09, Lynn W. Taylor <[email protected]> wrote:
>>>>
>>>>> From: Lynn W. Taylor <[email protected]>
>>>>> Subject: Re: [boinc_dev] Anansi is coming back
>>>>> To: "Kunsheng Chen" <[email protected]>
>>>>> Cc: "BOINC dev" <[email protected]>
>>>>> Date: Saturday, September 12, 2009, 2:35 AM
>>>>>
>>>>> Kunsheng,
>>>>>
>>>>> Please don't get me wrong, I'm trying to be helpful.
>>>>>
>>>>> I look at your site, and I don't see anything about the research
>>>>> goals.
>>>>>
>>>>> Nothing about the possible benefits of this research. From your
>>>>> site, it doesn't even tell me that you just want the URIs.
>>>>>
>>>>> Mostly, it doesn't tell me why I'd want to help.
>>>>>
>>>>> As far as damage: if 10,000 Anansi users decided to hit my servers
>>>>> all at once, how would that differ from a botnet attacking my
>>>>> server?
>>>>>
>>>>> ... and none of the information below is on your site.
>>>>>
>>>>> -- Lynn
>>>>>
>>>>> Kunsheng Chen wrote:
>>>>>> Hi everyone,
>>>>>>
>>>>>> Sorry for the confusion in the project description.
>>>>>>
>>>>>> Anansi is a distributed web crawler for research purposes only,
>>>>>> and it is well-mannered: it complies with the rules in the
>>>>>> "robots.txt" defined by each URI host it meets.
>>>>>>
>>>>>> The goal of the project is to discover as many URIs as possible
>>>>>> that can be reached directly by the public. The URIs returned to
>>>>>> the server are used only by a scheduling algorithm implemented on
>>>>>> the Map-Reduce framework, so *NO harm* is done to the public or
>>>>>> to any individual.
>>>>>>
>>>>>> The information collected by Anansi is nothing more than the URIs
>>>>>> themselves:
>>>>>>
>>>>>> *NO page content* will be returned;
>>>>>>
>>>>>> *NO e-mail addresses* will be returned;
>>>>>>
>>>>>> *NO usernames/passwords* will be returned.
>>>>>>
>>>>>> Each task contains a single URI, in order to reduce the load on
>>>>>> volunteers.
>>>>>>
>>>>>> The project has been running internally on a couple of machines
>>>>>> for a while, and we are now looking for help from the public.
>>>>>>
>>>>>> The project link is http://canis.csc.ncsu.edu:8005/anansi
>>>>>>
>>>>>> Thanks for your participation,
>>>>>>
>>>>>> -Kunsheng
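For anyone curious what "complying with robots.txt" means in practice,
here is a minimal sketch using Python's standard urllib.robotparser
module. The user-agent string and the function name are my own
assumptions, not Anansi's actual code:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "Anansi"   # hypothetical user-agent string

    def allowed_to_crawl(uri):
        """Fetch the host's robots.txt and ask whether this URI may be crawled."""
        parts = urlparse(uri)
        robots_url = "%s://%s/robots.txt" % (parts.scheme, parts.netloc)
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()                   # download and parse robots.txt
        return rp.can_fetch(USER_AGENT, uri)

    # Example: skip the URI entirely if robots.txt disallows it.
    if allowed_to_crawl("http://example.com/some/page.html"):
        pass  # fetch the page and extract its URIs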
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.