Dennis Kubes
Tue, 15 Jul 2008 07:57:32 -0700
Right now the problem is not so much the run time as the intermediate output. A two depth run, which identifies reciprocal and 2-node cycles such as A->B->C->A creates a few hundred gigs of output on an 800M url database. A three depth run to identify A->B->C->D->A generates > 10T of intermediate output.
What I am really trying to do is find a more efficient way to either: 1) Store the webgraph to apply a computation2) Process the graph as in walking a tree in a way that doesn't lead to storing (processing is fine) outlinks of outlinks of outlinks, etc.
Don't know if that helps to explain the problem or not. Dennis hank williams wrote:
On Tue, Jul 15, 2008 at 10:20 AM, brainstorm <[EMAIL PROTECTED]> wrote:On Tue, Jul 15, 2008 at 4:12 PM, hank williams <[EMAIL PROTECTED]> wrote:Note that these algorithms will *not* resolve the fundamentallyexponentialnature of the problem,O(|V|+|E|), IIRCTrue. The problem is not actually exponential. It is the *dataset* that is when you talk about the web graph. The problem is essentially linear with the size of the dataset. If your dataset is constrained (by setting a limit) then so is your execution.efficiently as possible, but this still involves essentially visitingeverynode in the tree at least once until you hit a cycle in the tree ordecideyou have traveled too far from the start. Moreover, after quicklyperusing(in other words I could have missed it) I did not see specific mention of setting a depth of search limit. it is critical that you set a limit oryouwill end up attempting to spider the entire internet in search of cycles. The examples shown in the references are (for good reason) constrained in terms of node count, but on the real web the pool is (obviously) muchlargerand the problem becomes fundamentally different. Hank On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:Excellent. Thanks was exactly what I needed. Thanks Andrzej. Dennis Andrzej Bialecki wrote:Dennis Kubes wrote:Does anybody know how to efficiently (non-exponentially) walk a webgraphto detect cycles. This would be very useful in identifying spammywebpageand tight knit communities. I have a program that I will be releasing soon that does the detection through converting a webgraph into a tree and walking the tree nodes,but itis exponential in terms of intermediate map reduce output andcomputation.Perhaps this could be helpful: * http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf<http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf><http://www.cs.berkeley.edu/%7Ekamil/teaching/sp03/041403.pdf>see the last section.* http://citeseer.ist.psu.edu/395003.html * http://en.wikipedia.org/wiki/Cycle_detection-- blog: whydoeseverythingsuck.com