True, you would definitely want to set a depth limit, not just to limit run time but also because of the fundamental nature of the problem. Reciprocal links and link loops of length 2, 3, 4, 5, maybe even 6 tend to indicate tight-knit communities. Beyond that you start getting into the random nature of web connections.
Right now the problem is not so much the run time as the intermediate output. A depth-two run, which identifies reciprocal links and 3-node cycles such as A->B->C->A, creates a few hundred gigs of output on an 800M URL database. A depth-three run, to identify 4-node cycles such as A->B->C->D->A, generates > 10T of intermediate output.
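Back of the envelope (the average out-degree here is a guess on my part, not a measured number): if each of the 800M URLs carries an average of d outlinks, a depth-k run that materializes full paths writes on the order of 800M * d^k intermediate records, so each extra level of depth multiplies the output by roughly d. With d somewhere in the tens, that matches the jump from hundreds of gigabytes at depth two to >10T at depth three.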
What I am really trying to do is find a more efficient way to either:
1) Store the webgraph in a form that a computation can be applied to, or
2) Process the graph, as in walking a tree, in a way that doesn't lead to storing (processing is fine) outlinks of outlinks of outlinks, etc. (see the sketch below).
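To illustrate point 2, here is a minimal sketch of the idea for the reciprocal-link (2-cycle) case, in plain in-memory Java standing in for the map and reduce phases (the class name and toy edges are made up for the example): each directed edge is grouped under an order-independent key, so the intermediate data stays O(|E|) and no paths are ever stored. Longer cycles would need more rounds, e.g. propagating only (origin, frontier) pairs with per-round deduplication rather than full paths.

import java.util.*;

// Sketch only, not the real MapReduce job: a key that collects both
// directions of an edge marks a reciprocal link (a 2-cycle).
public class ReciprocalLinks {
  public static void main(String[] args) {
    // Toy edges standing in for the link database.
    String[][] edges = { {"A","B"}, {"B","A"}, {"B","C"}, {"C","D"} };

    // "Map" phase: emit each directed edge under its canonical (min|max) key.
    Map<String, Set<String>> groups = new HashMap<>();
    for (String[] e : edges) {
      String key = e[0].compareTo(e[1]) < 0 ? e[0] + "|" + e[1]
                                            : e[1] + "|" + e[0];
      groups.computeIfAbsent(key, k -> new HashSet<>()).add(e[0] + "->" + e[1]);
    }

    // "Reduce" phase: two distinct directions under one key => reciprocal.
    for (Map.Entry<String, Set<String>> g : groups.entrySet())
      if (g.getValue().size() == 2)
        System.out.println("reciprocal: " + g.getKey());
  }
}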
Don't know if that helps to explain the problem or not.
Dennis
hank williams wrote:
On Tue, Jul 15, 2008 at 10:20 AM, brainstorm <[EMAIL PROTECTED]> wrote:
On Tue, Jul 15, 2008 at 4:12 PM, hank williams <[EMAIL PROTECTED]> wrote:
Note that these algorithms will *not* resolve the fundamentally exponential nature of the problem,
O(|V|+|E|), IIRC
True. The problem is not actually exponential; it is the *dataset* that is huge when you talk about the web graph. The problem is essentially linear in the size of the dataset. If your dataset is constrained (by setting a limit) then so is your execution.
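For concreteness, a minimal sketch of that linear behavior on a toy in-memory graph (not Dennis's MapReduce setup; the class name and data are invented for the example): a page lies on a cycle exactly when it can reach itself through at least one edge, and a breadth-first probe with a visited set touches each node and edge at most once, while the depth cap keeps it from wandering across the whole web.

import java.util.*;

// Sketch: is `start` on a cycle of length <= maxDepth? BFS with a visited
// set touches each node and edge at most once, so one probe is O(|V|+|E|).
public class CycleProbe {
  static boolean onCycle(Map<String, List<String>> g, String start, int maxDepth) {
    Set<String> visited = new HashSet<>();
    List<String> frontier = new ArrayList<>(g.getOrDefault(start, List.of()));
    for (int depth = 1; depth <= maxDepth && !frontier.isEmpty(); depth++) {
      List<String> next = new ArrayList<>();
      for (String page : frontier) {
        if (page.equals(start)) return true;  // a path led back: cycle found
        if (visited.add(page)) next.addAll(g.getOrDefault(page, List.of()));
      }
      frontier = next;
    }
    return false;  // no cycle through start within maxDepth hops
  }

  public static void main(String[] args) {
    Map<String, List<String>> g = Map.of(
        "A", List.of("B"), "B", List.of("C"), "C", List.of("A", "D"));
    System.out.println(onCycle(g, "A", 6));  // true: A->B->C->A
  }
}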
efficiently as possible, but this still involves essentially visiting every node in the tree at least once until you hit a cycle in the tree or decide you have traveled too far from the start. Moreover, after quickly perusing (in other words I could have missed it) I did not see specific mention of setting a search-depth limit. It is critical that you set a limit or you will end up attempting to spider the entire internet in search of cycles. The examples shown in the references are (for good reason) constrained in terms of node count, but on the real web the pool is (obviously) much larger and the problem becomes fundamentally different.
Hank
On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
Excellent. That was exactly what I needed. Thanks Andrzej.
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Does anybody know how to efficiently (non-exponentially) walk a web graph to detect cycles? This would be very useful in identifying spammy webpages and tight-knit communities.
I have a program that I will be releasing soon that does the detection by converting a webgraph into a tree and walking the tree nodes, but it is exponential in terms of intermediate MapReduce output and computation.
Perhaps this could be helpful:
* http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf (see the last section)
* http://citeseer.ist.psu.edu/395003.html
* http://en.wikipedia.org/wiki/Cycle_detection
--
blog: whydoeseverythingsuck.com