True, you would definitely want to set a depth limit, not just to limit run time but also because of the fundamental nature of the problem. Reciprocal links and link loops of length 2, 3, 4, 5, maybe even 6 tend to indicate tight-knit communities. Beyond that you start getting into the random nature of web connections.
Right now the problem is not so much the run time as the intermediate output. A depth-two run, which identifies reciprocal links and 3-node cycles such as A->B->C->A, creates a few hundred gigs of output on an 800M URL database. A depth-three run, to identify 4-node cycles such as A->B->C->D->A, generates > 10T of intermediate output.
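Back of the envelope (the average out-degree here is a guess on my part, not a measured number): if each of the 800M URLs carries an average of d outlinks, a depth-k run that materializes full paths writes on the order of 800M * d^k intermediate records, so each extra level of depth multiplies the output by roughly d. With d somewhere in the tens, that matches the jump from hundreds of gigabytes at depth two to >10T at depth three.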
What I am really trying to do is find a more efficient way to either:
1) Store the webgraph in a form that a computation can be applied to, or
2) Process the graph, as in walking a tree, in a way that doesn't lead to storing (processing is fine) outlinks of outlinks of outlinks, etc. (see the sketch below).
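To illustrate point 2, here is a minimal sketch of the idea for the reciprocal-link (2-cycle) case, in plain in-memory Java standing in for the map and reduce phases (the class name and toy edges are made up for the example): each directed edge is grouped under an order-independent key, so the intermediate data stays O(|E|) and no paths are ever stored. Longer cycles would need more rounds, e.g. propagating only (origin, frontier) pairs with per-round deduplication rather than full paths.

import java.util.*;

// Sketch only, not the real MapReduce job: a key that collects both
// directions of an edge marks a reciprocal link (a 2-cycle).
public class ReciprocalLinks {
  public static void main(String[] args) {
    // Toy edges standing in for the link database.
    String[][] edges = { {"A","B"}, {"B","A"}, {"B","C"}, {"C","D"} };

    // "Map" phase: emit each directed edge under its canonical (min|max) key.
    Map<String, Set<String>> groups = new HashMap<>();
    for (String[] e : edges) {
      String key = e[0].compareTo(e[1]) < 0 ? e[0] + "|" + e[1]
                                            : e[1] + "|" + e[0];
      groups.computeIfAbsent(key, k -> new HashSet<>()).add(e[0] + "->" + e[1]);
    }

    // "Reduce" phase: two distinct directions under one key => reciprocal.
    for (Map.Entry<String, Set<String>> g : groups.entrySet())
      if (g.getValue().size() == 2)
        System.out.println("reciprocal: " + g.getKey());
  }
}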
Don't know if that helps to explain the problem or not.
Dennis
hank williams wrote:
On Tue, Jul 15, 2008 at 10:20 AM, brainstorm <[EMAIL PROTECTED]> wrote:
On Tue, Jul 15, 2008 at 4:12 PM, hank williams <[EMAIL PROTECTED]> wrote:
Note that these algorithms will *not* resolve the fundamentally exponential nature of the problem,
O(|V|+|E|), IIRC
True. The problem is not actually exponential; it is the *dataset* that is huge when you talk about the web graph. The problem is essentially linear in the size of the dataset. If your dataset is constrained (by setting a limit) then so is your execution.
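For concreteness, a minimal sketch of that linear behavior on a toy in-memory graph (not Dennis's MapReduce setup; the class name and data are invented for the example): a page lies on a cycle exactly when it can reach itself through at least one edge, and a breadth-first probe with a visited set touches each node and edge at most once, while the depth cap keeps it from wandering across the whole web.

import java.util.*;

// Sketch: is `start` on a cycle of length <= maxDepth? BFS with a visited
// set touches each node and edge at most once, so one probe is O(|V|+|E|).
public class CycleProbe {
  static boolean onCycle(Map<String, List<String>> g, String start, int maxDepth) {
    Set<String> visited = new HashSet<>();
    List<String> frontier = new ArrayList<>(g.getOrDefault(start, List.of()));
    for (int depth = 1; depth <= maxDepth && !frontier.isEmpty(); depth++) {
      List<String> next = new ArrayList<>();
      for (String page : frontier) {
        if (page.equals(start)) return true;  // a path led back: cycle found
        if (visited.add(page)) next.addAll(g.getOrDefault(page, List.of()));
      }
      frontier = next;
    }
    return false;  // no cycle through start within maxDepth hops
  }

  public static void main(String[] args) {
    Map<String, List<String>> g = Map.of(
        "A", List.of("B"), "B", List.of("C"), "C", List.of("A", "D"));
    System.out.println(onCycle(g, "A", 6));  // true: A->B->C->A
  }
}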
efficiently as possible, but this still involves essentially visiting every node in the tree at least once until you hit a cycle in the tree or decide you have traveled too far from the start. Moreover, after quickly perusing (in other words I could have missed it) I did not see specific mention of setting a search-depth limit. It is critical that you set a limit or you will end up attempting to spider the entire internet in search of cycles. The examples shown in the references are (for good reason) constrained in terms of node count, but on the real web the pool is (obviously) much larger and the problem becomes fundamentally different.
Hank
On Tue, Jul 15, 2008 at 9:43 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
Excellent. That was exactly what I needed. Thanks Andrzej.
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Does anybody know how to efficiently (non-exponentially) walk a web graph to detect cycles? This would be very useful in identifying spammy webpages and tight-knit communities.
I have a program that I will be releasing soon that does the detection by converting a webgraph into a tree and walking the tree nodes, but it is exponential in terms of intermediate MapReduce output and computation.
Perhaps this could be helpful:
* http://www.cs.berkeley.edu/~kamil/teaching/sp03/041403.pdf (see the last section)
* http://citeseer.ist.psu.edu/395003.html
* http://en.wikipedia.org/wiki/Cycle_detection
--
blog: whydoeseverythingsuck.com