Thanks for the quick reply

> Note that link prediction itself is a valid criterion for model
> selection, so you might just want to focus on that.

Not sure how to interpret this. Are you saying that even if the non-Poisson
model has lower entropy, the Poisson model may still be 'better' for
community detection if it's better at link prediction (which you suggest is
most likely the case)?

> I'm not sure what you mean. The site seems fine.

Search on this page https://git.skewed.de/count0/graph-tool has always
given me a 500 error.

> You should use logsumexp():

Wouldn't that be non-incremental? Wouldn't the incremental version of this
be to accumulate the exp(a) of the results at each step and then take the
log at the end?
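
For concreteness, here is roughly what I mean by "incremental" (just a
sketch; `log_probs` stands for the per-sample values I get back from
`get_edges_prob`; I suppose `np.logaddexp` gives the same result as a final
logsumexp() while avoiding the underflow of summing exp(a) directly):

```
import numpy as np

# Running log-sum-exp accumulator: mathematically the same as calling
# logsumexp() over all samples at the end, but it can be updated one
# sample at a time, and it avoids the underflow you would get from
# summing exp(a) directly when a is very negative.
acc = -np.inf
nsamples = 0
for a in log_probs:      # log_probs: stand-in for the per-sample log-probabilities
    acc = np.logaddexp(acc, a)
    nsamples += 1

log_mean_prob = acc - np.log(nsamples)   # log of the posterior-mean probability
```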

> The problem with this is that it will not resolve very small
> probabilities, since the edge would never be seen in such a case. But
> the whole approach would be much faster.

Yes, I ran into this problem. I'll paste the code below to make sure I'm not
doing anything wrong:

```
import graph_tool.all as gt

# each observed edge: measured n=1 time, seen x=1 time;
# each non-edge: n_default=1 measurement, x_default=0 positive observations
n = G.new_ep('int', 1)
x = G.new_ep('int', 1)
state = gt.MeasuredBlockState(G, n=n, x=x, n_default=1, x_default=0,
                              nested=True, self_loops=True,
                              state_args={'deg_corr': True})

# equilibrate the Markov chain
gt.mcmc_equilibrate(state,
                    wait=1000,
                    epsilon=1e-5,
                    mcmc_args=dict(niter=10),
                    multiflip=True,
                    verbose=True)

# collect edge marginals from the posterior
u = None

def collect_marginals(s):
    global u
    u = s.collect_marginal(u)

gt.mcmc_equilibrate(state,
                    force_niter=1000,
                    mcmc_args=dict(niter=10),
                    multiflip=True,
                    verbose=True,
                    callback=collect_marginals)

eprob = u.ep.eprob

non_edge_found_c = 0
edge_not_found_c = 0

# compare every ordered vertex pair in the original graph against the
# marginal graph
for v in G.vertices():
    for w in G.vertices():
        edge = G.edge(v, w)
        u_edge = u.edge(v, w)

        if not edge:
            if u_edge:
                non_edge_found_c += 1
                print("Non-edge in original, found in marginal graph.")
                print(u_edge, eprob[u_edge])
                print()
        else:
            if not u_edge:
                edge_not_found_c += 1
                print("Edge in original graph, but not an edge in marginal graph.")
                print(edge)
                print()

print(non_edge_found_c, edge_not_found_c)
```

I'm assuming that `u = s.collect_marginal(u)` takes care of accumulating the
non-edge appearances over every sample (and not just the last one). And I
assume that if `u.edge(v, w)` is None, the non-edge was never observed in
any sample, correct?
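
In other words, I'm reading the marginal graph like this (a sketch, assuming
my interpretation above is right):

```
def marginal_prob(u, eprob, v, w):
    # posterior probability that (v, w) is an edge, read off the marginal
    # graph; pairs never sampled as an edge get 0, i.e. anything below the
    # sampler's resolution
    e = u.edge(v, w)
    return eprob[e] if e is not None else 0.0
```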

Well, I just ran it on a smaller graph to test. The graph is directed with
475 nodes and 20,911 edges. Intuitively, I would expect a reasonable number
of non-edges to be observed given that there are ~21k edges; maybe ~500 at
least. Running the code above, I see that `non_edge_found_c` is 13. Only 13
non-edges are observed with non-zero probability? The largest of those 13
probabilities is 0.07. Also, `edge_not_found_c` is 2. How should I interpret
the situation where an edge is in the original graph but not in (any of?)
the marginals?
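
On the non-edge side, thinking about the resolution point you mentioned: if
I follow, each callback call contributes one sample, so with
`force_niter=1000` the marginal graph can only resolve probabilities down to
roughly 1/1000:

```
nsamples = 1000        # force_niter above: one collect_marginal() call per sweep
resolution = 1 / nsamples
print(resolution)      # 0.001; non-edges whose posterior probability is well
                       # below this will essentially never appear in u
```

Even so, only 13 pairs above that floor seems low to me.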

Am I doing something wrong here? Do I need to adjust `n`, `x`, `n_default`,
or `x_default`?

Thanks for your help, as always

On Tue, Apr 7, 2020 at 5:15 PM Tiago de Paula Peixoto <[email protected]>
wrote:

> Am 07.04.20 um 22:03 schrieb Deklan Webster:
> > To make sure I understand about the model selection, I ran each of the
> > four possibilities thrice for a fixed 1000 iterations
> >
> > ```
> > Directed: True
> > Num vertices 3672
> > Num edges 312779
> >
> > Entropy for degr_corr=True, poisson=True: 1044304.4179465043
> > Entropy for degr_corr=True, poisson=True: 1044708.7346586153
> > Entropy for degr_corr=True, poisson=True: 1044863.5150718079
> >
> > Entropy for degr_corr=False, poisson=True: 1094026.9370843305
> > Entropy for degr_corr=False, poisson=True: 1093860.5158280218
> > Entropy for degr_corr=False, poisson=True: 1095110.0929428462
> >
> > Entropy for degr_corr=True, poisson=False: 943149.5746761584
> > Entropy for degr_corr=True, poisson=False: 943741.5558787254
> > Entropy for degr_corr=True, poisson=False: 943772.2769395269
> >
> > Entropy for degr_corr=False, poisson=False: 1000768.068855249
> > Entropy for degr_corr=False, poisson=False: 998721.4409976124
> > Entropy for degr_corr=False, poisson=False: 999301.5197368631
> > ```
> >
> > So, is this the following valid?: degree correction improves the result
> > in both cases. But, the Poisson multigraph doesn't make an improvement.
> > So, for community detection I should just stick with a regular
> > degree-corrected NestedBlockState.
>
> The Poisson multigraph will rarely improve things when doing
> minimization, i.e. finding the most likely partition. This is because
> the mode of the distribution tends to be on simple graphs. However, it
> will improve things when averaging over partitions. But in that case you
> need to compute the evidence, not the description length, to compare
> models, but that is more difficult to do (a simple approximation is
> shown in the documentation).
>
> Note that link prediction itself is a valid criterion for model
> selection, so you might just want to focus on that.
>
> > Does this have any implications for
> > what is optimal for link prediction?
>
> Averaging over partitions is always better for link prediction, and you
> need a model that works better when averaged. I would be very surprised if
> the Poisson model does not work better for link prediction in general.
>
> > Also, I gave `MeasuredBlockState.get_edges_prob` a shot. To try to get
> > the run-time down I only considered vertex pairs at a distance of 1 or 2
> > (in the directed sense). That gives about 6m pairs. On my reasonably
> > fast laptop each sampling iteration took 4 minutes. I just did 100
> > iterations total over about ~7 hours. Is it possible to speed this up?
>
> An approach which is linear in the number of nodes, rather than
> quadratic, is to collect the edge marginals as it is explained in the
> documentation, rather than the actual conditional probabilities. In
> other words, you just count how often an edge is seen or not in the
> posterior distribution.
>
> The problem with this is that it will not resolve very small
> probabilities, since the edge would never be seen in such a case. But
> the whole approach would be much faster.
>
> > I tried to search for the code on Gitlab to check how it's implemented
> > but
> > I'm getting a 500 error on every search (it's been like this for me
> > forever).
>
> I'm not sure what you mean. The site seems fine.
>
> > Am I likely to get substantially improved results with more
> > than 100 iterations?
>
> I have no idea.
>
> > To try to save memory I just added the log probs together at each step
> > and took their arithmetic mean at the end (as you can see). Is there a
> > more meaningful mean to use here that can be computed incrementally?
>
> You should use logsumexp():
>
>
> https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.logsumexp.html
>
> > Anyway, using only 100 sampling iterations and using the arithmetic mean
> > of the log probs, this feature ended up being the most important feature
> > (according to SHAP) by a pretty large margin. Seems to confirm what you
> > were speculating on Twitter. Here are two images illustrating that:
> >
> > On every test instance:
> > https://i.imgur.com/Cx9tLJu.png
> >
> > On top 100 most likely test instances:
> > https://i.imgur.com/bfopyQj.png
> >
> > For reference, the precision@100 here is 96%.
> >
> > So, pretty good even though it's by far the most time-costly feature to
> > compute. Superposed random walks and Katz index take < 1 minute each,
> > for instance.
>
> Well, there is no free lunch.
>
> Best,
> Tiago
>
> --
> Tiago de Paula Peixoto <[email protected]>