Thanks for the quick reply

> Note that link prediction itself is a valid criterion for model
> selection, so you might just want to focus on that.
Not sure how to interpret this. Are you saying that even if the non-Poisson model has lower entropy, the Poisson model may still be 'better' for community detection if it's better at link prediction (which you suggest is most likely the case)?

> I'm not sure what you mean. The site seems fine.

Search on this page https://git.skewed.de/count0/graph-tool has always given me a 500 error.

> You should use logsumexp():

Wouldn't that be non-incremental? Wouldn't the incremental version of this be summing the exp(a) of the results at each step, and then taking the log at the end? (There's a rough sketch of what I mean further below.)

> The problem with this is that it will not resolve very small
> probabilities, since the edge would never be seen in such a case. But
> the whole approach would be much faster.

Yes, I ran into this problem. I'll paste the code to make sure I'm not doing anything wrong:

```
import graph_tool.all as gt  # (G is the directed graph, defined earlier)

n = G.new_ep('int', 1)  # each observed edge was measured once...
x = G.new_ep('int', 1)  # ...and seen once

state = gt.MeasuredBlockState(G, n=n, x=x, n_default=1, x_default=0,
                              nested=True, self_loops=True,
                              state_args={'deg_corr': True})

# equilibrate the Markov chain
gt.mcmc_equilibrate(state, wait=1000, epsilon=1e-5,
                    mcmc_args=dict(niter=10),
                    multiflip=True, verbose=True)

u = None

def collect_marginals(s):
    global u
    u = s.collect_marginal(u)

# collect the marginal graph over the posterior samples
gt.mcmc_equilibrate(state, force_niter=1000,
                    mcmc_args=dict(niter=10),
                    multiflip=True, verbose=True,
                    callback=collect_marginals)

eprob = u.ep.eprob

# compare every ordered vertex pair in G against the marginal graph
non_edge_found_c = 0
edge_not_found_c = 0
for x in G.vertices():
    for y in G.vertices():
        edge = G.edge(x, y)
        u_edge = u.edge(x, y)
        if not edge:
            if u_edge:
                non_edge_found_c += 1
                print("Non-edge in original, found in marginal graph.")
                print(u_edge, eprob[u_edge])
                print()
        else:
            if not u_edge:
                edge_not_found_c += 1
                print("Edge in original graph, but not an edge in marginal graph.")
                print(edge)
                print()

print(non_edge_found_c, edge_not_found_c)
```

I'm assuming that `u = s.collect_marginal(u)` is taking care of counting the non-edge appearances over every sample (and not just the last one). I assume that if `u.edge(x, y)` is None, the non-edge was never observed, correct?

Well, I ran it just now on a smaller graph to test. The graph is directed, with 475 nodes and 20,911 edges. Intuitively, I would expect a reasonable number of non-edges to be observed given that there are ~21k edges; maybe at least ~500. Running the code above, I see that `non_edge_found_c` is 13. Only 13 non-edges are observed with non-zero probability? The largest of those 13 probabilities is 0.07. And `edge_not_found_c` is 2. How should I interpret this situation, where the edge is in the original graph but not in (any of?) the marginals? Am I doing something wrong here? Do I need to adjust n, x, n_default, x_default?
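For reference, here is a linear-time variant of the same check, iterating directly over the edges of `u` and `G` instead of over all vertex pairs (a minimal sketch reusing `G`, `u` and `eprob` from the code above):

```
# Minimal sketch: same two counts as above, but iterating over edges directly
# instead of over all ordered vertex pairs. Reuses G, u and eprob from the
# snippet above.
non_edge_found_c = 0
for e in u.edges():
    # edge appears in the marginal graph; check whether it exists in G
    if G.edge(int(e.source()), int(e.target())) is None:
        non_edge_found_c += 1
        print("Non-edge in original, found in marginal graph:", e, eprob[e])

edge_not_found_c = 0
for e in G.edges():
    # edge exists in G; check whether it was ever sampled into the marginal
    if u.edge(int(e.source()), int(e.target())) is None:
        edge_not_found_c += 1
        print("Edge in original graph, but not in marginal graph:", e)

print(non_edge_found_c, edge_not_found_c)
```

It should give the same two counts, just without the quadratic loop over vertex pairs.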
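Also, on the logsumexp point above: summing exp(a) directly would underflow for very negative log-probabilities, but the accumulation can still be done one sample at a time by carrying a running maximum. A minimal sketch in plain Python (nothing graph-tool-specific; `sampled_log_probs` is just a placeholder name for the per-sample values):

```
import math

class StreamingLogSumExp:
    """Accumulate log(sum(exp(a_i))) one term at a time, without storing
    all terms and without underflow, by keeping a running maximum."""

    def __init__(self):
        self.m = -math.inf   # running maximum of the terms seen so far
        self.s = 0.0         # sum of exp(a_i - m)

    def add(self, a):
        if a <= self.m:
            self.s += math.exp(a - self.m)
        else:
            # new maximum: rescale the existing sum before switching to it
            self.s = self.s * math.exp(self.m - a) + 1.0
            self.m = a

    def value(self):
        return self.m + math.log(self.s)

# Usage sketch: average the sampled conditional log-probabilities in log space,
# i.e. take the log of the mean probability rather than the mean of the logs.
# acc = StreamingLogSumExp()
# for lp in sampled_log_probs:          # placeholder: one value per sample
#     acc.add(lp)
# log_mean_prob = acc.value() - math.log(len(sampled_log_probs))
```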
Thanks for your help, as always

On Tue, Apr 7, 2020 at 5:15 PM Tiago de Paula Peixoto <[email protected]> wrote:

> Am 07.04.20 um 22:03 schrieb Deklan Webster:
> > To make sure I understand about the model selection, I ran each of the
> > four possibilities thrice for a fixed 1000 iterations
> >
> > ```
> > Directed: True
> > Num vertices 3672
> > Num edges 312779
> >
> > Entropy for degr_corr=True, poisson=True: 1044304.4179465043
> > Entropy for degr_corr=True, poisson=True: 1044708.7346586153
> > Entropy for degr_corr=True, poisson=True: 1044863.5150718079
> >
> > Entropy for degr_corr=False, poisson=True: 1094026.9370843305
> > Entropy for degr_corr=False, poisson=True: 1093860.5158280218
> > Entropy for degr_corr=False, poisson=True: 1095110.0929428462
> >
> > Entropy for degr_corr=True, poisson=False: 943149.5746761584
> > Entropy for degr_corr=True, poisson=False: 943741.5558787254
> > Entropy for degr_corr=True, poisson=False: 943772.2769395269
> >
> > Entropy for degr_corr=False, poisson=False: 1000768.068855249
> > Entropy for degr_corr=False, poisson=False: 998721.4409976124
> > Entropy for degr_corr=False, poisson=False: 999301.5197368631
> > ```
> >
> > So, is the following valid? Degree correction improves the result in
> > both cases, but the Poisson multigraph doesn't make an improvement.
> > So, for community detection I should just stick with a regular
> > degree-corrected NestedBlockState.
>
> The Poisson multigraph will rarely improve things when doing
> minimization, i.e. finding the most likely partition. This is because
> the mode of the distribution tends to be on simple graphs. However, it
> will improve things when averaging over partitions. But in that case you
> need to compute the evidence, not the description length, to compare
> models, which is more difficult to do (a simple approximation is shown
> in the documentation).
>
> Note that link prediction itself is a valid criterion for model
> selection, so you might just want to focus on that.
>
> > Does this have any implications for what is optimal for link
> > prediction?
>
> Averaging over partitions is always better for link prediction, and you
> need a model that works better when averaged. I would be very surprised
> if the Poisson model did not work better for link prediction in general.
>
> > Also, I gave `MeasuredBlockState.get_edges_prob` a shot. To try to get
> > the run-time down I only considered vertex pairs at a distance of 1 or 2
> > (in the directed sense). That gives about 6m pairs. On my reasonably
> > fast laptop each sampling iteration took 4 minutes. I just did 100
> > iterations total over about ~7 hours. Is it possible to speed this up?
>
> An approach which is linear in the number of nodes, rather than
> quadratic, is to collect the edge marginals, as explained in the
> documentation, rather than the actual conditional probabilities. In
> other words, you just count how often an edge is seen or not in the
> posterior distribution.
>
> The problem with this is that it will not resolve very small
> probabilities, since the edge would never be seen in such a case. But
> the whole approach would be much faster.
>
> > I tried to search for the code on Gitlab to check how it's implemented
> > but I'm getting a 500 error on every search (it's been like this for me
> > forever).
>
> I'm not sure what you mean. The site seems fine.
>
> > Am I likely to get substantially improved results with more
> > than 100 iterations?
>
> I have no idea.
>
> > To try to save memory I just added the log probs together at each step
> > and took their arithmetic mean at the end (as you can see). Is there a
> > more meaningful mean to use here that can be computed incrementally?
>
> You should use logsumexp():
>
> https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.logsumexp.html
>
> > Anyway, using only 100 sampling iterations and using the arithmetic mean
> > of the log probs, this feature ended up being the most important feature
> > (according to SHAP) by a pretty large margin. Seems to confirm what you
> > were speculating on Twitter. Here are two images illustrating that:
> >
> > On every test instance:
> > https://i.imgur.com/Cx9tLJu.png
> >
> > On top 100 most likely test instances:
> > https://i.imgur.com/bfopyQj.png
> >
> > For reference, the precision@100 here is 96%.
> >
> > So, pretty good even though it's by far the most time-costly feature to
> > compute. Superposed random walks and Katz index take < 1 minute each,
> > for instance.
>
> Well, there is no free lunch.
>
> Best,
> Tiago
>
> --
> Tiago de Paula Peixoto <[email protected]>
>
> _______________________________________________
> graph-tool mailing list
> [email protected]
> https://lists.skewed.de/mailman/listinfo/graph-tool
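P.S. The deg_corr part of the entropy comparison quoted above boils down to something like the following minimal sketch (description length only, with the latent Poisson variant omitted, and assuming `G` is the same graph as in the code above; note that newer graph-tool releases expect state_args=dict(deg_corr=...) instead of a direct deg_corr keyword):

```
# Minimal sketch: compare description lengths for the degree-corrected vs.
# non-degree-corrected nested SBM. The latent Poisson axis is omitted here.
import graph_tool.all as gt

for deg_corr in (True, False):
    # In older graph-tool releases deg_corr is a direct keyword; in newer ones
    # it has to be passed via state_args=dict(deg_corr=deg_corr).
    state = gt.minimize_nested_blockmodel_dl(G, deg_corr=deg_corr)
    print("deg_corr=%s: description length = %g" % (deg_corr, state.entropy()))
```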
_______________________________________________
graph-tool mailing list
[email protected]
https://lists.skewed.de/mailman/listinfo/graph-tool
