To make sure I understand the model selection, I ran each of the four
possibilities three times for a fixed 1000 iterations:
```
Directed: True
Num vertices 3672
Num edges 312779
Entropy for degr_corr=True, poisson=True: 1044304.4179465043
Entropy for degr_corr=True, poisson=True: 1044708.7346586153
Entropy for degr_corr=True, poisson=True: 1044863.5150718079
Entropy for degr_corr=False, poisson=True: 1094026.9370843305
Entropy for degr_corr=False, poisson=True: 1093860.5158280218
Entropy for degr_corr=False, poisson=True: 1095110.0929428462
Entropy for degr_corr=True, poisson=False: 943149.5746761584
Entropy for degr_corr=True, poisson=False: 943741.5558787254
Entropy for degr_corr=True, poisson=False: 943772.2769395269
Entropy for degr_corr=False, poisson=False: 1000768.068855249
Entropy for degr_corr=False, poisson=False: 998721.4409976124
Entropy for degr_corr=False, poisson=False: 999301.5197368631
```
So, is the following valid? Degree correction improves the result (lower
entropy) in both cases, but the Poisson multigraph doesn't make an
improvement. So, for community detection I should just stick with a regular
degree-corrected NestedBlockState. Does this have any implications for what
is optimal for link prediction?
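To spell out the comparison I'm making, here is a minimal sketch that takes
the smallest entropy found for each configuration and ranks them. The
numbers are copied from the output above; the variable names are just mine,
and I'm assuming the entropies of the four variants are directly comparable:
```python
# Entropies copied from the runs above, keyed by (degr_corr, poisson).
runs = {
    (True, True): [1044304.4179465043, 1044708.7346586153, 1044863.5150718079],
    (False, True): [1094026.9370843305, 1093860.5158280218, 1095110.0929428462],
    (True, False): [943149.5746761584, 943741.5558787254, 943772.2769395269],
    (False, False): [1000768.068855249, 998721.4409976124, 999301.5197368631],
}

# Smallest entropy found for each configuration.
best = {cfg: min(vals) for cfg, vals in runs.items()}

# Configurations ordered from lowest to highest entropy.
ranking = sorted(best, key=best.get)
winner = ranking[0]

for cfg in ranking:
    degr_corr, poisson = cfg
    gap = best[cfg] - best[winner]
    print(f"degr_corr={degr_corr!s:5} poisson={poisson!s:5} "
          f"min entropy={best[cfg]:.1f}  gap to best={gap:.1f}")
```
This puts degr_corr=True, poisson=False first, with the runner-up
(degr_corr=False, poisson=False) about 55,572 higher.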
Also, I gave `MeasuredBlockState.get_edges_prob` a shot. To try to get the
run-time down I only considered vertex pairs at a distance of 1 or 2 (in
the directed sense). That gives about 6 million pairs. On my reasonably
fast laptop each sampling iteration took about 4 minutes, so the 100
iterations I ran took about 7 hours in total. Is it possible to speed this
up? I tried to search for the code on GitLab to check how it's implemented,
but I'm getting a 500 error on every search (it's been like this for me
forever). Am I likely to get substantially improved results with more than
100 iterations?
```
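import numpy as np
import graph_tool.all as gt

# Running sum of the per-pair log-probabilities over all sampling iterations.
marginal_sums = np.zeros(len(vertex_pairs))

def collect_marginals(s):
    # get_edges_prob returns one log-probability per vertex pair;
    # accumulate in place so no nonlocal/global declaration is needed.
    edges_prob = s.get_edges_prob(vertex_pairs)
    np.add(marginal_sums, edges_prob, out=marginal_sums)

gt.mcmc_equilibrate(measured_state,
                    force_niter=SAMPLE_ITERS,
                    mcmc_args=dict(niter=10),
                    multiflip=True,
                    verbose=True,
                    callback=collect_marginals)

# Arithmetic mean of the log-probabilities.
marginal_sums = marginal_sums / SAMPLE_ITERS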
```
To try to save memory I just added the log probs together at each step and
took their arithmetic mean at the end (as you can see). Is there a more
meaningful mean to use here that can be computed incrementally?
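One candidate I can think of: the arithmetic mean of the log probs is the
log of the geometric mean of the probabilities, whereas averaging the
probabilities themselves can also be done incrementally with `np.logaddexp`,
using a single array of the same size as before. Here is a minimal sketch
with made-up values standing in for the output of `get_edges_prob`
(`update_log_sum` is a hypothetical helper of mine, not part of graph-tool).
I'm not sure whether this is the more appropriate estimate:
```python
import numpy as np

def update_log_sum(log_sum, log_probs):
    """Fold one sample of per-pair log-probabilities into a running log-sum-exp."""
    if log_sum is None:
        # First sample: start the accumulator from a copy of the input.
        return np.array(log_probs, dtype=float)
    # log(exp(a) + exp(b)), computed element-wise without overflow/underflow.
    return np.logaddexp(log_sum, log_probs)

# Toy stand-in for three sampling iterations over two vertex pairs
# (made-up probabilities, not real results).
samples = [np.log([0.2, 0.9]), np.log([0.4, 0.7]), np.log([0.6, 0.8])]

log_sum = None
for log_probs in samples:
    log_sum = update_log_sum(log_sum, log_probs)

# Log of the arithmetic mean of the probabilities, per vertex pair.
log_mean_prob = log_sum - np.log(len(samples))
mean_prob = np.exp(log_mean_prob)  # approximately [0.4, 0.8]
```
In the callback above, this would amount to replacing the `np.add` step with
`np.logaddexp` and subtracting `np.log(SAMPLE_ITERS)` at the end instead of
dividing.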
Anyway, using only 100 sampling iterations and using the arithmetic mean of
the log probs, this feature ended up being the most important feature
(according to SHAP) by a pretty large margin. Seems to confirm what you
were speculating on Twitter. Here are two images illustrating that:
On every test instance:
https://i.imgur.com/Cx9tLJu.png
On top 100 most likely test instances:
https://i.imgur.com/bfopyQj.png
For reference, the precision@100 here is 96%.
So, pretty good even though it's by far the most time-costly feature to
compute. Superposed random walks and Katz index take < 1 minute each, for
instance.
Thanks for your help, as always
On Sat, Apr 4, 2020 at 6:20 PM Tiago de Paula Peixoto <[email protected]>
wrote:
> Am 03.04.20 um 01:03 schrieb Deklan Webster:
> > Thanks for the quick reply,
> >
> >> The same model selection principles still apply.
> >
> > So, would it be meaningful to try out 4 possibilities: DC or not, latent
> > multigraph or not, and then compare the entropies?
>
> Yes.
>
> > I didn't see in the docs where it says MeasuredBlockState uses the
> > latent Poisson multigraph. I thought the latter is new but the former
> > has been in graph-tool for a while. Has the former been updated to always
> > use the latter?
>
> No, the measured models have always used latent multigraphs, as it's
> explained in the papers.
>
> > Will using MeasuredBlockState instead of LatentMultigraphBlockState
> > influence the community detection at all? In other words, if I'm
> > interested in predicting links and doing community detection (both as
> > accurately as possible) should I just use MeasuredBlockState all the
> > time?
>
> The latent Poisson model using LatentMultigraphBlockState is not meant
> for reconstruction, as it assumes there is no measurement error. When
> you take that into account it becomes MeasuredBlockState.
>
> > In the other thread you recommend I use
> > "MeasuredBlockState.get_edge_prob()", but in the example in the docs I'm
> > seeing this
> >
> > eprob = u.ep.eprob
> > print("Posterior probability of edge (11, 36):", eprob[u.edge(11, 36)])
> >
> > What's the difference?
>
> The former gives the conditional probability, and the latter the
> marginal probability.
>
> > Btw, there appears to be a typo in the docs for MeasuredBlockState. The
> > x_default in the call signature has a default value of 0, but in the
> > explanation below it says 1.
>
> The function signature is correct, I'll fix the docstring.
>
> Best,
> Tiago
>
> --
> Tiago de Paula Peixoto <[email protected]>
> _______________________________________________
> graph-tool mailing list
> [email protected]
> https://lists.skewed.de/mailman/listinfo/graph-tool
>