On 07.04.20 at 22:03, Deklan Webster wrote:
> To make sure I understand the model selection, I ran each of the
> four possibilities thrice for a fixed 1000 iterations
> 
> ```
> Directed: True
> Num vertices 3672
> Num edges 312779
> 
> Entropy for degr_corr=True, poisson=True: 1044304.4179465043
> Entropy for degr_corr=True, poisson=True: 1044708.7346586153
> Entropy for degr_corr=True, poisson=True: 1044863.5150718079
> 
> Entropy for degr_corr=False, poisson=True: 1094026.9370843305
> Entropy for degr_corr=False, poisson=True: 1093860.5158280218
> Entropy for degr_corr=False, poisson=True: 1095110.0929428462
> 
> Entropy for degr_corr=True, poisson=False: 943149.5746761584
> Entropy for degr_corr=True, poisson=False: 943741.5558787254
> Entropy for degr_corr=True, poisson=False: 943772.2769395269
> 
> Entropy for degr_corr=False, poisson=False: 1000768.068855249
> Entropy for degr_corr=False, poisson=False: 998721.4409976124
> Entropy for degr_corr=False, poisson=False: 999301.5197368631
> ```
> 
> So, is the following valid? Degree correction improves the result
> in both cases, but the Poisson multigraph doesn't make an improvement.
> So, for community detection I should just stick with a regular
> degree-corrected NestedBlockState.

The Poisson multigraph will rarely improve things when doing
minimization, i.e. finding the most likely partition. This is because
the mode of the distribution tends to be on simple graphs. However, it
will improve things when averaging over partitions. In that case,
though, you need to compute the evidence, not the description length,
to compare models, which is more difficult to do (a simple
approximation is shown in the documentation).
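
For reference, a comparison based on the description length alone can
be set up along the following lines. This is only a minimal sketch: `g`
is assumed to be your already-loaded directed graph, how `deg_corr` is
passed depends on the graph-tool version, and the Poisson variants
would use the corresponding latent multigraph state described in the
documentation.

```
import graph_tool.all as gt

# g is assumed to be an already-loaded directed gt.Graph.
for deg_corr in (True, False):
    # Fit a nested SBM by minimizing the description length; in recent
    # graph-tool versions deg_corr is forwarded via state_args.
    state = gt.minimize_nested_blockmodel_dl(
        g, state_args=dict(deg_corr=deg_corr))
    # entropy() returns the description length (in nats) of the fit.
    print(f"deg_corr={deg_corr}: {state.entropy()}")
```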

Note that link prediction itself is a valid criterion for model
selection, so you might just want to focus on that.

> Does this have any implications for
> what is optimal for link prediction?

Averaging over partitions is always better for link prediction, and you
need a model that works better when averaged. I would be very surprised
if the Poisson model did not work better for link prediction in general.

> Also, I gave `MeasuredBlockState.get_edges_prob` a shot. To try to get
> the run-time down I only considered vertex pairs at a distance of 1 or 2
> (in the directed sense). That gives about 6 million pairs. On my
> reasonably fast laptop each sampling iteration took 4 minutes, and I did
> 100 iterations total over roughly 7 hours. Is it possible to speed this up?

An approach which is linear in the number of nodes, rather than
quadratic, is to collect the edge marginals, as explained in the
documentation, rather than the actual conditional probabilities. In
other words, you just count how often an edge is seen or not in the
posterior distribution.

The problem with this is that it will not resolve very small
probabilities, since the edge would never be seen in such a case. But
the whole approach would be much faster.
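
As a rough sketch of that approach (following the reconstruction
example in the documentation; `state` is assumed to be a
`MeasuredBlockState` that has already been equilibrated):

```
import graph_tool.all as gt

u = None  # graph accumulating the marginal edge probabilities

def collect_marginals(s):
    global u
    # Counts how often each (non-)edge appears in the posterior samples.
    u = s.collect_marginal(u)

# Sample from the posterior and accumulate the edge marginals.
gt.mcmc_equilibrate(state, force_niter=1000, mcmc_args=dict(niter=10),
                    callback=collect_marginals)

# u.ep.eprob[e] is then the posterior probability that edge e exists.
eprob = u.ep.eprob
```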

> I tried to search for the code on Gitlab to check how it's implemented but
> I'm getting a 500 error on every search (it's been like this for me
> forever).

I'm not sure what you mean. The site seems fine.

> Am I likely to get substantially improved results with more
> than 100 iterations?

I have no idea.

> To try to save memory I just added the log probs together at each step
> and took their arithmetic mean at the end (as you can see). Is there a
> more meaningful mean to use here that can be computed incrementally?

You should use logsumexp():

https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.logsumexp.html
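
That is, to average the probabilities themselves (rather than the log
probabilities) you want the log of their mean, which can be accumulated
incrementally and stably in log space. A minimal sketch, where
`sample_logprob()` is a hypothetical stand-in for whatever returns the
conditional log-probability at each sampling iteration (in recent SciPy
versions the batch version lives in `scipy.special.logsumexp`):

```
import numpy as np

nsamples = 100
acc = -np.inf  # running log of the sum of the probabilities

for _ in range(nsamples):
    lp = sample_logprob()  # hypothetical: log-probability from one sample
    acc = np.logaddexp(acc, lp)  # log(exp(acc) + exp(lp)), computed stably

# Log of the arithmetic mean of the probabilities.
log_mean_prob = acc - np.log(nsamples)
```

The same works elementwise if `lp` is an array with one entry per
vertex pair.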

> Anyway, using only 100 sampling iterations and the arithmetic mean
> of the log probs, this feature ended up being the most important feature
> (according to SHAP) by a pretty large margin. Seems to confirm what you
> were speculating about on Twitter. Here are two images illustrating that:
> 
> On every test instance:
> https://i.imgur.com/Cx9tLJu.png
> 
> On top 100 most likely test instances:
> https://i.imgur.com/bfopyQj.png
> 
> For reference, the precision@100 here is 96%.
> 
> So, pretty good even though it's by far the most time-costly feature to
> compute. Superposed random walks and Katz index take < 1 minute each,
> for instance.

Well, there is no free lunch.

Best,
Tiago

-- 
Tiago de Paula Peixoto <[email protected]>
