> Let's pause and appreciate the learning opportunity. > After considerable amount of unsubstantiated speculation about bugs in the code, and even the underlying soundness of the algorithm, the problem turned out to be a basic misuse. One which was impossible to guess from the information given.
> It's also a learning on my part, in fact about something that I thought I already knew: It's a pointless exercise to try to troubleshoot something without a _minimal_ and _complete_ example at hand. I think you may have misunderstood me. Sorry if I wasn't communicating clearly. I made a typo in my second to last email swapping the numbers for "directed" and "undirected". I sent another email immediately after correcting the typo. Hope you saw that. I wasn't saying I'd been making a silly mistake this whole time (as far as I can tell...). I was saying that I accidentally set my (properly) directed graph to undirected and the performance skyrocketed. Unfortunately, that performance improvement was just the result of a mistake on my part. I didn't realize graph-tool's behavior for setting a graph to undirected was to turn reciprocal edges into 2 multiedges. So, when I removed the sampled edges I was only removing one of the two multiedges (in the fraction of the randomly selected which had a reciprocal). I.e., it was basically pure data leakage. Once I corrected that error by only sampling edges which don't have a reciprocal the performance between undirected/directed mostly disappeared. So, performance still bad. To ensure it wasn't just my network or something to do with directedness I tried several simple undirected graphs from graph-tool's collection (where I don't need to worry about multiedges). All performed poorly. Since I haven't seen this network reconstruction perform well on any of the graphs I've tested I've given up on this approach. `get_edges_prob` is working very well for me, so I'll just stick to that, despite its slowness. Although, this possibility occurred to me: I wonder how the reconstruction would do if I fed back the probabilities that I get from my binary classifier using the method described in the "Incorporating Extrinsic Uncertainty Estimates" section of your paper. I could regard it as a final step in a link prediction flow. Seems plausible to me that with this extra information the reconstruction might work better. I'm assuming no one has tried this yet? > The directed SBM does not generate reciprocal edges very well, hence it is not a good predictor in this case. Where can I read more about this? I assumed the SBM handles the directed case just fine, even if there are many reciprocal edges. Does this have implications for community detection as well? > That is in fact very similar to the conditional probability computed by get_edges_prob(). The latter is more accurate, yields an actual normalized probability, and includes information from the priors, but for sufficiently sparse and large graphs, both computations should coincide. Hmm, I didn't realize this. Thanks. Thanks for your help, as always On Thu, Apr 16, 2020 at 2:47 AM Tiago de Paula Peixoto <[email protected]> wrote: > Am 16.04.20 um 02:54 schrieb Deklan Webster: > > I, totally by accident, think I discovered the issue! In the process of > > setting up the testing between the reconstructive method, get_edges_prob > > method, and the simple stacked topological similarity classifier I > > accidentally set the graph to undirected without setting it back to > > directed. The results appeared to be reasonable finally. I've gone back > > and tested on the graph I've been testing on, and indeed: > > > > Undirected: ~13/1000 correctly identified > > Directed: 488/1000 correctly identified > > > > A giant improvement! Of course, I'm relieved since this solves my > > problem (I can just, as least, set to undirected to compute this). > > Let's pause and appreciate the learning opportunity. > > After considerable amount of unsubstantiated speculation about bugs in > the code, and even the underlying soundness of the algorithm, the > problem turned out to be a basic misuse. One which was impossible to > guess from the information given. > > It's also a learning on my part, in fact about something that I thought > I already knew: It's a pointless exercise to try to troubleshoot > something without a _minimal_ and _complete_ example at hand. > > > But, > > now I'm wondering what's going on. Is there some predictable reason why > > the directed case is performing so much worse? Something to do with > > directed graphs and equlibration? But if it were to do with equlibration > > why did this issue never affect `get_edges_prob`'s performance? Could > > there be a bug in how graph-tool is handling the directed case? In the > > paper it seems you've worked with the directed case before, so I assume > > you would have noticed if there were. > > I'm done with the wild guesses, this is not productive. > > > To be sure, I tested this on the "polblogs" graph as well. Undirected: > > reconstructive performance is great, directed: terrible. > > The polblogs network I actually have investigated a bit in the past. > Despite being directed, most edges of that network are actually > reciprocal, so in fact it is mostly an undirected network. The directed > SBM does not generate reciprocal edges very well, hence it is not a good > predictor in this case. The undirected model is a better fit. > > > I've read the documentation. The docstring says "Inverse temperature". > > The relevant section of the docs say, > > > >> An alternative is to use a greedy algorithm based on a merge-split > > MCMC with zero temperature [peixoto-merge-split-2020] > > < > https://graph-tool.skewed.de/static/doc/demos/inference/inference.html#peixoto-merge-split-2020 > >, > > which requires a single global state, and thus can reduce memory usage. > > > > Just from this I don't understand. Not setting beta to infinity has > > already reduced memory usage for me. I don't understand the significance > > of this being greedy or why it doesn't apply to my situation.If the > > explanation of all this is in one of the papers please point to it and > > I'll read it. > > The documentation assumes a basic understanding of how MCMC works, > without which a proper use of the the methods described is not possible, > unfortunately. > > The MCMC algorithm (for reconstruction) attempts to sample from a > distribution \propto P(A,b)^beta where A is the network, b is a > partition and beta is the inverse temperate parameter. Setting beta=1 > means sampling from P(A,b), whereas beta=infinity means finding a single > (A,b) that maximizes the distribution. The choice beta=infinity also > breaks ergodicity and detailed balance, making it a greedy heuristic, > rather than a proper MCMC. > > To compute marginal probabilities you need beta=1, i.e. sample from the > joint distribution, not maximize it. > > The point about reducing memory applied only when comparing to > minimize_blockmodel_dl(). > > >> This kind of comparison has been done already in > > https://arxiv.org/abs/1802.10582 and https://arxiv.org/abs/1909.07578. > > The SBM approach is the single best classifier among the over a hundred > > they consider, which is marginally beat only by a stacking of around 40 > > other predictors. > > > > I've read both these papers and referred to the latter twice in this > > thread. We've already discussed that these don't use either > > `get_edges_prob` nor the full reconstructive marginal edge probability > > > > On page 8 of the former, they're using this as their similarity for SBM: > > "s_ij = θ_i θ_j * l_gi,gj" > > That is in fact very similar to the conditional probability computed by > get_edges_prob(). The latter is more accurate, yields an actual > normalized probability, and includes information from the priors, but > for sufficiently sparse and large graphs, both computations should > coincide. > > Best, > Tiago > > -- > Tiago de Paula Peixoto <[email protected]> > > _______________________________________________ > graph-tool mailing list > [email protected] > https://lists.skewed.de/mailman/listinfo/graph-tool >
_______________________________________________ graph-tool mailing list [email protected] https://lists.skewed.de/mailman/listinfo/graph-tool
