Dear R users,
I'm a novice user of R and have absolutely no prior knowledge of social network
analysis, so apologies if my question is trivial. I've spent a lot of time
trying to solve this on my own without success, so I hope someone here can help
me out. Cheers!
The dataset:
I'm trying to predict the existence of links (TRUE or FALSE) in a test set
using a training set. Both data sets are in "edgelist" format: each row holds
two user IDs (the nodes), with the edge directed from the node in the 1st
column to the node in the 2nd column (see figure 1 below). Performance is
evaluated using the AUC, and I am looking for the best algorithm to predict the
existence of links in the test data (50% are true and the rest are false).
Figure 1:
> training
Vertices: 1133143
Edges: 999
Directed: TRUE
Edges:
[0] 105 -> 850956
[1] 105 -> 1073420
[2] 105 -> 1102667
[3] 165 -> 888346
[4] 165 -> 579649
[5] 165 -> 136665
etc..
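For the evaluation, my understanding is that the AUC only needs one score per
candidate pair plus its TRUE/FALSE label. Below is a minimal base-R sketch of
the rank-based (Mann-Whitney) form as I understand it. The function name auc
and the toy scores and labels are made up for illustration and are not from my
data.

```r
# Rank-based AUC: the probability that a randomly chosen TRUE pair
# receives a higher score than a randomly chosen FALSE pair.
auc <- function(score, label) {
  r <- rank(score)        # average ranks, so tied scores count as half
  n_pos <- sum(label)     # number of TRUE links
  n_neg <- sum(!label)    # number of FALSE links
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores for four candidate pairs and their true labels.
score <- c(0.9, 0.8, 0.4, 0.3)
label <- c(TRUE, FALSE, TRUE, FALSE)
auc_value <- auc(score, label)
```

So whatever method I end up with only has to produce one number per pair of
user IDs, which is exactly the part I'm stuck on.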
I'm having problems obtaining scores for the links / edges, as most of the
measures I've found return scores for the nodes. Examples of this are the
graph.knn and page.rank functions in igraph.
So my questions are:
1) What do I need to do to obtain scores for the links instead of the nodes
(I presume there is a data preparation step that I am missing)?
2) Which R package would be best for running the various techniques:
Jaccard index, Adamic-Adar, common neighbours, PropFlow, etc.?
3) How do I implement a supervised learning method such as random forest (I am
guessing I need to build a feature list, but again, how can I get the scores
for the edges)?
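To make question 1 concrete, here is a rough base-R sketch of what I think a
per-pair score looks like for common neighbours, the Jaccard index and
Adamic-Adar. The helper names (neighbours, pair_scores), the toy IDs, and the
choice to ignore edge direction when collecting neighbours are all my own
assumptions, and I don't know whether this is the standard way to do it.

```r
# Toy directed edge list; the IDs are made up for illustration.
train <- data.frame(
  from = c(1, 1, 2, 2, 3, 4),
  to   = c(2, 3, 3, 4, 4, 5)
)

# All nodes adjacent to v, ignoring edge direction (an assumption).
neighbours <- function(edges, v) {
  unique(c(edges$to[edges$from == v], edges$from[edges$to == v]))
}

# Three neighbourhood-based scores for one candidate pair (u, v), u != v.
pair_scores <- function(edges, u, v) {
  nu <- neighbours(edges, u)
  nv <- neighbours(edges, v)
  common <- intersect(nu, nv)
  either <- union(nu, nv)
  # Degree of each common neighbour, needed for Adamic-Adar.
  deg <- vapply(common, function(z) length(neighbours(edges, z)), numeric(1))
  c(
    common_neighbours = length(common),
    jaccard = if (length(either) > 0) length(common) / length(either) else 0,
    adamic_adar = sum(1 / log(deg))  # sum over common z of 1 / log(degree(z))
  )
}

# Score two hypothetical candidate pairs, one row of features per pair.
candidates <- data.frame(from = c(1, 1), to = c(4, 5))
scores <- t(mapply(function(u, v) pair_scores(train, u, v),
                   candidates$from, candidates$to))
features <- cbind(candidates, scores)
```

My guess is that a table like features, with a TRUE/FALSE label column added,
is the kind of input a random forest would need for question 3, but please
correct me if that is wrong. I also doubt that looping over pairs like this
would scale to my full data set.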
I hope I've explained my questions well, but do let me know if more
clarification is needed.
Thanks in advance
Eu Jin
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.