2010/5/5 Paul Houle <[email protected]>:
> Just some thoughts here:
>
> (1) Sometimes page links get repeated.  I think this is just because
> page A has N links to page B.  This doesn't have much semantic impact,
> but it does bulk up the files a bit (though less w/ bz2) and makes more
> work for my importer script

That can be important to some extent when computing the PageRank of the
Wikipedia graph, or when running other graph algorithms that measure the
proximity / relatedness of entities.

BTW, it would be great if the DBpedia project could compute and
distribute the PageRank or TunkRank [1] values for the DBpedia resources,
based on the data of the page links graph. This is a really good scoring
heuristic when performing fuzzy text queries on names that have several
homonymous matches.

[1] http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/

> (3) Might be nice to extract the anchor text together with the link,
> though then we're not talking about a triple anymore and have to put in
> some of those dreaded blank nodes...  I've been think about training
> decision rules for a namexer by capturing the text context that
> pagelinks occur in, but I'd have to write my own extractor to do that.

Or this could be extracted into an ad hoc CSV file, since I don't really
see the point of having those in a knowledge base / triple store. It is
nonetheless precious data for training machine-learning-based NLP models.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
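
A minimal sketch of why repeated page links can matter for PageRank,
assuming the page links are available as (source, target) pairs. The
`pagerank` function, its parameters and the toy graph are hypothetical
and not part of any DBpedia tooling. Repeated links are simply treated
as edge weights here, which is one possible reading and not necessarily
how a DBpedia-provided score would be computed.

```python
from collections import Counter, defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) page links.

    Repeated (source, target) pairs are kept as edge weights, so a page
    that links to B three times passes B three times the share that a
    single link would get.
    """
    weights = Counter(links)  # (src, dst) -> number of repeated links
    nodes = sorted({n for edge in weights for n in edge})
    out_total = defaultdict(int)  # src -> total outgoing weight
    for (src, _dst), w in weights.items():
        out_total[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * rank[src] * w / out_total[src]
        rank = new_rank
    return rank


# Hypothetical toy graph: A links to B three times and to C once.
links = [("A", "B"), ("A", "B"), ("A", "B"), ("A", "C"),
         ("B", "A"), ("C", "A")]
with_repeats = pagerank(links)
deduplicated = pagerank(sorted(set(links)))
```

With the repeated links kept, B ends up with a higher score than C; once
the links are deduplicated, B and C score the same. Whether the weighted
or the deduplicated variant is the better heuristic depends on the use
case.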
