2010/5/5 Paul Houle <[email protected]>:
> Just some thoughts here:
>
> (1) Sometimes page links get repeated.  I think this is just because
> page A has N links to page B.  This doesn't have much semantic impact,
> but it does bulk up the files a bit (though less w/ bz2) and makes more
> work for my importer script

The repeated links can matter to some extent when computing the
PageRank of the Wikipedia graph, or for other graph algorithms that
measure the proximity / relatedness of entities.
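
To illustrate why the repeats can matter, here is a minimal sketch of
a PageRank power iteration that keeps duplicate (source, target) pairs
as edge weights. The function name, the damping value of 0.85 and the
toy link list are assumptions made for the example; nothing here is
taken from the DBpedia extraction code.

```python
from collections import Counter, defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over an iterable of (source, target) pairs.

    Repeated pairs are kept as edge weights, so a page that links to B
    twice passes B twice the share it would pass through a single link.
    """
    weights = Counter(links)  # (src, dst) -> number of repetitions
    nodes = sorted({n for edge in weights for n in edge})
    out_weight = defaultdict(int)
    for (src, _dst), w in weights.items():
        out_weight[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Toy graph (hypothetical): page A links to B twice and to C once.
toy_links = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A"), ("C", "A")]
with_repeats = pagerank(toy_links)
deduplicated = pagerank(set(toy_links))
```

With the repeats kept, B ends up with a higher score than C; after
deduplication the two are tied. Whether the weighted variant is the
better relatedness signal is a modelling choice, not something this
sketch settles.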

BTW, it would be great if the DBpedia project could compute and
distribute PageRank or TunkRank [1] values for the DBpedia resources
based on the page links graph. This is a really good scoring heuristic
for ranking the candidates when a fuzzy text query on names returns
several homonymic matches.

[1] http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/

> (3) Might be nice to extract the anchor text together with the link,
> though then we're not talking about a triple anymore and have to put in
> some of those dreaded blank nodes...  I've been thinking about training
> decision rules for a namexer by capturing the text context that
> pagelinks occur in,  but I'd have to write my own extractor to do that.

Or this could be extracted into an ad hoc CSV file, since I don't
really see the point of having those in a knowledge base / triple
store, but this is precious data for training machine-learning-based
NLP models.
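
A minimal sketch of such a CSV export, assuming raw wikitext as input
and a deliberately simplified link pattern: it only handles plain
[[Target]] and [[Target|anchor]] links and ignores templates,
namespaces and section fragments. The column names, the 30-character
context window and the helper names are all hypothetical.

```python
import csv
import io
import re

# Matches [[Target]] and [[Target|anchor text]]; nested brackets are skipped.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(page_title, wikitext, window=30):
    """Yield (source, target, anchor, context) for each wiki link found."""
    for m in WIKILINK.finditer(wikitext):
        target = m.group(1).strip()
        # Without an explicit anchor, the target title is the visible text.
        anchor = (m.group(2) or target).strip()
        start = max(0, m.start() - window)
        end = min(len(wikitext), m.end() + window)
        context = wikitext[start:end].replace("\n", " ")
        yield page_title, target, anchor, context


def write_csv(rows, out):
    """Write the extracted rows to a file-like object, with a header row."""
    writer = csv.writer(out)
    writer.writerow(["source", "target", "anchor", "context"])
    writer.writerows(rows)


# Usage on a made-up snippet of wikitext.
sample = "He moved to [[Paris, Texas|Paris]] and later to [[London]]."
rows = list(extract_links("Some page", sample))
buffer = io.StringIO()
write_csv(rows, buffer)
```

Keeping one row per link occurrence preserves the repeats and the
surrounding text, which is the part a triple cannot carry without
blank nodes.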

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
