Just some thoughts here:

(1) Sometimes page links get repeated. I think this is just because page A has N links to page B. This doesn't have much semantic impact, but it does bulk up the files a bit (though less with bz2) and makes more work for my importer script.

(2) Some page links point to articles that don't exist. That's a good thing, because "broken links" are important to the whole wiki concept. Right now my system ignores that stuff, but I've done plenty of work with link analysis where you can get good insight about some set of documents S by looking at links to the expanded set of documents S' that includes the (real or imagined) documents referred to in S.

(3) It might be nice to extract the anchor text together with the link, though then we're not talking about a triple anymore and have to put in some of those dreaded blank nodes... I've been thinking about training decision rules for a namexer by capturing the text context that pagelinks occur in, but I'd have to write my own extractor to do that.
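For what it's worth, the duplicate-link issue in (1) is easy to handle in an importer with a first-seen dedup pass. This is just a sketch, assuming the links arrive as (source, target) pairs:

```python
def dedup_links(links):
    """Drop repeated (source, target) page-link pairs, keeping first-seen order."""
    seen = set()
    out = []
    for pair in links:
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out

# Page "A" links to "B" twice; only the first pair survives.
links = [("A", "B"), ("A", "B"), ("A", "C")]
print(dedup_links(links))  # -> [('A', 'B'), ('A', 'C')]
```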
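To make the S / S' idea in (2) concrete: the expanded set is just S plus every link target reachable from S, whether or not the target article actually exists. A minimal sketch:

```python
def expand(docs, links):
    """S' = S plus every page linked from a page in S,
    including targets that don't exist as articles ("red links")."""
    targets = {target for (source, target) in links if source in docs}
    return docs | targets

S = {"A", "B"}
links = [("A", "B"), ("A", "RedLink"), ("C", "D")]
# "RedLink" ends up in S' even if no such article exists;
# ("C", "D") is ignored because "C" is outside S.
print(expand(S, links))
```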
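And on (3): one way the blank-node version could look is a per-link node carrying both the target and the anchor text. The property URIs below are made up for illustration, not anything DBpedia actually emits:

```python
from itertools import count

_bnode_ids = count()

def link_with_anchor(source, target, anchor):
    """Emit N-Triples for a page link that carries its anchor text,
    using a blank node to hold the (target, anchor) pair.
    The example.org vocabulary here is hypothetical."""
    b = f"_:link{next(_bnode_ids)}"
    return [
        f"<http://example.org/page/{source}> <http://example.org/prop/pageLink> {b} .",
        f"{b} <http://example.org/prop/linkTarget> <http://example.org/page/{target}> .",
        f'{b} <http://example.org/prop/anchorText> "{anchor}" .',
    ]

for triple in link_with_anchor("A", "B", "apple pie"):
    print(triple)
```

The cost is exactly the one mentioned above: a single pagelink becomes three triples instead of one.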
------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
