Just some thoughts here:

(1) Sometimes page links get repeated.  I think this is just because
page A contains N links to page B.  The duplicates don't have much
semantic impact, but they do bulk up the files a bit (less so with
bz2) and make more work for my importer script.
(2) Some page links point to articles that don't exist.  That's a good
thing, because "broken links" are important to the whole wiki concept.
Right now my system ignores those links, but I've done plenty of link
analysis where you get good insight into a set of documents S by
looking at links into the expanded set S', which also includes the
(real or imagined) documents that the documents in S refer to.
(3) It might be nice to extract the anchor text together with the link,
though then we're not talking about a triple anymore and have to put in
some of those dreaded blank nodes...  I've been thinking about training
decision rules for a namexer by capturing the text context that
pagelinks occur in, but I'd have to write my own extractor to do that.
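
To make (1) and (2) concrete, here is a rough Python sketch of the kind of pass I have in mind.  It is not my actual importer.  It assumes the page links file is N-Triples with one <subject> <predicate> <object> . statement per line, and the names parse_links, summarize, read_bz2 and known_pages are made up for illustration.

```python
import bz2
import re
from collections import Counter

# Matches one N-Triples line of the form <subject> <predicate> <object> .
# (assumption: page links are IRI-to-IRI triples, one per line)
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def parse_links(lines):
    """Yield (source, target) pairs, skipping lines that are not triples."""
    for line in lines:
        m = TRIPLE.match(line)
        if m:
            yield m.group(1), m.group(3)


def summarize(links, known_pages):
    """Collapse repeated links and find targets that are not known pages.

    known_pages is a hypothetical set of IRIs for articles that exist (S).
    """
    counts = Counter(links)            # (source, target) -> times seen
    unique = set(counts)               # each link kept exactly once
    targets = {target for _, target in unique}
    missing = targets - known_pages    # the "broken link" targets
    expanded = known_pages | targets   # S' = S plus everything S links to
    return counts, unique, missing, expanded


def read_bz2(path):
    """Stream text lines from a bz2-compressed dump without unpacking it."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        yield from f
```

Usage would be something like summarize(parse_links(read_bz2(path)), known_pages), where path points at the compressed dump.  Keeping the counts around means the "A links to B N times" information isn't lost even though each link is only imported once.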

------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion