Roberto Mirizzi wrote:
>
> We ask search engines through their APIs, looking for some 
> co-occurrence-like/google-similarity-distance-like measure between two 
> DBpedia resources.
>
>   
    There are two fun things I've done with "link" data sets.

    One of them is to treat the in-links and out-links of 
documents as a 'vector' and compute vector space distances.  I've done 
this with scientific papers [arxiv.org].  When the local fan-in and 
fan-out are high,  this can be a good way to find 'related documents'.
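
    To make the vector idea concrete,  here is a minimal sketch in 
Python.  It is not the code I actually ran: the cosine measure,  the 
function names and the toy graph are all assumptions chosen for 
illustration.  Each document becomes a sparse vector keyed by 
(direction, neighbour),  so an in-link and an out-link to the same 
neighbour stay distinct.

```python
from math import sqrt

def invert(out_links):
    """Derive in-links (target -> list of sources) from out-links."""
    in_links = {}
    for source, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, []).append(source)
    return in_links

def link_vector(doc, out_links, in_links):
    """Sparse link vector for one document, keyed by (direction, neighbour)."""
    vec = {}
    for target in out_links.get(doc, ()):
        key = ("out", target)
        vec[key] = vec.get(key, 0) + 1
    for source in in_links.get(doc, ()):
        key = ("in", source)
        vec[key] = vec.get(key, 0) + 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy citation graph (invented for illustration, not real data).
out_links = {
    "paperA": ["ref1", "ref2"],
    "paperB": ["ref1", "ref2"],
    "paperC": ["ref3"],
    "review": ["paperA", "paperB"],
}
in_links = invert(out_links)

vec_a = link_vector("paperA", out_links, in_links)
vec_b = link_vector("paperB", out_links, in_links)
vec_c = link_vector("paperC", out_links, in_links)

sim_ab = cosine(vec_a, vec_b)  # same references, cited by the same review
sim_ac = cosine(vec_a, vec_c)  # no links in common
```

    With only one or two links per document the vectors rarely 
overlap at all,  which is why the fan-in and fan-out matter so much.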

    Developing the NY Pictures site,  I generated a list of "topics 
related to NYC" and then computed two aggregates over link targets:

(i) incoming links from all other pages,
(ii) incoming links from the NYC topic

    Television networks (many of which were based in NYC) were very 
strong topics by measure (i) but were much less strong by measure (ii).  
The New York Times was still very strong when looking only at links from 
'local' topics,  I think because the NYT was often cited to support WP 
articles.
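
    The two aggregates can be sketched roughly as follows,  assuming 
the links are available as (source, target) pairs and the NYC topic is 
a set of page names.  The function name,  the page names and the edge 
list are made up for illustration and are not the real counts.

```python
from collections import Counter

def inlink_counts(links, topic_pages):
    """Count in-links per target: (i) from all pages, (ii) from topic pages only."""
    global_in = Counter()
    local_in = Counter()
    for source, target in links:
        if source == target:
            continue  # ignore self-links
        global_in[target] += 1
        if source in topic_pages:
            local_in[target] += 1
    return global_in, local_in

# Invented edge list: a target that is popular everywhere versus one
# that is cited mostly from inside the topic.
links = [
    ("Page1", "TV_Network"),
    ("Page2", "TV_Network"),
    ("Page3", "TV_Network"),
    ("Brooklyn", "TV_Network"),
    ("Brooklyn", "NY_Times"),
    ("Manhattan", "NY_Times"),
    ("Page1", "NY_Times"),
]
topic_pages = {"Brooklyn", "Manhattan"}

global_in, local_in = inlink_counts(links, topic_pages)
for target in sorted(global_in):
    print(target, global_in[target], local_in[target])
```

    Comparing the two columns per target is what separates topics 
that are merely popular overall from topics that matter locally.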

    Doing the vector space stuff I was talking about above,  I found 
that high local fan-in and fan-out were critical to making it work.  In 
principle it should work very poorly for scientific papers that are 
cited only two or three times [most of them],  but it does better than 
you'd think,  since some papers are strongly correlated in how they get 
cited [one of mine is always cited together with another paper that 
came out in the same issue of Physical Review].

    I think looking for link density between categories,  spatial 
regions,  types,  etc should be a lot of fun and fruitful since the 
statistics are going to be better.  Of course,  the first thing you see 
are the crazy outliers...
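
    One possible way to define "link density" between two groups is 
the number of links from pages in group A to pages in group B,  divided 
by the number of possible (A, B) page pairs.  That definition is an 
assumption on my part,  as are the names and the toy data in this 
sketch.

```python
from collections import Counter

def category_link_density(links, category_of):
    """Links from category A to category B, divided by |A| * |B|."""
    sizes = Counter(category_of.values())
    counts = Counter()
    for source, target in links:
        if source in category_of and target in category_of:
            counts[(category_of[source], category_of[target])] += 1
    return {
        pair: n / (sizes[pair[0]] * sizes[pair[1]])
        for pair, n in counts.items()
    }

# Invented pages and categories, for illustration only.
category_of = {"a1": "A", "a2": "A", "b1": "B"}
links = [("a1", "b1"), ("a2", "b1"), ("a1", "a2")]

density = category_link_density(links, category_of)
```

    Pooling links over a whole category means each estimate rests on 
many more links than any single page has,  which is the sense in which 
the statistics get better.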



------------------------------------------------------------------------------

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
