Roberto Mirizzi wrote:
>
> We query search engines through their APIs, looking for some
> co-occurrence-like / Google-similarity-distance-like measure between
> two DBpedia resources.
>
>
There are two fun things I've done with "link" data sets.
One of them is to treat the in-links and out-links of a document
as a 'vector' and compute vector space distances. I've done this with
scientific papers [arxiv.org]. When the local fan-in and fan-out are
high, this can be a good way to find 'related documents'.
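
If it helps, here's a minimal sketch of the idea in Python -- the link
data and document names are a made-up toy example, not the actual
arxiv pipeline:

    # Represent each document by the set of its in-links and out-links,
    # then compare two documents with cosine similarity over those sets.
    from collections import defaultdict
    from math import sqrt

    # out_links[doc] = set of documents that doc links to (toy data)
    out_links = {
        "A": {"B", "C", "D"},
        "B": {"C", "D"},
        "E": {"C", "D", "F"},
    }

    # invert to get in-links
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for tgt in targets:
            in_links[tgt].add(src)

    def link_vector(doc):
        # union of in- and out-links as a sparse 0/1 vector; tag each
        # dimension so an in-link and an out-link to the same node differ
        return ({("out", t) for t in out_links.get(doc, set())} |
                {("in", s) for s in in_links.get(doc, set())})

    def cosine(doc1, doc2):
        v1, v2 = link_vector(doc1), link_vector(doc2)
        if not v1 or not v2:
            return 0.0
        return len(v1 & v2) / sqrt(len(v1) * len(v2))

    print(cosine("A", "E"))  # docs sharing many link neighbors score high
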
Developing the NY Pictures site, I generated a list of "topics
related to NYC" and then computed two aggregates over link targets:
(i) incoming links from all other pages, and
(ii) incoming links from the NYC-related topics only.
Television networks (many of which were based in NYC) were very
strong topics by measure (i) but much less strong by measure (ii).
The New York Times was still very strong even looking only at links
from 'local' topics, I think because the NYT is so often cited to
support WP articles.
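
Roughly, the two aggregates amount to something like this (the link
pairs and topic list below are placeholders, not the real data):

    # For each candidate topic, count (i) in-links from anywhere and
    # (ii) in-links only from a whitelist of NYC-related pages.
    from collections import Counter

    # (source_page, target_page) link pairs -- toy data
    links = [
        ("Some_Article", "NBC"),
        ("Another_Article", "NBC"),
        ("Manhattan", "NBC"),
        ("Manhattan", "The_New_York_Times"),
        ("Brooklyn", "The_New_York_Times"),
        ("Some_Article", "The_New_York_Times"),
    ]
    nyc_topics = {"Manhattan", "Brooklyn", "Queens"}

    global_in = Counter(t for _, t in links)                     # measure (i)
    local_in  = Counter(t for s, t in links if s in nyc_topics)  # measure (ii)

    for topic in global_in:
        print(topic, "all pages:", global_in[topic],
              "NYC pages only:", local_in[topic])
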
Doing the vector space stuff I was talking about above, I found
that the local fan-in and fan-out were critical for it to work well...
In principle it should work very poorly for scientific papers that are
cited only two or three times [most of them], but it does better than
you'd think, since there are strong positive correlations in how
papers get cited [one of mine is always cited together with another
paper that came out in the same issue of Physical Review].
I think looking for link density between categories, spatial
regions, types, etc. should be a lot of fun and fruitful, since the
statistics are going to be better. Of course, the first things you see
are the crazy outliers...
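
A crude version of that link-density measure could be as simple as the
ratio of observed to possible links between the members of two
categories (the categories and links here are hypothetical):

    # Link density between two categories = links actually observed
    # between their members / number of possible member pairs.
    cat_a = {"Empire_State_Building", "Chrysler_Building"}
    cat_b = {"Art_Deco", "Skyscraper"}

    links = {
        ("Empire_State_Building", "Art_Deco"),
        ("Empire_State_Building", "Skyscraper"),
        ("Chrysler_Building", "Art_Deco"),
    }

    observed = sum(1 for a in cat_a for b in cat_b if (a, b) in links)
    possible = len(cat_a) * len(cat_b)
    print("link density:", observed / possible)  # 3 of 4 pairs -> 0.75
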