Hello Peter,

Some minor comments inline.

On Thu, Aug 31, 2017 at 3:39 AM, Peter F. Patel-Schneider <pfpschnei...@gmail.com> wrote:

> Well, what I was trying to do was to figure out just which of the DBpedia
> files I need to combine to get a maximal set of useful, high-quality data.
>
> I had thought that this should be easy.  However, it is not.
>
> First there is the problem of getting the file table in the dataset section
> to show up at all.
>
> There is also the question of whether to look in the core directory or the
> core-i18n directory.  I guess that the core-i18n directory is the place to
> look, because the files in the dataset section of
> http://wiki.dbpedia.org/downloads-2016-10 are all from there.
>
> Then there is the question of whether to use the canonicalized names or the
> localized names.  There are warnings that the files using canonicalized
> names may be missing some information.  But how much information is
> missing?
> Every useful Wikipedia page has a Wikidata item for it, so it seems at first
> that there are no missing Wikipedia items.  But then I remembered that
> pages
> with multiple mapped infoboxes will produce multiple DBpedia items, so I
> guess that these are not present.  But how many of these are there?  My
> guess is not many, and the benefits of the canonicalized names outweigh the
> effect of missing some information.
>

What is missing from the canonicalized datasets is mainly:
 - redirects (~7.7M resources)
 - the new NLP/NIF datasets
 - most of the category-based datasets (last time I checked, most categories
do not have a Wikidata item; ~1.7M SKOS concepts in total)
 - extracted template metadata

Also, as you noted, some infoboxes produce multiple intermediate resources
that do not have a Wikidata item (~1.7M resources for English).

I think I missed a few, but IMHO these are the most important.
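
If you want a rough feel for how much is missing, one sketch is to compare
the number of distinct subjects in a localized dump with its canonicalized
(_wkd_uris) counterpart. The file names below assume the 2016-10 layout, and
the dumps are N-Triples (one triple per line), so the subject is the first
space-delimited field:

  # distinct subjects in the localized vs. the canonicalized English dump
  bzcat mappingbased_objects_en.ttl.bz2 \
    | grep -v '^#' | cut -d' ' -f1 | sort -u | wc -l
  bzcat mappingbased_objects_wkd_uris_en.ttl.bz2 \
    | grep -v '^#' | cut -d' ' -f1 | sort -u | wc -l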


> Then there is the question of whether simple or commons is the way to go,
> as one of them might have been in the past.  I guess not, because the
> canonicalized names provide better integration.
>
> Then there is the question of whether to use only mapping-based information
> or to include other information.  As I'm interested in high-quality
> information, I chose mapping-based information only.  Then there is the
> question of how to get all the mapping-based information.  My guess is that
> I need "Mappingbased Literals" and "Mappingbased Objects" which should be
> adequate to pick up all the non-instance triples based on their
> descriptions.  However, I guess that I also need "Geo Coordinates
> Mappingbased" but that I don't need "Specific Mappingbased Properties".
> Then I guess that I also need "Instance Types" and "Instance Types
> Transitive".  I also want labels of the information, so I guess I need
> "Labels" and labels_nmw, whereever that is.
>
> Then there is the question of which languages to include.  My guess is all
> of them, as I'm using the canonicalized names and the mapping-based results
> so everything should combine together correctly.  If I get some duplicates
> (e.g., from labels) that should be benign.
>

Note that besides label duplicates you will most probably get a lot of other
duplication that you need to deal with,
e.g. different birthdates between DBpedia EL, NL & Wikidata:
https://gist.github.com/jimkont/01f6add8527939c39192bcb3f840eca0
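
A minimal sketch of how you could spot these, assuming the canonicalized
mappingbased-literals dumps of a few editions are already in place (file
names follow the 2016-10 convention; verify them first). Exact duplicate
triples collapse under sort -u, so any subject still listed twice carries
genuinely conflicting values:

  # subjects with more than one distinct dbo:birthDate across editions
  bzcat mappingbased_literals_wkd_uris_en.ttl.bz2 \
        mappingbased_literals_wkd_uris_el.ttl.bz2 \
        mappingbased_literals_wkd_uris_nl.ttl.bz2 \
    | grep '<http://dbpedia.org/ontology/birthDate>' \
    | sort -u \
    | cut -d' ' -f1 | uniq -d

The same sort -u over the concatenated dumps is also all it takes to drop
the benign exact duplicates you mention.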

The DBpedia team is working on a fused DBpedia version that will try to
consolidate these differences.

> So I tried
> wget -nc -r -np --cut-dirs=3 -A \
>   "*mappingbased*_wkd_*.ttl.bz2","instance_types*wkd_uris_*.ttl.bz2","labels_wkd_uris_*.ttl.bz2","*labels_nmw_*.ttl.bz2" \
>   http://downloads.dbpedia.org/2016-10/core-i18n/
> which seems to do the trick, but I'm not very confident that I have
> downloaded everything I need.
>

Looks like a good pick, but high quality is fitness for use; e.g., as you
said, some people would use the geo data as well.
What you might also consider is the `homepages` and `images` datasets, if
you are interested in those.
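
If you do, the same wget call should fetch them with two more accept
patterns. I am assuming here that they follow the same *_wkd_uris_* naming
convention as the other canonicalized files, so check the directory listing
first:

  wget -nc -r -np --cut-dirs=3 \
    -A "homepages*wkd_uris_*.ttl.bz2,images*wkd_uris_*.ttl.bz2" \
    http://downloads.dbpedia.org/2016-10/core-i18n/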

In addition to that, the DataID metadata contains the suggested named graph
that each dataset (in each language) should go into, but it is more aligned
with the data that is loaded in the main endpoint.
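
For example, to peek at the graph hints for English (the DataID path below
is my assumption of the usual per-language layout, so double-check it):

  # fetch the English DataID and look at the graph-related statements
  curl -s http://downloads.dbpedia.org/2016-10/core-i18n/en/2016-10_dataid_en.ttl \
    | grep -i graph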

Best,
Dimitris

-- 
Kontokostas Dimitris