added to mailing list
---------- Forwarded message ----------
From: Jona Christopher Sahnwaldt <[email protected]>
Date: Thu, Jun 27, 2013 at 9:47 AM
Subject: Re: further Questions for Langlinks extraction
To: Dimitris Kontokostas <[email protected]>
Cc: Hady elsahar <[email protected]>, Sebastian Hellmann <[email protected]>
Hi Hady, all,
shouldn't we move this discussion to dbpedia-developers?
I think there are several questions:
1. Do we still need all the inter-language link files that we had in 3.8?
2. How do we generate the files that we want?
3. Which URIs do we use?
In 3.8, we generated these files:
---- interlanguage_links_same_as_{language}.{nt,ttl}.bz2
Contains only IL links that go both ways between two languages (which
usually means that the two articles actually are about the same
thing). This is the most important IL link file; we definitely need it.
In 3.8, we had to analyze all the IL links ourselves. Now Wikidata has
done that for us, so we can generate this file from the Wikidata info.
Example line from interlanguage_links_same_as_en.ttl.bz2 :
<http://dbpedia.org/resource/Australasia>
<http://www.w3.org/2002/07/owl#sameAs>
<http://ar.dbpedia.org/resource/أسترالاسيا> .
Example line from interlanguage_links_same_as_de.ttl.bz2 :
<http://de.dbpedia.org/resource/Australasien>
<http://www.w3.org/2002/07/owl#sameAs>
<http://ar.dbpedia.org/resource/أسترالاسيا> .
---- interlanguage_links_see_also_{language}.{nt,ttl}.bz2
Contains only IL links that do NOT go both ways between two languages
(which usually means that the two articles actually are not really
about the same thing). Not very important, but nice to have, and it's
easy to generate from Wikipedia.
Again, Wikidata did the job of analyzing the IL links. We can generate
this file from the IL links that are left in Wikipedia pages.
Example line from interlanguage_links_see_also_en.ttl.bz2 :
<http://dbpedia.org/resource/Media>
<http://www.w3.org/2000/01/rdf-schema#seeAlso>
<http://ar.dbpedia.org/resource/وسائط> .
Example line from interlanguage_links_see_also_de.ttl.bz2 :
<http://de.dbpedia.org/resource/Buddhismus>
<http://www.w3.org/2000/01/rdf-schema#seeAlso>
<http://az.dbpedia.org/resource/Qaumata_(Budda)> .
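To make the generation concrete: given the raw IL links as
(source, target) URI pairs, the see_also links are just the pairs whose
reverse pair is missing. A rough, untested Scala sketch (the input
format and names are made up, just to illustrate):

object SeeAlsoLinks {
  def main(args: Array[String]): Unit = {
    // assumed input: one tab-separated "sourceURI <TAB> targetURI" pair
    // per line, extracted from the IL links left in Wikipedia pages
    val pairs = scala.io.Source.fromFile(args(0)).getLines()
      .map(_.split("\t"))
      .collect { case Array(s, t) => (s, t) }
      .toSet

    // a link is see_also if the link in the other direction is missing
    for ((s, t) <- pairs if !pairs((t, s)))
      println(s"<$s> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <$t> .")
  }
}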
---- interlanguage_links_{language}.{nq,nt,tql,ttl}.bz2
All IL links extracted from Wikipedia articles, using property
http://dbpedia.org/ontology/wikiPageInterLanguageLink. I think we
don't need this file anymore - only a few IL links are left
in Wikipedia, and all the information in this file can be
reconstructed from the same_as and see_also files.
Example line from interlanguage_links_en.ttl.bz2 :
<http://dbpedia.org/resource/Albedo>
<http://dbpedia.org/ontology/wikiPageInterLanguageLink>
<http://ar.dbpedia.org/resource/بياض> .
---- interlanguage_links_same_as_chapters_{language}.{nt,ttl}.bz2
---- interlanguage_links_see_also_chapters_{language}.{nt,ttl}.bz2
These are versions of the same_as and see_also files that only contain
URIs that can be dereferenced, i.e. only languages that already have a
DBpedia chapter, for example de.dbpedia.org. We should still generate
these. For 3.8, they were generated using ProcessInterLanguageLinks.
Now we'll have to write a new script.
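The new script should be straightforward: filter the full same_as and
see_also files, keeping only triples whose subject and object both live
on a chapter host. A rough, untested sketch (the chapter list here is
made up, just for illustration):

object ChapterFilter {
  // assumption: hosts of languages that already have a DBpedia chapter
  val chapters = Set("dbpedia.org", "de.dbpedia.org", "fr.dbpedia.org")

  def host(uri: String): String =
    uri.stripPrefix("<http://").takeWhile(_ != '/')

  def main(args: Array[String]): Unit =
    for (line <- scala.io.Source.fromFile(args(0)).getLines()) {
      val parts = line.split(" ")
      // keep a triple only if subject and object are both dereferenceable
      if (parts.length >= 3 && chapters(host(parts(0))) && chapters(host(parts(2))))
        println(line)
    }
}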
Ok, so much for the old files. Where do we go from here?
In addition, we may produce new files, at least as intermediate steps,
but we could also offer them on the download server.
I think we should produce one large dataset that contains ALL IL links
from Wikidata. We can then produce all other same_as files from this
one, using a Scala script, or even just bash tools.
The extraction process comes down to this:
1. process Wikidata info, generate master IL links file.
2. produce language-specific same_as files from the master IL links file,
using a Scala script (a rough sketch follows the example lines below).
(Not related to Wikidata, so it doesn't concern Hady: 3. produce
see_also files from IL links left in Wikipedia pages.)
Example lines for the master IL links file (using "..." because
subject URIs are not yet decided - see below):
<.../Q64> <http://www.w3.org/2002/07/owl#sameAs>
<http://dbpedia.org/resource/Berlin> .
<.../Q64> <http://www.w3.org/2002/07/owl#sameAs>
<http://de.dbpedia.org/resource/Berlin> .
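Step 2 should then be easy: group the master file by subject, pick the
URI for the target language, and emit pairwise sameAs triples. A rough,
untested sketch (not the framework's actual API; a real run over the
full dump would stream instead of loading everything into memory):

object SameAsFromMaster {
  // language of a DBpedia resource URI, e.g. "de" for
  // <http://de.dbpedia.org/resource/...>
  def lang(uri: String): String = {
    val host = uri.stripPrefix("<http://").takeWhile(_ != '/')
    if (host == "dbpedia.org") "en" else host.takeWhile(_ != '.')
  }

  def main(args: Array[String]): Unit = {
    val target = args(0) // e.g. "en" for interlanguage_links_same_as_en
    val triples = scala.io.Source.fromFile(args(1)).getLines()
      .map(_.split(" "))
      .collect { case Array(s, _, o, _*) => (s, o) }
      .toSeq

    for {
      (_, group) <- triples.groupBy(_._1)  // one group per Wikidata item
      uris = group.map(_._2)
      subj <- uris.find(lang(_) == target) // the target language's URI
      obj <- uris if obj != subj
    } println(s"$subj <http://www.w3.org/2002/07/owl#sameAs> $obj .")
  }
}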
Now, the last question is: which URIs do we use?
URIs like http://www.wikidata.org/entity/Q64 are dereferenceable, but
apparently not yet in a way suitable for Linked Data, i.e. Wikidata is
not yet handling HTTP redirects and Accept: headers properly.
Sebastian ran a few quick checks during the telco.
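For anyone who wants to repeat the check, something like this should
show the problem (untested sketch; with proper Linked Data handling we
would expect a 303 redirect to an RDF document):

import java.net.{HttpURLConnection, URL}

object LinkedDataCheck {
  def main(args: Array[String]): Unit = {
    val conn = new URL("http://www.wikidata.org/entity/Q64")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setInstanceFollowRedirects(false) // show the redirect itself
    conn.setRequestProperty("Accept", "application/rdf+xml")
    println("status:   " + conn.getResponseCode)
    println("location: " + conn.getHeaderField("Location"))
  }
}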
We could wait for Wikidata to make their server compatible with Linked
Data. I would guess that it's not a big deal - probably a few Apache
settings. In this case, we could simply use the Wikidata URIs like
http://www.wikidata.org/entity/Q64.
Or we could use our own URIs, because our servers are already
configured for Linked Data. Either
http://data.dbpedia.org/resource/Q64 or something like
http://dbpedia.org/wikidata/Q64. (http://dbpedia.org/data/ is already
used for other stuff.)
Simply using the Wikidata URIs seems preferable.
Cheers,
JC
On 27 June 2013 08:34, Dimitris Kontokostas <[email protected]> wrote:
> Hi Hady,
>
> here we have 2 options:
>
> 1)
> <http://www.wikidata.org/entity/Q1000> owl:sameAs dbpedia:Gabon
> <http://www.wikidata.org/entity/Q1000> owl:sameAs dbpedia-nl:Gabon
> <http://www.wikidata.org/entity/Q1000> owl:sameAs dbpedia-XX:Gabon
>
> 2)
> dbpedia:Gabon owl:sameAs <http://www.wikidata.org/entity/Q1000>
> dbpedia:Gabon owl:sameAs dbpedia-nl:Gabon
> dbpedia:Gabon owl:sameAs dbpedia-XX:Gabon
>
> IIRC, we decided on the second approach, right?
>
>
> On Thu, Jun 27, 2013 at 8:47 AM, Hady elsahar <[email protected]> wrote:
>>
>> Hi All,
>>
>> I've downloaded the 1K dump files and played around with them and the
>> scripts already in the extraction framework, and I have some questions.
>>
>> What exactly is the format we want for the language links dumps (what
>> exactly should they contain)? I downloaded the langlinks dump for
>> DBpedia and I'm not sure if we want it to be the same.
>>
>> In DBpedia, the language links file contains something like
>>
>> <http://als.dbpedia.org/resource/Albedo>
>> <http://fr.dbpedia.org/resource/Albedo>
>> <http://gl.dbpedia.org/resource/Albedo>
>>
>> This doesn't exist in Wikidata; there is only
>> <http://www.wikidata.org/entity/Q1000>
>>
>> And the structure in Wikidata is a little bit different, for example:
>>
>> <http://nl.wikipedia.org/wiki/Gabon> <http://schema.org/about>
>> <http://www.wikidata.org/entity/Q1000> .
>> <http://nl.wikipedia.org/wiki/Gabon> <http://schema.org/inLanguage> "nl" .
>>
>> <http://en.wikipedia.org/wiki/Gabon> <http://schema.org/about>
>> <http://www.wikidata.org/entity/Q1000> .
>> <http://en.wikipedia.org/wiki/Gabon> <http://schema.org/inLanguage> "en" .
>>
>> So is it already decided how we want the dump to be, or should we just
>> decide that?
>>
>>
>> thanks
>> Regards
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile University
>>
>> email : [email protected]
>> Phone : +2-01220887311
>> http://hadyelsahar.me/
>>
>>
>>
>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage: http://aksw.org/DimitrisKontokostas